KING Tutorial: Ancestry Inference
The ancestral group of each study sample can be identified reliably and rapidly in KING.
Ancestry inference for all samples usually takes only a few minutes, which makes it well suited to large datasets such as biobank data.
Here, ancestry inference is treated as a separate application from population structure analysis.
In addition to the principal components (PCs) generated by the population structure analysis, each sample is assigned an ancestral label together with its posterior probability.
This explicit ancestral assignment can often help downstream QC as well as various statistical analyses.
Download & Installation
The ancestry inference in KING requires the KING executable (freely downloadable from the KING Download website),
as does any other KING application. In addition, R with the e1071
package must be installed for the Support Vector Machine (SVM) analysis, which is crucial to the ancestry inference in KING.
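Before running the ancestry inference, it can help to confirm that R and e1071 are actually usable. The following is a minimal sketch, not part of KING: the helper name r_package_available is hypothetical, and it assumes the Rscript program that ships with R is on the PATH.

```python
import shutil
import subprocess


def r_package_available(package: str, rscript: str = "Rscript") -> bool:
    """Return True if Rscript is on PATH and can load the given R package."""
    if shutil.which(rscript) is None:
        return False  # R itself is not installed or not on PATH
    # library() raises an R error when the package is missing,
    # which makes Rscript exit with a non-zero status.
    result = subprocess.run(
        [rscript, "-e", f"library({package})"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if r_package_available("e1071"):
        print("R and e1071 are available")
    else:
        print('e1071 not found; in R, run install.packages("e1071")')
```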
General Input Files
The input files for ancestry inference are the reference files and the actual study files,
both in PLINK binary format.
The three reference files are provided as KGref.bed.xz [489MB], KGref.fam.xz [3KB], and KGref.bim.xz [37MB],
which are the xz-compressed forms of KGref.bed, KGref.fam, and KGref.bim.
An example of the study files is ex.tar.gz [1MB], which includes ex.bed, ex.fam, and ex.bim.
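The reference files are distributed with an .xz extension, so they presumably need to be decompressed before KING can read them. The sketch below shows one way to do that with Python's standard lzma module; the helper name decompress_xz is hypothetical, and the command-line xz tool would work equally well.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: str) -> Path:
    """Decompress 'name.xz' to 'name' next to it and return the new path."""
    src_path = Path(src)
    if src_path.suffix != ".xz":
        raise ValueError(f"expected a .xz file, got {src_path.name}")
    dst_path = src_path.with_suffix("")  # KGref.bed.xz -> KGref.bed
    # Stream in chunks so the large .bed file is never held in memory at once.
    with lzma.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst_path


if __name__ == "__main__":
    for name in ("KGref.bed.xz", "KGref.fam.xz", "KGref.bim.xz"):
        if Path(name).exists():
            print("wrote", decompress_xz(name))
```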
Both datasets should be specified as input files in KING through the -b option, separated by a comma (without spaces):
prompt> king -b KGref.bed,ex.bed
One strength of reading multiple datasets into KING is that the same SNP (defined by SNP name, not by position) is properly merged across datasets,
following rules such as:
1. SNPs with unambiguous allele labels can be auto-flipped before merging
2. SNPs with ambiguous allele labels (i.e., A/T, or C/G) are excluded
3. SNPs with inconsistent allele labels (e.g., >3 alleles) are excluded
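The three rules above can be illustrated with a small sketch. This is not KING's implementation: the function merge_decision and its return values are hypothetical, and it assumes biallelic SNPs whose alleles are coded as A, C, G, or T, with any pair of labels that cannot be reconciled by a strand flip treated as inconsistent.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
AMBIGUOUS = (frozenset("AT"), frozenset("CG"))


def merge_decision(ref_alleles, study_alleles):
    """Classify one SNP shared by two datasets as 'keep', 'flip', or 'exclude'."""
    ref = frozenset(a.upper() for a in ref_alleles)
    study = frozenset(a.upper() for a in study_alleles)
    # Rule 2: A/T and C/G read the same on both strands, so the strand
    # cannot be resolved from the labels alone.
    if ref in AMBIGUOUS or study in AMBIGUOUS:
        return "exclude"
    if study == ref:
        return "keep"  # same labels, same strand
    # Rule 1: unambiguous labels that match after complementing are flipped.
    flipped = frozenset(COMPLEMENT.get(a, a) for a in study)
    if flipped == ref:
        return "flip"
    # Rule 3: labels that cannot be reconciled would introduce extra alleles.
    return "exclude"


if __name__ == "__main__":
    print(merge_decision(("A", "G"), ("T", "C")))  # opposite strand
    print(merge_decision(("A", "T"), ("A", "T")))  # strand-ambiguous
    print(merge_decision(("A", "G"), ("A", "C")))  # inconsistent labels
```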
KING Command for Ancestry Inference
The ancestry inference in KING identifies the most likely ancestral group(s) for each study sample
by leveraging known ancestry in a reference dataset, such as the 1000 Genomes Project data recommended here.
The superpopulation groups that can be inferred are AFR, AMR, EAS, EUR, and SAS,
for African, American, East Asian, European, and South Asian, respectively.
The actual ancestry inference in KING is a single command line, in one of the following variations:
prompt> king -b KGref.bed,ex.bed --pca --projection --rplot
prompt> king -b KGref.bed,ex.bed --pca --projection --rplot --prefix ex
prompt> king -b KGref.bed,ex.bed --pca --projection --pngplot
Among the three variations above, the second command specifies the prefix of the output files (here, ex),
and the third command requests PNG plots (instead of the PDF plots generated through --rplot).
Here, the option "--pca --projection" projects each of the ex samples onto the 1000 Genomes PC space,
and then generates the projected PCs for each of the 332 ex samples.
The "--rplot" or "--pngplot" option takes the PCs generated by KING as the input data, and
then runs a machine learning algorithm (the SVM) through a separate R script.
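When the command is launched from a script rather than typed at the prompt, the argument list can be assembled programmatically. The sketch below uses only the options shown above; the helper name king_ancestry_command is hypothetical, and it assumes the king executable is on the PATH.

```python
import shlex


def king_ancestry_command(bed_files, prefix=None, png=False, king="king"):
    """Build the argument list for the ancestry-inference command line."""
    for path in bed_files:
        if "," in path:
            raise ValueError(f"-b uses commas as separators: {path!r}")
    # -b takes all .bed files as one comma-separated argument, without spaces.
    cmd = [king, "-b", ",".join(bed_files), "--pca", "--projection"]
    cmd.append("--pngplot" if png else "--rplot")
    if prefix is not None:
        cmd += ["--prefix", prefix]
    return cmd


if __name__ == "__main__":
    cmd = king_ancestry_command(["KGref.bed", "ex.bed"], prefix="ex")
    print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute KING and raise an error if it exits with a failure status.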
The screen printout of the second command looks like this:
KING 2.2.7 - (c) 2010-2021 Wei-Min Chen
The following parameters are in effect:
Binary File : ../data/KGref,../data/ex (-bname)
Additional Options
Close Relative Inference : --related, --duplicate
Pairwise Relatedness Inference : --kinship, --ibdseg, --ibs, --homog
Inference Parameter : --degree, --seglength
Relationship Application : --unrelated, --cluster, --build
QC Report : --bysample, --bySNP, --roh, --autoQC
QC Parameter : --callrateN, --callrateM
Population Structure : --pca [ON], --mds
Structure Parameter : --projection [1], --pcs
Disease Association : --tdt
Quantitative Trait Association : --mtscore
Association Model : --trait [], --covariate []
Association Parameter : --invnorm, --maxP
Genetic Risk Score : --risk, --model [], --prevalence, --noflip
Computing Parameter : --cpus
Optional Input : --fam [], --bim [], --sexchr [23]
Output : --rplot [ON], --pngplot, --plink
Output Parameter : --prefix [ex], --rpath []
KING starts at Thu May 27 18:29:56 2021
Read in PLINK fam files
../data/KGref.fam...
../data/ex.fam...
PLINK pedigrees loaded: 2741 samples
Read in PLINK bim files
../data/KGref.bim...
../data/ex.bim...
Genotype data consist of 16824 autosome SNPs
PLINK maps loaded: 16824 SNPs
Read in PLINK bed files
../data/KGref.bed...
../data/ex.bed...
PLINK binary genotypes loaded: 2741 samples
KING format genotype data successfully converted
Options in effect:
--pca
--projection
--rplot
--prefix ex
PCA projection starts at Thu May 27 18:30:43 2021
2409 1000 Genomes samples are detected and used as reference.
Preparing matrix (2409 x 2409) for PCA...
16824 SNPs are used in PCA.
SVD starts at Thu May 27 18:30:44 2021
LAPACK is being used...
Largest 10 eigenvalues: 2059.73 1366.88 700.55 618.08 276.33 265.47 238.72 217.65 198.71 189.81
Projecting 332 samples starts at Thu May 27 18:30:44 2021
PCA projection ends at Thu May 27 18:30:44 2021
10 principal components saved in file expc.txt
Ancestry populations are inferred as in ex_InferredAncestry.txt
Ancestry plots are generated in ex_ancestryplot.pdf
KING ends at Thu May 27 18:30:56 2021
The ancestry inference results are provided as both tables and plots. The ancestry table ex_InferredAncestry.txt may look like this:
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
1328 NA06984 -0.011 0.0268 EUR 0.9934 AFR 0.0032 EUR
1328 NA06989 -0.0104 0.0276 EUR 0.9962 AFR 0.0019 EUR
1330 NA12335 -0.0109 0.0267 EUR 0.9948 AFR 0.0024 EUR
1330 NA12336 -0.0101 0.0277 EUR 0.9965 AFR 0.0019 EUR
1330 NA12340 -0.0105 0.0288 EUR 0.9958 AFR 0.0023 EUR
1330 NA12341 -0.0102 0.0265 EUR 0.9924 AFR 0.0036 EUR
1330 NA12342 -0.0106 0.0279 EUR 0.9958 AFR 0.002 EUR
1330 NA12343 -0.0104 0.0273 EUR 0.9953 AFR 0.0023 EUR
1334 NA10846 -0.0115 0.0271 EUR 0.9959 AFR 0.0019 EUR
In the example above, all of the samples shown are inferred as European, each with probability > 99%.
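For larger studies, the table is easier to summarize programmatically than to read by eye. The following sketch counts samples per inferred group and lists samples whose top posterior probability is low. It assumes the file is whitespace-delimited with the nine columns shown above present on every row; the function names and the 0.9 cutoff are hypothetical choices for illustration, not KING defaults.

```python
from collections import Counter


def read_ancestry_table(lines):
    """Parse whitespace-delimited rows into dicts keyed by the header names."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in body]


def summarize(records, min_prob=0.9):
    """Count samples per Ancestry label and list IIDs whose top call is weak."""
    counts = Counter(rec["Ancestry"] for rec in records)
    uncertain = [rec["IID"] for rec in records if float(rec["Pr_1st"]) < min_prob]
    return counts, uncertain


if __name__ == "__main__":
    sample = """\
FID IID PC1 PC2 Anc_1st Pr_1st Anc_2nd Pr_2nd Ancestry
1328 NA06984 -0.011 0.0268 EUR 0.9934 AFR 0.0032 EUR
1328 NA06989 -0.0104 0.0276 EUR 0.9962 AFR 0.0019 EUR
""".splitlines()
    counts, uncertain = summarize(read_ancestry_table(sample))
    print(dict(counts), uncertain)
```

To run it on real output, pass an open file instead of the inline sample, for example read_ancestry_table(open("ex_InferredAncestry.txt")).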
The visualization of ancestry for the ex samples is saved in ex_ancestryplot.pdf.
References Other Than 1000 Genomes
KING provides a convenient way to infer ancestry using the 1000 Genomes reference.
However, other well-characterized datasets can be used as reference data as well.
For this purpose, we also provide stand-alone R code to go along with KING, allowing more flexible ancestry inference.
The stand-alone R code is available on GitHub.
REFERENCE
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM
(2010) Robust relationship inference in genome-wide association studies.
Bioinformatics 26(22):2867-2873
======================================
Last updated: May 28, 2021 by Wei-Min Chen