KING Tutorial: Ancestry Inference


The ancestral group of each study sample can be identified reliably and rapidly in KING. Ancestry inference for all samples usually takes only a few minutes, which makes it well suited to large datasets such as biobank data.

Here, ancestry inference is treated as a separate application from population structure analysis. In addition to the principal components (PCs) generated by the population structure analysis, each sample is assigned an ancestral label together with its posterior probabilities. This explicit ancestry assignment often helps downstream QC and statistical analyses.


Download & Installation

Ancestry inference in KING requires the KING executable (freely downloadable from the KING Download website), the same one used by every other KING application. In addition, R with the e1071 package must be installed for the Support Vector Machine (SVM) analysis, which is crucial to ancestry inference in KING.


General Input Files

The input files for ancestry inference are the reference files and the actual study files, both in PLINK binary format. The three reference files are provided as KGref.bed.xz [489MB], KGref.fam.xz [3KB], and KGref.bim.xz [37MB], which contain KGref.bed, KGref.fam, and KGref.bim. An example of the study files is ex.tar.gz [1MB], which includes ex.bed, ex.fam, and ex.bim. Both datasets should be specified as input files in KING through option -b, separated by a comma (without spaces):

  prompt> king -b KGref.bed,ex.bed
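
The command above uses the uncompressed names KGref.bed, KGref.fam, and KGref.bim. Assuming the downloaded .xz archives must be decompressed first, the following is a minimal Python sketch that does so with the standard library only. The helper name decompress_xz is hypothetical and not part of KING; any xz tool would do the same job.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src, dst=None):
    """Decompress an .xz file; by default write next to it, dropping '.xz'."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Stream the data so that a large .bed file is never held in memory.
    with lzma.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


if __name__ == "__main__":
    # File names as listed in this tutorial; missing files are skipped.
    for name in ("KGref.bed.xz", "KGref.fam.xz", "KGref.bim.xz"):
        if Path(name).exists():
            print("wrote", decompress_xz(name))
```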
One strength of reading multiple datasets into KING is that the same SNP (defined by SNP name, not by position) is properly merged across datasets, following rules such as:
1. SNPs with unambiguous allele labels can be auto-flipped before merging
2. SNPs with ambiguous allele labels (i.e., A/T, or C/G) are excluded
3. SNPs with inconsistent allele labels (e.g., >3 alleles) are excluded
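
The three rules above can be illustrated with a small Python sketch. It is not KING's implementation, whose details are not described here. It assumes biallelic SNPs labelled with A/C/G/T, and the function name classify_snp and its return values ("merge", "flip", "exclude") are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
# Allele pairs that read the same on both strands cannot be strand-resolved.
AMBIGUOUS = (frozenset("AT"), frozenset("CG"))


def classify_snp(ref_alleles, study_alleles):
    """Decide how one SNP, matched by name, would be handled when merging."""
    ref = frozenset(a.upper() for a in ref_alleles)
    study = frozenset(a.upper() for a in study_alleles)
    # Assumption of this sketch: only two distinct A/C/G/T labels are handled.
    if len(ref) != 2 or len(study) != 2 or not (ref | study) <= set(COMPLEMENT):
        return "exclude"
    # Rule 2: ambiguous labels (A/T or C/G) are excluded.
    if ref in AMBIGUOUS or study in AMBIGUOUS:
        return "exclude"
    if study == ref:
        return "merge"
    # Rule 1: unambiguous labels on the opposite strand can be flipped.
    if frozenset(COMPLEMENT[a] for a in study) == ref:
        return "flip"
    # Rule 3: the labels still disagree, so the SNP is excluded.
    return "exclude"


if __name__ == "__main__":
    for ref, study in [("AG", "GA"), ("AG", "TC"), ("AT", "AT"), ("AG", "AC")]:
        print(ref, study, classify_snp(ref, study))
```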


KING Command for Ancestry Inference

Ancestry inference in KING identifies the most likely ancestral group(s) for each study sample by leveraging known ancestry in a reference dataset, such as the 1000 Genomes Project data recommended here. The superpopulation groups that can be inferred are AFR, AMR, EAS, EUR, and SAS, for African, admixed American, East Asian, European, and South Asian, respectively.
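
The idea of labelling study samples from a reference with known ancestry can be sketched in a few lines of Python. KING itself relies on an SVM run in R with the e1071 package. The nearest-centroid rule below is a deliberately simpler stand-in, used only to show how reference labels and PC coordinates combine into an assignment. The function names and the toy PC values are hypothetical.

```python
from collections import defaultdict
from math import dist


def centroids(reference):
    """Average the PC vectors of reference samples within each ancestry label."""
    sums, counts = {}, defaultdict(int)
    for label, pcs in reference:
        if label not in sums:
            sums[label] = [0.0] * len(pcs)
        sums[label] = [s + x for s, x in zip(sums[label], pcs)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def assign(pcs, cents):
    """Return the label whose centroid is closest (Euclidean) to the sample."""
    return min(cents, key=lambda label: dist(pcs, cents[label]))


if __name__ == "__main__":
    # Toy (PC1, PC2) values made up for illustration, not real 1000 Genomes PCs.
    reference = [("EUR", (-0.010, 0.030)), ("EUR", (-0.012, 0.028)),
                 ("AFR", (0.030, 0.000)), ("AFR", (0.032, -0.002))]
    print(assign((-0.011, 0.027), centroids(reference)))
```

Unlike the SVM used by KING, this stand-in yields no posterior probabilities, so it cannot reproduce the Pr_1st and Pr_2nd columns of the KING output.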

The actual ancestry inference in KING takes a single command line, in one of the following variations:

 prompt> king -b KGref.bed,ex.bed --pca --projection --rplot
 prompt> king -b KGref.bed,ex.bed --pca --projection --rplot --prefix ex
 prompt> king -b KGref.bed,ex.bed --pca --projection --pngplot
Of the three variations above, the second specifies the prefix of the output files (here, ex), and the third produces PNG plots instead of the PDF plots generated by --rplot. The options "--pca --projection" project each of the 332 ex samples onto the 1000 Genomes PC space and generate the projected PCs. The "--rplot" or "--pngplot" option takes the PCs generated by KING as input and runs a machine learning algorithm in separate R code.
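
For scripted pipelines, the three variations can be assembled programmatically. The Python sketch below uses only the options shown above (-b, --pca, --projection, --rplot, --pngplot, --prefix). The function names are hypothetical, and it assumes the KING executable is on the PATH under the name king.

```python
import shutil
import subprocess


def king_ancestry_command(bed_files, prefix=None, png=False, king="king"):
    """Build the argument list for one ancestry-inference run."""
    # Input datasets are joined with commas and no spaces, as -b expects.
    cmd = [king, "-b", ",".join(bed_files), "--pca", "--projection"]
    cmd.append("--pngplot" if png else "--rplot")
    if prefix is not None:
        cmd += ["--prefix", prefix]
    return cmd


def run_king(cmd):
    """Run the command if the executable can be found; raise otherwise."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the second variation; pass the list to run_king() to execute it.
    print(" ".join(king_ancestry_command(["KGref.bed", "ex.bed"], prefix="ex")))
```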

The screen printout for the second command (with --prefix ex) looks like this:

KING 2.2.7 - (c) 2010-2021 Wei-Min Chen

The following parameters are in effect:
                   Binary File : ../data/KGref,../data/ex (-bname)

Additional Options
         Close Relative Inference : --related, --duplicate
   Pairwise Relatedness Inference : --kinship, --ibdseg, --ibs, --homog
              Inference Parameter : --degree, --seglength
         Relationship Application : --unrelated, --cluster, --build
                        QC Report : --bysample, --bySNP, --roh, --autoQC
                     QC Parameter : --callrateN, --callrateM
             Population Structure : --pca [ON], --mds
              Structure Parameter : --projection [1], --pcs
              Disease Association : --tdt
   Quantitative Trait Association : --mtscore
                Association Model : --trait [], --covariate []
            Association Parameter : --invnorm, --maxP
               Genetic Risk Score : --risk, --model [], --prevalence, --noflip
              Computing Parameter : --cpus
                   Optional Input : --fam [], --bim [], --sexchr [23]
                           Output : --rplot [ON], --pngplot, --plink
                 Output Parameter : --prefix [ex], --rpath []

KING starts at Thu May 27 18:29:56 2021
Read in PLINK fam files
        ../data/KGref.fam...
        ../data/ex.fam...
  PLINK pedigrees loaded: 2741 samples
Read in PLINK bim files
        ../data/KGref.bim...
        ../data/ex.bim...
  Genotype data consist of 16824 autosome SNPs
  PLINK maps loaded: 16824 SNPs
Read in PLINK bed files
        ../data/KGref.bed...
        ../data/ex.bed...
  PLINK binary genotypes loaded: 2741 samples
  KING format genotype data successfully converted

Options in effect:
        --pca
        --projection
        --rplot
        --prefix ex

PCA projection starts at Thu May 27 18:30:43 2021
2409 1000 Genomes samples are detected and used as reference.
Preparing matrix (2409 x 2409) for PCA...
  16824 SNPs are used in PCA.
SVD starts at Thu May 27 18:30:44 2021
  LAPACK is being used...
Largest 10 eigenvalues: 2059.73 1366.88 700.55 618.08 276.33 265.47 238.72 217.65 198.71 189.81
Projecting 332 samples starts at Thu May 27 18:30:44 2021
PCA projection ends at Thu May 27 18:30:44 2021
10 principal components saved in file expc.txt
Ancestry populations are inferred as in ex_InferredAncestry.txt
Ancestry plots are generated in ex_ancestryplot.pdf
KING ends at Thu May 27 18:30:56 2021

The ancestry inference results come as both tables and plots. The ancestry table ex_InferredAncestry.txt may look like this:

FID     IID     PC1     PC2     Anc_1st Pr_1st  Anc_2nd Pr_2nd  Ancestry
1328    NA06984 -0.011  0.0268  EUR     0.9934  AFR     0.0032  EUR
1328    NA06989 -0.0104 0.0276  EUR     0.9962  AFR     0.0019  EUR
1330    NA12335 -0.0109 0.0267  EUR     0.9948  AFR     0.0024  EUR
1330    NA12336 -0.0101 0.0277  EUR     0.9965  AFR     0.0019  EUR
1330    NA12340 -0.0105 0.0288  EUR     0.9958  AFR     0.0023  EUR
1330    NA12341 -0.0102 0.0265  EUR     0.9924  AFR     0.0036  EUR
1330    NA12342 -0.0106 0.0279  EUR     0.9958  AFR     0.002   EUR
1330    NA12343 -0.0104 0.0273  EUR     0.9953  AFR     0.0023  EUR
1334    NA10846 -0.0115 0.0271  EUR     0.9959  AFR     0.0019  EUR
In the example above, all nine samples shown are inferred as European, each with probability > 99%.
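
A table in this layout is easy to summarize in a script. The Python sketch below counts samples per inferred ancestry and lists samples whose top probability falls below a cutoff. It assumes whitespace-delimited columns with the header shown above and no empty fields. The function names and the 0.9 cutoff are hypothetical choices, not KING defaults.

```python
from collections import Counter
from pathlib import Path


def read_ancestry(lines):
    """Parse header plus rows into a list of dicts keyed by column name."""
    rows = (line.split() for line in lines if line.strip())
    header = next(rows)
    return [dict(zip(header, fields)) for fields in rows]


def summarize(records, min_prob=0.9):
    """Count samples per Ancestry label and collect low-confidence IIDs."""
    counts = Counter(r["Ancestry"] for r in records)
    uncertain = [r["IID"] for r in records if float(r["Pr_1st"]) < min_prob]
    return counts, uncertain


if __name__ == "__main__":
    path = Path("ex_InferredAncestry.txt")
    if path.exists():
        counts, uncertain = summarize(read_ancestry(path.read_text().splitlines()))
        print(dict(counts))
        print("Pr_1st below cutoff:", uncertain)
```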

[Figure: ancestry plot for the ex samples (ex_ancestryplot.pdf)]


References Other Than 1000 Genomes

KING provides a convenient way to infer ancestry with the 1000 Genomes reference. However, other well-characterized datasets can be used as reference data as well. For this purpose, we also provide stand-alone R code to accompany KING and allow more flexible ancestry inference. The stand-alone R code is available on GitHub.


REFERENCE

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873


======================================
Last updated: May 28, 2021 by Wei-Min Chen


 
 
