KING Tutorial: Population Structure Inference
KING can be used to identify population substructure using high-throughput SNP data. Two methods are available in KING: Multidimensional Scaling (MDS) and Principal Component Analysis (PCA).
Warning: Precompiled KING binaries with versions lower than 2.2.3 are not suitable for population structure analysis in larger datasets because they lack the LAPACK libraries.
Please download KING precompiled binaries version 2.2.4 or later for Linux systems.
Multidimensional Scaling (MDS)
Multidimensional Scaling (MDS) with Euclidean distance is highly recommended for identifying population substructure. The "--mds" option needs to be specified for the MDS analysis:
prompt> king -b ex.bed --mds
prompt> king -b ex.bed --mds --rplot
The top principal components / ancestry coordinates (20 by default) are saved in the file kingpc.txt, which may look like this:
FID IID FA MO SEX AFF PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20
1328 NA06984 0 0 1 1 -0.0545 0.0117 -0.0179 0.0081 -0.0293 0.0126 -0.0077 0.0143 -0.0061 0.0159 -0.0055 0.0260 -0.0184 0.0079 -0.0121 0.0143 0.0024 -0.0112 0.0204 -0.0265
1328 NA06989 0 0 2 1 -0.0542 0.0031 -0.0030 0.0115 0.0070 -0.0110 -0.0242 -0.0006 0.0078 0.0079 -0.0094 0.0137 0.0087 0.0036 -0.0299 -0.0031 -0.0149 -0.0054 0.0348 -0.0082
1330 NA12335 NA12340 NA12341 1 1 -0.0550 0.0063 -0.0353 -0.0021 0.1184 0.0747 -0.0337 -0.1091 -0.0734 0.0203 0.0146 0.0174 -0.1601 -0.0513 -0.0819 -0.0141 -0.0115 -0.0557 0.0547 0.0286
1330 NA12336 NA12342 NA12343 2 1 -0.0548 0.0380 -0.0058 0.0276 -0.0665 -0.0796 0.0319 0.0224 -0.1627 -0.0613 -0.1429 -0.1600 0.0735 -0.0596 0.0093 -0.0936 -0.1194 -0.1304 -0.0086 -0.0362
1330 NA12340 0 0 1 1 -0.0549 0.0095 -0.0274 -0.0035 0.0664 0.0375 -0.0347 -0.0736 -0.0648 0.0177 0.0151 -0.0015 -0.1308 -0.0444 -0.0518 -0.0305 -0.0007 -0.0363 0.0370 0.0228
1330 NA12341 0 0 2 1 -0.0534 0.0020 -0.0204 0.0067 0.0910 0.0755 -0.0168 -0.0788 -0.0445 0.0119 0.0090 0.0288 -0.0859 -0.0281 -0.0617 0.0051 -0.0188 -0.0488 0.0261 0.0324
1330 NA12342 0 0 1 1 -0.0547 0.0295 -0.0072 0.0189 -0.0535 -0.0451 0.0273 0.0192 -0.1190 -0.0273 -0.1062 -0.1218 0.0484 -0.0391 0.0045 -0.0634 -0.0821 -0.0998 0.0171 -0.0216
1330 NA12343 0 0 2 1 -0.0546 0.0219 -0.0140 0.0202 -0.0423 -0.0595 0.0185 0.0123 -0.1038 -0.0466 -0.0910 -0.1054 0.0649 -0.0589 0.0123 -0.0600 -0.0764 -0.0885 -0.0220 -0.0264
1334 NA10846 NA12144 NA12145 1 1 -0.0561 0.0068 -0.0126 -0.0318 -0.0476 0.0463 -0.0614 0.0760 0.0276 0.0580 0.0319 -0.0216 0.0215 -0.0286 0.0122 -0.0671 0.0552 0.1801 -0.1843 0.0054
Each row provides summary information for one sample. The top 10 principal components / ancestry coordinates are in the 7th through 16th columns.
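The column layout described above can be read programmatically. The following is a minimal sketch in Python, not part of KING itself: the function name read_king_pcs is hypothetical, the file is assumed to be whitespace-delimited with a header row as in the sample, and the embedded rows are copied from the sample output but truncated to the first three PC columns for brevity.

```python
import io

# Rows copied from the sample kingpc.txt above, truncated to three PC
# columns so the example stays short (an assumption for illustration only).
SAMPLE = """\
FID IID FA MO SEX AFF PC1 PC2 PC3
1328 NA06984 0 0 1 1 -0.0545 0.0117 -0.0179
1328 NA06989 0 0 2 1 -0.0542 0.0031 -0.0030
1330 NA12335 NA12340 NA12341 1 1 -0.0550 0.0063 -0.0353
"""


def read_king_pcs(handle):
    """Return {(FID, IID): [PC1, PC2, ...]} from a whitespace-delimited file.

    Hypothetical helper: it locates the PC columns by header name instead of
    relying on fixed column positions.
    """
    header = handle.readline().split()
    pc_idx = [i for i, name in enumerate(header) if name.startswith("PC")]
    coords = {}
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        coords[(fields[0], fields[1])] = [float(fields[i]) for i in pc_idx]
    return coords


pcs = read_king_pcs(io.StringIO(SAMPLE))
print(pcs[("1328", "NA06984")])
```

To read a real output file, the same function could be called as read_king_pcs(open("kingpc.txt")). Looking the columns up by name keeps the sketch independent of how many PCs were requested.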
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is a widely used method to identify population substructure. Please run LD-pruning prior to PCA. The "--pca" option needs to be specified:
prompt> king -b ex.bed --pca
prompt> king -b ex.bed --pca --rplot
The top 10 principal components / ancestry coordinates are saved in the file kingpc.txt, which has the same format as kingpc.txt from the --mds analysis.
OTHER PARAMETERS
The following parameters can also be specified during population structure inference in KING:
--pcs specifies the number of PCs for PCA/MDS. The default is 10.
--projection goes with --pca or --mds to project affected samples onto the reference samples' PC space.
--prefix specifies the name of the output file for PC results.
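When these options are combined in a scripted pipeline, it can help to assemble the command line in one place. The sketch below is illustrative only: build_king_command is a hypothetical helper, and it assumes that --pcs and --prefix each take their value as the next space-separated argument, which this tutorial does not state explicitly. --projection is left out because its usage is not detailed here.

```python
def build_king_command(bed_path, method="mds", n_pcs=None, prefix=None, rplot=False):
    """Assemble a KING argument list for population structure inference.

    Hypothetical helper. ASSUMPTION: --pcs and --prefix are followed by
    their value as a separate argument; check the KING manual before use.
    """
    if method not in ("mds", "pca"):
        raise ValueError("method must be 'mds' or 'pca'")
    cmd = ["king", "-b", bed_path, "--" + method]
    if n_pcs is not None:
        cmd += ["--pcs", str(n_pcs)]
    if prefix is not None:
        cmd += ["--prefix", prefix]
    if rplot:
        cmd.append("--rplot")
    return cmd


# Only prints the command; nothing is executed here.
cmd = build_king_command("ex.bed", method="pca", n_pcs=20, prefix="mystudy")
print(" ".join(cmd))
```

The returned list could be passed to subprocess.run once the argument syntax has been confirmed against the installed KING version.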
REFERENCE
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM
(2010) Robust relationship inference in genome-wide association studies.
Bioinformatics 26(22):2867-2873
======================================
Last updated: Oct 11, 2019 by Wei-Min Chen