KING Tutorial: Quality Control (QC)

KING provides integrated QC reporting: QC by SNP, QC by sample, and automated QC. This tutorial also covers run of homozygosity (ROH) analysis, which estimates an inbreeding coefficient for each sample and reports the exact homozygous segments, helping investigators better understand the genetics of their data.

QC BY SNP

Examples of QC-by-SNP analysis are:

  prompt> king -b ex.bed --bySNP
  prompt> king -b ex.bed --cluster --bySNP

The commands above scan the genome and report a variety of QC statistics at each SNP. The additional option --cluster allows family-based QC without using any reported pedigrees. The QC report is saved in kingbySNP.txt. The columns in the QC report file are:

SNP: SNP name
Chr: Chromosome number of the SNP
Pos: Position of the SNP
Label_A: Label of the reference allele
Label_a: Label of the alternative allele
Freq_A: Frequency of the reference allele
N: Total number of samples with non-missing genotypes
N_AA: Total number of samples with genotype AA
N_Aa: Total number of samples with genotype Aa
N_aa: Total number of samples with genotype aa
CallRate: Proportion of samples with non-missing genotypes
N_MZ: Total number of MZ twins or duplicates
N_errMZ: Total number of inconsistencies between duplicates
Err_InMZ: Error rate in duplicates
N_PO: Total number of parent-offspring (PO) pairs
N_HomPO: Total number of "informative" PO pairs (at least one carries the minor homozygote)
N_errPO: Total number of Mendelian inconsistencies (MI) (AA -> aa or aa -> AA) in PO pairs
Err_InPO: Error rate in PO pairs (N_errPO / N_PO)
Err_InHomPO: N_errPO / N_HomPO
N_trio: Total number of parent-offspring (PO) trios
N_HetOff: Total number of heterozygote offspring
N_errTrio: Total number of Mendelian inconsistencies (MI) (AA x aa -> Aa) in PO trios
Err_InTrio: Error rate in PO trios (N_errTrio / N_trio)
Err_InHetTrio: N_errTrio / N_HetOff
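
As a minimal sketch of how this report might be post-processed, the Python below reads a whitespace-delimited table with the column names listed above and flags SNPs whose CallRate falls below a cutoff, or whose error rate among informative PO pairs (N_errPO / N_HomPO) exceeds a cutoff. The delimiter, the helper names read_table and flag_snps, the cutoff values, and the three sample rows (including the SNP names) are assumptions made for illustration; they are not KING output or KING defaults.

```python
import io

def read_table(handle):
    """Parse a whitespace-delimited table with a header row into a list of dicts."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]

def flag_snps(rows, min_call_rate=0.95, max_err_hom_po=0.01):
    """Return names of SNPs failing the call-rate or PO error-rate cutoff."""
    flagged = []
    for row in rows:
        call_rate = float(row["CallRate"])
        n_hom_po = int(row["N_HomPO"])
        # Recompute N_errPO / N_HomPO; with no informative pairs there is
        # no evidence of error, so the rate is treated as 0 (an assumption).
        err = int(row["N_errPO"]) / n_hom_po if n_hom_po else 0.0
        if call_rate < min_call_rate or err > max_err_hom_po:
            flagged.append(row["SNP"])
    return flagged

# Made-up rows using a subset of the columns described above.
SAMPLE = """SNP CallRate N_PO N_HomPO N_errPO
rs0001 0.998 40 12 0
rs0002 0.910 40 10 0
rs0003 0.990 40 8 2
"""

rows = read_table(io.StringIO(SAMPLE))
print(flag_snps(rows))  # ['rs0002', 'rs0003']
```

To try this on a real report, the io.StringIO object could be replaced with an open file handle on kingbySNP.txt, provided the file really is whitespace-delimited with a single header row.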

QC BY SAMPLE

Examples of QC-by-sample analysis are:

  prompt> king -b ex.bed --bysample
  prompt> king -b ex.bed --cluster --bysample

The commands above scan the genome and report a variety of QC statistics for each individual. The additional option --cluster allows family-based QC without using any reported pedigrees. The QC report is saved in kingbySample.txt. The columns in the QC report file are:

FID: Family ID
IID: Individual ID
FA: Father ID
MO: Mother ID
SEX: Sex
N_SNP: Total number of non-missing SNPs on autosomes
Missing: SNP missing rate on autosomes
Heterozygosity: Heterozygosity on autosomes
N_Pair: Total number of SNPs that are non-missing for the parent-offspring (PO) pair in which the individual is involved
N_MIp: Total number of Mendelian inconsistencies (MI) (AA -> aa or aa -> AA) in the PO pair
Err_MIp: Error rate in the PO pair
N_trio: Total number of SNPs that are not missing for the PO trio
N_MIt: Total number of MIs (AA x aa -> Aa) in the PO trio
Err_MIt: Error rate in the PO trio
MI_Removal: Flag for removal 
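
The per-sample report can be screened in a similar way. The sketch below, which assumes a whitespace-delimited file with a header row using the column names above, lists samples whose Missing rate exceeds a cutoff. The function name flag_samples, the 0.05 cutoff (chosen only to mirror a 95% call rate), and the sample rows are hypothetical.

```python
import io

def flag_samples(handle, max_missing=0.05):
    """Return (FID, IID) pairs whose autosomal missing rate exceeds max_missing."""
    header = handle.readline().split()
    flagged = []
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(header, line.split()))
        if float(row["Missing"]) > max_missing:
            flagged.append((row["FID"], row["IID"]))
    return flagged

# Made-up rows using a subset of the columns described above.
SAMPLE = """FID IID Missing Heterozygosity
F1 S1 0.012 0.31
F1 S2 0.080 0.30
F2 S3 0.050 0.29
"""

print(flag_samples(io.StringIO(SAMPLE)))  # [('F1', 'S2')]
```

The comparison is strict, so a sample sitting exactly at the cutoff (S3 here) is kept; whether KING treats the boundary the same way is not stated in this tutorial.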

AUTOMATED QC

The --autoQC option performs a straightforward QC pipeline: sample-level QC (at call rate 95% by default, or a different call rate set by --callrateN), SNP-level QC (at call rate 95% by default, or a different call rate set by --callrateM), and gender QC. This analysis generates a list of SNPs to be removed and a list of samples to be removed. An example of autoQC analysis is:

  prompt> king -b ex.bed --autoQC


RUN OF HOMOZYGOSITY

The --roh option scans the genome and identifies runs of homozygosity (ROH). An example of ROH analysis is:

  prompt> king -b ex.bed --roh

The inbreeding coefficient for each sample is written to the file king.roh, and the exact ROH segments are saved in a gzipped file, king.rohseg.gz. The inbreeding coefficient file king.roh will look like:

FID     ID      FA      MO      SEX     MaxROH  FInbred
1328    NA06984 0       0       1       0.0     0.0000
1328    NA06989 0       0       2       0.0     0.0000
1330    NA12335 NA12340 NA12341 1       0.0     0.0000
1330    NA12336 NA12342 NA12343 2       0.0     0.0000
1330    NA12340 0       0       1       0.0     0.0000
1330    NA12341 0       0       2       0.0     0.0000
1330    NA12342 0       0       1       31.3    0.0449
1330    NA12343 0       0       2       0.0     0.0000
1334    NA10846 NA12144 NA12145 1       0.0     0.0000

The ROH segment file king.rohseg.gz will look like:

FID     ID      Chr     StartMB StopMB  StartSNP        StopSNP         N_SNP   Length
1330    NA12342 5       70.869  97.455  AFFX-SNP_7697354__rs276593      rs10866786      156     26.6
1330    NA12342 5       136.510 167.849 rs11745163      rs582906        224     31.3
1330    NA12342 6       25.472  31.787  rs13215347      rs805286        69      6.3
1346    NA10852 2       30.803  51.636  rs2681682       rs2698026       162     20.8
1459    NA12874 1       148.175 247.083 rs1868992       rs12058711      692     98.9
Y045    NA19201 5       70.869  85.089  AFFX-SNP_7697354__rs276593      rs10063186      103     14.2
Y057    NA19224 10      90.301  105.270 rs7901991       rs12268628      96      15.0
Y079    NA19113 17      0.116   10.842  rs4247500       rs4792080       114     10.7
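
A short sketch of how the segment file might be summarized per individual is given below. It counts the segments, sums their lengths, and records the longest one, using the first four rows of the listing above as input. The function name summarize_roh and the output field names are hypothetical, and Length is assumed to be in Mb because it matches StopMB minus StartMB in the rows shown.

```python
import io
from collections import defaultdict

# First four data rows copied from the segment listing above.
SEGMENTS = """FID ID Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12342 5 70.869 97.455 AFFX-SNP_7697354__rs276593 rs10866786 156 26.6
1330 NA12342 5 136.510 167.849 rs11745163 rs582906 224 31.3
1330 NA12342 6 25.472 31.787 rs13215347 rs805286 69 6.3
1346 NA10852 2 30.803 51.636 rs2681682 rs2698026 162 20.8
"""

def summarize_roh(handle):
    """Summarize ROH segments per (FID, ID): count, total length, longest segment."""
    header = handle.readline().split()
    summary = defaultdict(lambda: {"n_segments": 0, "total_mb": 0.0, "max_mb": 0.0})
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        row = dict(zip(header, line.split()))
        entry = summary[(row["FID"], row["ID"])]
        length = float(row["Length"])
        entry["n_segments"] += 1
        entry["total_mb"] += length
        entry["max_mb"] = max(entry["max_mb"], length)
    return dict(summary)

summary = summarize_roh(io.StringIO(SEGMENTS))
for (fid, iid), stats in summary.items():
    print(fid, iid, stats["n_segments"], round(stats["total_mb"], 1), stats["max_mb"])
```

For NA12342 the longest segment is 31.3, which agrees with the MaxROH value shown for that sample in king.roh. For the real gzipped file, the handle could come from Python's gzip.open("king.rohseg.gz", "rt") instead of io.StringIO.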


OTHER PARAMETERS

The following parameters can also be specified:

--callrateN specifies the sample-level call rate threshold (95% by default).

--callrateM specifies the SNP-level call rate threshold (95% by default).

--prefix specifies the prefix of the output file names for QC results. "king" is used by default.

--cpus specifies the number of CPU cores to use for parallel computing. If not specified, the default is half of the total number of (logical) cores.


REFERENCE

Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26(22):2867-2873


======================================
Last updated: February 21, 2018 by Wei-Min Chen


 
 
