KING Tutorial: Quality Control (QC)
KING provides integrated quality control (QC) for genotype data.
The QC reports include QC-by-sample, QC-by-SNP, and automated QC.
This tutorial also covers run of homozygosity (ROH) analysis,
which estimates the inbreeding coefficient of each sample and reports the exact ROH segments,
helping investigators better understand the genetics of their data.
QC BY SNP
Examples of QC-by-SNP analysis are:
prompt> king -b ex.bed --bySNP
prompt> king -b ex.bed --cluster --bySNP
The commands above scan the genome and report a variety of QC statistics at each SNP.
The additional option --cluster allows family-based QC without using any reported pedigrees.
The QC report is saved in kingbySNP.txt. The columns in the QC report file are:
SNP: SNP name
Chr: Chromosome number of the SNP
Pos: Position of the SNP
Label_A: Label of the reference allele
Label_a: Label of the alternative allele
Freq_A: Frequency of the reference allele
N: Total number of samples with non-missing genotypes
N_AA: Total number of samples with genotype AA
N_Aa: Total number of samples with genotype Aa
N_aa: Total number of samples with genotype aa
CallRate: Proportion of samples with non-missing genotypes
N_MZ: Total number of MZ twins or duplicates
N_errMZ: Total number of inconsistencies between duplicates
Err_InMZ: Error rate in duplicates
N_PO: Total number of parent-offspring (PO) pairs
N_HomPO: Total number of "informative" PO pairs (at least one member is homozygous for the minor allele)
N_errPO: Total number of Mendelian inconsistencies (MI) (AA->aa or aa->AA) in PO pairs
Err_InPO: Error rate in PO pairs (N_errPO / N_PO)
Err_InHomPO: N_errPO / N_HomPO
N_trio: Total number of parent-offspring (PO) trios
N_HetOff: Total number of heterozygote offspring
N_errTrio: Total number of Mendelian inconsistencies (MI) (AA x aa -> Aa) in PO trios
Err_InTrio: Error rate in PO trios (N_errTrio / N_trio)
Err_InHetTrio: N_errTrio / N_HetOff
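The error-rate columns are ratios of the count columns, so they can be recomputed or thresholded downstream. The Python sketch below assumes that kingbySNP.txt is whitespace-delimited with a header row naming the columns listed above. The function name, the thresholds (min_call_rate, max_err_po), and the two example rows are made up for illustration; they are not KING defaults or real KING output.

```python
import io

def flag_snps(lines, min_call_rate=0.95, max_err_po=0.01):
    """Return names of SNPs failing a call-rate or PO error-rate threshold.

    `lines` is any iterable of text lines (for example an open kingbySNP.txt).
    Both thresholds are illustrative assumptions, not KING defaults.
    """
    rows = iter(lines)
    header = next(rows).split()
    flagged = []
    for line in rows:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rec = dict(zip(header, fields))
        call_rate = float(rec["CallRate"])
        n_po = int(rec["N_PO"])
        # Err_InPO is defined as N_errPO / N_PO; guard against N_PO == 0.
        err_po = int(rec["N_errPO"]) / n_po if n_po else 0.0
        if call_rate < min_call_rate or err_po > max_err_po:
            flagged.append(rec["SNP"])
    return flagged

# Synthetic example restricted to the columns the function reads.
example = io.StringIO(
    "SNP CallRate N_PO N_errPO\n"
    "rsA 0.99 100 0\n"
    "rsB 0.90 100 5\n"
)
print(flag_snps(example))  # ['rsB']
```

Looking columns up by header name rather than by position keeps the sketch independent of the exact column order in the report.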
QC BY SAMPLE
Examples of QC-by-sample analysis are:
prompt> king -b ex.bed --bysample
prompt> king -b ex.bed --cluster --bysample
The commands above scan the genome and report a variety of QC
statistics for each individual.
The additional option --cluster allows family-based QC without using any reported pedigrees.
The QC report is saved in kingbySample.txt. The columns in the QC report file are:
FID: Family ID
IID: Individual ID
FA: Father ID
MO: Mother ID
SEX: Sex
N_SNP: Total number of non-missing SNPs on autosomes
Missing: SNP missing rate on autosomes
Heterozygosity: Heterozygosity on autosomes
N_Pair: Total number of SNPs that are non-missing in the parent-offspring (PO) pair in which the individual is involved
N_MIp: Total number of Mendelian inconsistencies (MI) (AA -> aa or aa -> AA) in the PO pair
Err_MIp: Error rate in the PO pair
N_trio: Total number of SNPs that are not missing for the PO trio
N_MIt: Total number of MIs (AA x aa -> Aa) in the PO trio
Err_MIt: Error rate in the PO trio
MI_Removal: Flag for removal
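A per-sample report of this kind is typically screened for high missingness and unusual heterozygosity. The sketch below is one way to do that in Python, assuming kingbySample.txt is whitespace-delimited with a header row naming the columns above. The function name, the 5% missing-rate cutoff (chosen to mirror a 95% call rate), and the mean ± n_sd standard deviation rule for heterozygosity are assumptions for illustration, not KING's own removal criteria.

```python
import statistics

def flag_samples(lines, max_missing=0.05, n_sd=3.0):
    """Map (FID, IID) to the list of reasons a sample looks problematic.

    `lines` is any iterable of text lines (for example an open
    kingbySample.txt). Cutoffs are illustrative assumptions.
    """
    rows = iter(lines)
    header = next(rows).split()
    recs = [dict(zip(header, ln.split())) for ln in rows if ln.strip()]
    het = [float(r["Heterozygosity"]) for r in recs]
    mean = statistics.mean(het)
    sd = statistics.pstdev(het)  # population SD across all samples
    flagged = {}
    for rec, h in zip(recs, het):
        reasons = []
        if float(rec["Missing"]) > max_missing:
            reasons.append("missing")
        # Heterozygosity outlier: more than n_sd SDs away from the mean.
        if sd > 0 and abs(h - mean) > n_sd * sd:
            reasons.append("heterozygosity")
        if reasons:
            flagged[(rec["FID"], rec["IID"])] = reasons
    return flagged
```

With an open report, the call would look like `flag_samples(open("kingbySample.txt"))`. Outlier rules based on standard deviations need a reasonably large sample to be meaningful, so n_sd should be chosen with the cohort size in mind.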
AUTOMATED QC
The --autoQC option runs a straightforward QC pipeline consisting of
sample-level QC (at call rate 95% by default, or a different call rate set by --callrateN),
SNP-level QC (at call rate 95% by default, or a different call rate set by --callrateM), and gender QC.
This analysis generates a list of SNPs to be removed and a list of samples to be removed.
An example of autoQC analysis is:
prompt> king -b ex.bed --autoQC
RUN OF HOMOZYGOSITY
The --roh option scans the genome and identifies runs of homozygosity (ROH). An example of ROH analysis is:
prompt> king -b ex.bed --roh
The inbreeding coefficient of each sample is written to the file king.roh,
and the exact ROH segments are saved in a gzipped file, king.rohseg.gz.
The inbreeding coefficient file king.roh looks like this:
FID ID FA MO SEX MaxROH FInbred
1328 NA06984 0 0 1 0.0 0.0000
1328 NA06989 0 0 2 0.0 0.0000
1330 NA12335 NA12340 NA12341 1 0.0 0.0000
1330 NA12336 NA12342 NA12343 2 0.0 0.0000
1330 NA12340 0 0 1 0.0 0.0000
1330 NA12341 0 0 2 0.0 0.0000
1330 NA12342 0 0 1 31.3 0.0449
1330 NA12343 0 0 2 0.0 0.0000
1334 NA10846 NA12144 NA12145 1 0.0 0.0000
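To pull out the samples with evidence of inbreeding, the FInbred column can be filtered directly. The Python sketch below assumes king.roh is whitespace-delimited with the header shown above; the function name and the min_f cutoff are hypothetical. The example rows are copied from the excerpt above.

```python
def inbred_samples(lines, min_f=0.0):
    """Return (FID, ID, FInbred) for samples with FInbred above min_f.

    `lines` is any iterable of text lines, such as an open king.roh file.
    """
    rows = iter(lines)
    header = next(rows).split()
    out = []
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        rec = dict(zip(header, line.split()))
        f = float(rec["FInbred"])
        if f > min_f:
            out.append((rec["FID"], rec["ID"], f))
    return out

# Three rows taken from the king.roh excerpt above.
excerpt = """FID ID FA MO SEX MaxROH FInbred
1330 NA12341 0 0 2 0.0 0.0000
1330 NA12342 0 0 1 31.3 0.0449
1330 NA12343 0 0 2 0.0 0.0000
""".splitlines()
print(inbred_samples(excerpt))  # [('1330', 'NA12342', 0.0449)]
```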
The ROH segment file king.rohseg.gz will look like:
FID ID Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12342 5 70.869 97.455 AFFX-SNP_7697354__rs276593 rs10866786 156 26.6
1330 NA12342 5 136.510 167.849 rs11745163 rs582906 224 31.3
1330 NA12342 6 25.472 31.787 rs13215347 rs805286 69 6.3
1346 NA10852 2 30.803 51.636 rs2681682 rs2698026 162 20.8
1459 NA12874 1 148.175 247.083 rs1868992 rs12058711 692 98.9
Y045 NA19201 5 70.869 85.089 AFFX-SNP_7697354__rs276593 rs10063186 103 14.2
Y057 NA19224 10 90.301 105.270 rs7901991 rs12268628 96 15.0
Y079 NA19113 17 0.116 10.842 rs4247500 rs4792080 114 10.7
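In the excerpts above, the Length column matches StopMB minus StartMB (for example, 97.455 - 70.869 ≈ 26.6), and the longest segment of NA12342 (31.3) equals that sample's MaxROH in king.roh. The Python sketch below summarizes segments per sample so such a cross-check can be scripted. It assumes the segment file is whitespace-delimited with the header shown above; the function name is hypothetical.

```python
from collections import defaultdict

def summarize_roh(lines):
    """Map (FID, ID) to (total ROH length, longest ROH segment).

    `lines` is any iterable of text lines from the ROH segment file.
    Lengths are in the units of the Length column.
    """
    rows = iter(lines)
    header = next(rows).split()
    total = defaultdict(float)
    longest = defaultdict(float)
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        rec = dict(zip(header, line.split()))
        key = (rec["FID"], rec["ID"])
        length = float(rec["Length"])
        total[key] += length
        longest[key] = max(longest[key], length)
    # Round totals to one decimal, matching the precision of Length.
    return {k: (round(total[k], 1), longest[k]) for k in total}

# The three NA12342 segments from the excerpt above.
excerpt = """FID ID Chr StartMB StopMB StartSNP StopSNP N_SNP Length
1330 NA12342 5 70.869 97.455 AFFX-SNP_7697354__rs276593 rs10866786 156 26.6
1330 NA12342 5 136.510 167.849 rs11745163 rs582906 224 31.3
1330 NA12342 6 25.472 31.787 rs13215347 rs805286 69 6.3
""".splitlines()
print(summarize_roh(excerpt))  # {('1330', 'NA12342'): (64.2, 31.3)}
```

For the gzipped file itself, the lines can be supplied with Python's standard library, for example `summarize_roh(gzip.open("king.rohseg.gz", "rt"))` after `import gzip`.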
OTHER PARAMETERS
The following parameters can also be specified:
--callrateN specifies the sample-level call rate threshold (95% by default).
--callrateM specifies the SNP-level call rate threshold (95% by default).
--prefix specifies the prefix of the output file names for QC results. "king" is used by default.
--cpus specifies the number of CPU cores to be used for parallel computing. If not
specified, the default is half of the total number of (logical) cores.
REFERENCE
Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM
(2010) Robust relationship inference in genome-wide association studies.
Bioinformatics 26(22):2867-2873
======================================
Last updated: February 21, 2018 by Wei-Min Chen