rr
New Member
Posts: 3
Post by rr on Jul 28, 2016 7:35:27 GMT
Hi! I am currently trying to reproduce the results in the Lee et al. (2011) paper using the WTCCC-CD data (5k individuals, 500k SNPs). Starting with the QC, I applied the same filters as in the paper, using these options:
- exclude SNPs with MAFs < 0.01: --maf 0.01
- exclude SNPs with missing rates > 0.05: --geno 0.05
- exclude SNPs whose p-values were < 0.05 for the H-W test: --hwe 0.05
- exclude individuals with missing rates > 0.01: --mind 0.01
- exclude pairs with an estimated relationship of > 0.05: --grm-cutoff 0.05
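The last filter in the list acts on pairs of individuals rather than on single SNPs or individuals, so there is more than one way to apply it. Below is a minimal Python sketch, on made-up IDs and relatedness values, of two options: dropping one member of each related pair, or dropping both. The helper names and the rule for choosing which member to drop are assumptions for illustration only; this is not GCTA's actual selection algorithm.

```python
CUTOFF = 0.05  # same threshold as in the list above

# Hypothetical individuals and pairwise relatedness estimates.
ids = ["id1", "id2", "id3", "id4", "id5"]
relatedness = {
    ("id1", "id2"): 0.30,
    ("id2", "id3"): 0.08,
    ("id4", "id5"): 0.01,
}

# Pairs whose estimated relatedness exceeds the cutoff.
related_pairs = [pair for pair, r in relatedness.items() if r > CUTOFF]


def drop_one_per_pair(ids, pairs):
    """Drop one member of each related pair (arbitrarily the second),
    skipping pairs already broken up by an earlier removal."""
    removed = set()
    for a, b in pairs:
        if a in removed or b in removed:
            continue
        removed.add(b)
    return [i for i in ids if i not in removed]


def drop_both(ids, pairs):
    """Drop every individual that appears in any related pair."""
    removed = {i for pair in pairs for i in pair}
    return [i for i in ids if i not in removed]


kept_one = drop_one_per_pair(ids, related_pairs)
kept_none = drop_both(ids, related_pairs)
print("one per pair dropped:", kept_one)
print("both dropped:        ", kept_none)
```

On this toy input, dropping both members keeps fewer individuals than dropping one per pair, so the two strategies give different final sample sizes.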
First, I made a filtered bed file by applying the first four options in PLINK (all individuals were retained; the SNP count went down to 391335). Then I used the new bed file to estimate the GRM. Finally, I ran REML with the last option. This resulted in the following:
Source    Variance    SE
V(G)      0.231562    0.015219
V(e)      0.000000    0.013046
Vp        0.231562    0.005012
V(G)/Vp   0.999999    0.056338
logL      1247.802
logL0     1061.235
LRT       373.135
df        1
Pval      0
n         4677
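As a quick sanity check on these numbers, the short Python sketch below recomputes Vp, V(G)/Vp and the LRT from the reported values, assuming the usual definitions Vp = V(G) + V(e) and LRT = 2 × (logL − logL0). The variable names are illustrative only. Differences in the last digit are expected because the inputs are already rounded.

```python
# Values copied from the REML output above.
v_g = 0.231562      # V(G)
v_e = 0.000000      # V(e)
log_l = 1247.802    # logL, full model
log_l0 = 1061.235   # logL0, model without the genetic component

v_p = v_g + v_e               # phenotypic variance
ratio = v_g / v_p             # V(G)/Vp
lrt = 2 * (log_l - log_l0)    # likelihood ratio test statistic

print("Vp      =", round(v_p, 6))
print("V(G)/Vp =", round(ratio, 6))
print("LRT     =", round(lrt, 3))
```

The recomputed values agree with the table up to rounding. With V(e) reported as 0.000000, all of Vp is attributed to V(G), which is why V(G)/Vp sits at essentially 1.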
In the paper, there were 3833 individuals and 322142 SNPs retained. I'm not really sure whether my approach in QC was correct. Could you please point out if I missed anything? Thank you very much! RR
Post by Jian Yang on Jul 29, 2016 1:02:43 GMT
You might also remove SNPs with a significant difference in missingness rate between cases and controls.
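For illustration, here is a minimal Python sketch of one way to test a single SNP for such a difference: a two-sided Fisher's exact test on the 2×2 table of missing versus called genotypes in cases and controls. The counts are made up, and the function name is hypothetical. PLINK offers a per-SNP test of this kind through its --test-missing option, which is likely the more practical route on real data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-7))


# Hypothetical counts for one SNP (not taken from the WTCCC data).
cases_missing, cases_called = 60, 1940
controls_missing, controls_called = 30, 2970

p_value = fisher_exact_two_sided(cases_missing, cases_called,
                                 controls_missing, controls_called)
print("differential missingness p-value:", p_value)
```

SNPs whose p-value falls below a chosen threshold would then be added to the exclusion list; the threshold itself is a judgment call that this thread does not specify.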
Post by rr on Aug 2, 2016 7:54:15 GMT
Thank you for the suggestion. I did as you advised and, instead of using --grm-cutoff, found all pairs with an estimated relationship > 0.05 and removed each such pair from the list of individuals. (I'm not sure, but the description of --grm-cutoff says it removes one of a pair of individuals. Does it remove only one individual in the pair, or both?) I also removed SNPs with CHR = 0 or 23 in the bim file. In the end, I got 363393 SNPs and 4289 individuals (still different from the paper).

I was wondering: does it matter if I remove some individuals/SNPs first and then compute the missingness or allele-frequency statistics (i.e., might the missing rate/MAF change)? If so, what approach would you suggest? For example, is it recommended to filter everything at once, by running --maf, --hwe, --geno and --mind together? Or should it be sequential: run --maf first, then --hwe on the output, then --geno on the second output, and so on?
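To make the ordering question concrete, here is a small Python sketch on a made-up genotype matrix (0/1/2 allele counts, None for a missing call). It compares computing the per-SNP missing rate on the unfiltered data with computing it after individuals with high missingness have been removed. The thresholds are deliberately loose toy values, not the ones used above, and all names are illustrative. The sketch says nothing about the order PLINK itself uses when several filters are given in one command; that should be checked in the PLINK documentation.

```python
MIND = 0.5  # toy per-individual missingness threshold
GENO = 0.1  # toy per-SNP missingness threshold

# Hypothetical data: individual ID -> genotypes at 4 SNPs.
genotypes = {
    "A": [0, 1, 2, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 2, 1, 0],
    "D": [0, 0, 1, None],
    "E": [None, None, None, 1],
}


def individual_missing_rate(row):
    return sum(g is None for g in row) / len(row)


def snp_missing_rate(data, j):
    return sum(row[j] is None for row in data.values()) / len(data)


def maf(data, j):
    """Minor allele frequency at SNP j among non-missing calls."""
    called = [row[j] for row in data.values() if row[j] is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def snps_passing_geno(data):
    n_snps = len(next(iter(data.values())))
    return [j for j in range(n_snps) if snp_missing_rate(data, j) <= GENO]


# (a) SNP statistics computed on the unfiltered data.
snps_all_at_once = snps_passing_geno(genotypes)

# (b) Individual filter first, then SNP statistics on the remaining individuals.
kept = {iid: row for iid, row in genotypes.items()
        if individual_missing_rate(row) <= MIND}
snps_sequential = snps_passing_geno(kept)

print("SNPs kept, stats on unfiltered data:", snps_all_at_once)
print("SNPs kept, individuals filtered first:", snps_sequential)
print("MAF of SNP 3 before/after:", maf(genotypes, 3), maf(kept, 3))
```

In this toy case, removing the poorly genotyped individual first lowers the missing rate of several SNPs and changes the MAF of another, so the set of retained SNPs depends on the order in which the statistics are computed.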