Combining two imputed datasets

Post by swvanderlaan on Apr 28, 2016 21:52:39 GMT

Hi,

We genotyped our study samples in two batches, one on the Affymetrix SNP5 array and one on the Affymetrix Axiom CEU array. The samples were obtained between 2002 and 2013: most of the SNP5 batch was obtained between 2002 and 2007, while most of the AxiomCEU batch was obtained between 2007 and 2013.
The QC of the two genotyped datasets and the subsequent imputation against 1000G were done per batch. We then merged the two imputed datasets into one dataset. For the analyses we calculated PCs again on the two genotyped datasets separately. In trait analyses we usually correct for age, sex, these PCs, and batch.
This time I want to estimate the heritability of 7 traits using the data described above. I wanted to use the combined imputed dataset, and before running GCTA on it I:
- made PLINK-style files
- filtered out variants with MAF < 0.005, HWE p < 1e-6, or info score < 0.95.

I then created a GRM, again on the merged dataset. Now I have a couple of related questions.
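
For readers unfamiliar with the GRM, the sketch below shows what its entries are in the simplest case. It is not GCTA's implementation: it assumes an unweighted GRM built from standardized 0/1/2 allele counts, uses the same standardized formula for the diagonal (GCTA treats the diagonal slightly differently), and runs on simulated genotypes. The function name `simple_grm` and all data are hypothetical.

```python
import numpy as np

def simple_grm(genotypes):
    """Unweighted GRM from an (individuals x SNPs) matrix of 0/1/2 allele counts.

    Assumption: each SNP is standardized by its sample allele frequency p,
    z = (x - 2p) / sqrt(2p(1 - p)), and the GRM is the average of z_j * z_k
    over SNPs. This is a simplified illustration, not GCTA's code.
    """
    x = np.asarray(genotypes, dtype=float)
    p = x.mean(axis=0) / 2.0            # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)            # drop monomorphic SNPs (zero variance)
    x, p = x[:, keep], p[keep]
    z = (x - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / z.shape[1]

# Simulated, unrelated individuals (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=2000)
geno = rng.binomial(2, freqs, size=(50, 2000))

grm = simple_grm(geno)
diag = np.diag(grm)                               # one value per individual
offdiag = grm[np.triu_indices_from(grm, k=1)]     # one value per pair

print("mean diagonal:", diag.mean())
print("mean off-diagonal:", offdiag.mean())
```

In this simplified form, each diagonal value reflects how an individual's own genotypes compare with what the allele frequencies predict (close to 1 for a typical individual in a homogeneous sample), and each off-diagonal value is the genotype similarity of a pair relative to the sample average (close to 0 for unrelated pairs).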

1) What do the diagonals and the off-diagonals mean in non-statistical, layman's terms? (I am a medical biologist, not a statistician.) From the paper I could determine (I think) that the off-diagonals are a measure of relatedness.
2) What does it mean when I see a bimodal distribution of the diagonals and a skewed distribution of the off-diagonals? Could it be that the GRM calculation is so sensitive that the two batches can in fact be discerned in the distribution of the diagonals? And does the skewed off-diagonal distribution mean that there is some relatedness among individuals? I don't understand that completely: after all, our rigorous QC also included PCA, and the two batches are actually quite homogeneous (and of European descent). Compared to supplemental figure 1 of your paper ("Common SNPs explain a large proportion of the heritability for human height", NG 2010) my plots look different, which concerns me because I intuitively expected a normal distribution, although I can't fully put that intuition into words.
3) It is not entirely clear to me from the website: is it possible to correct for covariates while creating the GRM? Perhaps that could circumvent the issue of the bimodal distribution?
4) I wanted to use the imputed data, as it would mean a significant increase in variants and thus (I thought) power. But in light of the above, I might be wrong in this line of thinking. Would it be wise to switch to the QC'd genotyped datasets and estimate the heritability of the 7 traits in each of the two batches? An advantage would also be that I'd have in essence a validation dataset: batch 1 would be my test case, batch 2 would be my validation case. Different sampling period, different genotyped SNPs, but I wouldn't expect the underlying genetics of the 7 traits to differ over a timespan of ~11 years.
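
Regarding question 2, one simple way to check whether the two modes of the diagonal line up with the two genotyping batches is to summarize the diagonal values per batch. The sketch below is only an illustration: the helper `summarize_by_batch` is hypothetical, and the values are simulated toy numbers, not taken from the attached plots. Loading the actual GRM diagonal and the batch labels is left out.

```python
import numpy as np

def summarize_by_batch(values, batch):
    """Return n, mean and SD of `values` for each distinct label in `batch`."""
    values = np.asarray(values, dtype=float)
    batch = np.asarray(batch)
    summary = {}
    for label in np.unique(batch):
        v = values[batch == label]
        summary[str(label)] = {
            "n": int(v.size),
            "mean": float(v.mean()),
            "sd": float(v.std(ddof=1)),
        }
    return summary

# Simulated toy diagonals: two batches with slightly shifted means.
# These numbers are invented purely to illustrate the check.
rng = np.random.default_rng(1)
toy_diag = np.concatenate([
    rng.normal(1.00, 0.01, 300),   # stand-in for the SNP5 batch
    rng.normal(1.05, 0.01, 200),   # stand-in for the AxiomCEU batch
])
toy_batch = np.array(["SNP5"] * 300 + ["AxiomCEU"] * 200)

batch_summary = summarize_by_batch(toy_diag, toy_batch)
for label, stats in batch_summary.items():
    print(label, stats)
```

If the per-batch means are clearly separated while each batch on its own looks unimodal, that would suggest the bimodality reflects batch rather than biology.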

Here's a PDF of these plots.

aegscombo1kGRAW.GCTA.pdf (8.67 KB)

Many thanks and best,

Sander

Post by Jian Yang on May 5, 2016 4:42:36 GMT