Post by nikolay on Jun 6, 2014 14:37:09 GMT
Dear Jian,
I am new to GCTA and to the whole topic of heritability estimation in general, because I come from the natural sciences side, more specifically from theoretical physics. I am currently going through your papers trying to understand the mathematical formalism underlying GCTA, and I have many questions for you. May I start with a question that is not directly connected to GCTA, but to the methodology of heritability estimation?
When my colleagues first asked whether I could estimate the heritability explained by our GWAS, I thought it was very easy, since one can just take a few (hundred?) top SNPs from the GWAS list, insert them into a multiple regression model, and calculate the coefficient of determination R^2 = 1 - Sum(y_i - f_i)^2 / Sum(y_i - mean(y))^2, where y_i are the phenotype values and f_i are the values predicted by the fitted linear model f_i = a*x_i + b. Then we can add SNPs one by one from the GWAS list, following the p-value ranking, and see how the adjusted R^2 changes and where it reaches a plateau. For our GWAS dataset the adjusted R^2 plateaued at 0.74, which in my opinion should tell us that our GWAS explained 74% of the heritability.
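[Editor's note: the incremental procedure described above can be sketched as follows on simulated data. This is an illustration with made-up numbers, not the original analysis; the absolute marginal correlation with the phenotype stands in for the GWAS p-value ranking, since the two orderings are equivalent for a single-SNP test.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 50                        # individuals, SNPs (illustrative sizes)
maf = rng.uniform(0.1, 0.5, m)
X = rng.binomial(2, maf, size=(n, m)).astype(float)

# simulated phenotype: 5 causal SNPs plus noise
beta = np.zeros(m)
beta[:5] = 0.5
y = X @ beta + rng.normal(size=n)

# rank SNPs by in-sample marginal association (|correlation| as a p-value proxy)
r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(m)])
order = np.argsort(-r)

def adjusted_r2(Xs, y):
    """Adjusted R^2 of the multiple regression of y on the columns of Xs."""
    n, k = Xs.shape
    A = np.column_stack([np.ones(n), Xs])           # intercept + SNPs
    resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# add SNPs one by one in ranked order and watch the curve plateau
curve = [adjusted_r2(X[:, order[:k]], y) for k in range(1, 21)]
```

The curve rises steeply while the (causal) top SNPs are being added and then flattens, which is the plateau behaviour described in the post. As the following replies explain, doing this in the discovery sample itself inflates the plateau value.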
Later, when I read your GCTA paper, I realized that you do not simply calculate R^2 but use a relatedness estimate between individuals to address heritability. Moreover, you use the whole GWAS set of variants, while one could expect that only the top variants (the ones with the lowest p-values of association with the phenotype) should explain most of the heritability. Otherwise, if we include too many SNPs from the bottom of the GWAS list in the heritability analysis, we enter the noise zone, where most of the SNPs are not informative since they have no association with the phenotype. The question would therefore be whether we can find an optimal number of SNPs that gives a maximum of explained heritability. In other words, if we plot heritability as a function of the number of SNPs added to the analysis, this function should peak at some optimal number of SNPs.
I am a bit puzzled that I do not see any mention of an "optimal number of SNPs" in your papers. Perhaps I misunderstand something, and I hope you can explain this to me. Likewise, looking through the online methods in your papers, I got the impression that GCTA estimates the total genetic contribution to the phenotype rather than the genetic contribution explained by the GWAS hits, doesn't it? In addition, could you please comment on the above-described procedure for calculating R^2 and let me know whether my understanding is correct? My colleagues and I would very much appreciate it if you could clarify these issues. Thank you!
Best wishes, Nikolay.
Post by Jian Yang on Jun 7, 2014 12:21:16 GMT
Hi Nikolay,
If the SNPs are selected by association p-values, the estimate of the variance explained by these selected SNPs in the same sample will be inflated, whether you use multiple regression or any other estimation analysis. This is due to the 'winner's curse' issue. For example, if you perform a GWAS for a random trait (without any genetic effects) and select the top 100 SNPs based on the association results, you will be surprised by how large a proportion of phenotypic variance can be explained by these "top associated SNPs". We have theory to quantify this (Wray et al. 2013 Nature Reviews Genetics 14, 507-515).
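[Editor's note: the thought experiment above is easy to reproduce. The sketch below uses made-up sample sizes; even for a purely random trait with zero true heritability, the top 100 in-sample "hits" appear to explain a large share of the variance.]

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, top = 1000, 5000, 100
X = rng.binomial(2, 0.3, size=(n, m)).astype(float)
y = rng.normal(size=n)                # purely random trait: true heritability is 0

# "GWAS": score every SNP by its in-sample association with y
# (rank-equivalent to ranking by p-value for a single-SNP test)
Xc = (X - X.mean(0)) / X.std(0)
score = np.abs(Xc.T @ (y - y.mean()))
order = np.argsort(-score)[:top]

# multiple regression on the top 100 "associated" SNPs, in the SAME sample
A = np.column_stack([np.ones(n), X[:, order]])
resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
# r2 comes out far above 0 even though the trait has no genetic component at all
```

This is exactly the winner's curse: the selection step has already optimized the in-sample fit, so the same sample can no longer give an honest estimate.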
The method behind GCTA-GREML is that instead of testing the associations for individual SNPs, we can actually fit the effects of all SNPs as random effects in a mixed linear model and estimate a single parameter, i.e. the variance explained by all SNPs or SNP-heritability. To reconcile this analysis with the traditional pedigree-based heritability estimation analysis, we showed by theory that the SNP-based model is equivalent to an individual-based model with genetic relatedness estimated from SNPs.
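[Editor's note: the SNP-level/individual-level equivalence stated above can be verified numerically. The sketch below assumes the GCTA-style genotype standardization w = (x - 2p)/sqrt(2p(1-p)) and, for simplicity, uses the true allele frequencies, which in practice are estimated from the sample.]

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 1000
p = rng.uniform(0.1, 0.5, m)
X = rng.binomial(2, p, size=(n, m)).astype(float)

# standardize genotypes: w_ij = (x_ij - 2 p_j) / sqrt(2 p_j (1 - p_j))
W = (X - 2 * p) / np.sqrt(2 * p * (1 - p))

# SNP-level random-effects model: y = W u + e, u ~ N(0, sigma_u^2 I)
# => var(y) = W W' sigma_u^2 + I sigma_e^2
sigma_u2, sigma_e2 = 0.001, 0.5
V_snp = W @ W.T * sigma_u2 + np.eye(n) * sigma_e2

# individual-level model: y = g + e, var(g) = A sigma_g^2, with GRM A = W W'/m
A = W @ W.T / m
sigma_g2 = m * sigma_u2               # the single parameter GREML estimates
V_ind = A * sigma_g2 + np.eye(n) * sigma_e2

print(np.allclose(V_snp, V_ind))      # True: identical covariance structures
```

So estimating sigma_g^2 against the GRM is the same model as fitting all SNP effects as random effects; nothing per-SNP ever needs to be solved for.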
Cheers, Jian
Post by nikolay on Jun 12, 2014 13:06:45 GMT
Hi Jian,
thanks a lot for your reply and for the reference to the excellent paper by Wray et al.! The mention of the winner's curse was very helpful, and I fully agree with you: we had serious spurious inflation of the adjusted R^2 because we did the heritability estimate in the discovery sample itself. Indeed, in doing the association test we had essentially already minimized the residuals of the linear model that we then used to calculate R^2, so of course R^2 was very high, because its calculation was fed those minimized residuals. Do I understand you (and the paper by Wray et al.) correctly that the right way is to calculate the adjusted R^2 (do the linear fitting) for the top associated SNPs in a cohort different from the discovery cohort, thereby getting rid of the winner's curse?
I also would like to ask you about the SNP-heritability estimate in GCTA. I cannot understand how you fit all SNPs (say, a few million) with a limited number of individuals (say, a few thousand). The number of observations/individuals limits the number of variables that can be included in the linear model; otherwise, in a multiple linear regression, you end up with an underdetermined system that has more unknowns than equations.
Another concern of mine is that when you fit all SNPs you include a lot of "non-informative" SNPs, e.g. ones that are in high LD (or not causal) and therefore should not contribute to R^2. That would be fine with an infinite number of individuals. With a limited number of individuals, however, you cannot afford to include too many non-informative/noisy SNPs, since each noisy SNP added reduces the residual degrees of freedom by one, and R^2 does not get a chance to increase much before the degrees of freedom are exhausted.
I agree that we should estimate heritability using only causal SNPs. However, including all SNPs in the genome in the linear model brings a lot of noise and spoils the model. On the other hand, estimating heritability with the top GWAS-associated SNPs seems more reasonable to me, since the associations point to variants that are potentially causal. Moreover, the variants included in the linear model should be independent, since adding variants in LD with each other does not improve R^2. To this end, if one does harsh LD pruning (with, say, an r^2 cutoff of 0.1) of the top GWAS-associated variants before including them in the linear model, it ensures that only independent/informative variants are fitted.
I would appreciate it if you could comment on my argument about adding noise to the linear model by fitting all variants. Thank you!
Best wishes, Nikolay.
Post by Jian Yang on Jun 18, 2014 2:58:01 GMT
A fixed-effects model can fit more parameters (p) than observations (n), but then there is no unique solution: some solutions have very large positive estimates for some SNPs and very large negative estimates for others. This does not seem reasonable, because we hold the (vague) prior belief that many large SNP effects cancelling each other out in the observed genotypes is an unlikely situation.
Two equivalent ways to explain fitting more parameters than observations (p > n) are:
1. The SNP effects are assumed to be drawn from a normal distribution, and this prior knowledge makes it possible to obtain a unique solution.
2. A random-effects model is like maximising a penalised log-likelihood, where the penalty depends on the distribution of the true SNP effects. If that distribution is normal, the penalty is the sum of the squared SNP effects, so the unique solution keeps any single SNP effect from being too big.
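[Editor's note: both points can be seen in a small ridge-regression sketch, ridge being the penalised form referred to in point 2; the numbers below are illustrative only.]

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 200                        # more SNPs than individuals (p > n)
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = 1.0
y = X @ beta_true + rng.normal(size=n)

# unpenalised least squares with p > n: the data are fitted exactly, and
# infinitely many coefficient vectors achieve that fit; lstsq merely picks
# the minimum-norm one out of that infinite family
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# penalised likelihood / ridge: minimise ||y - Xb||^2 + lam * ||b||^2,
# equivalent to the prior b ~ N(0, sigma^2/lam) -> a unique solution
lam = 10.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# the penalty keeps the coefficients from being too big (shrinkage)
print(np.linalg.norm(b_ridge) <= np.linalg.norm(b_ols))   # True
```

The exact in-sample fit of the unpenalised solution is the overfitting problem; the sum-of-squares penalty resolves the non-uniqueness and bounds the effect sizes, which is what the random-effects assumption buys.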
Of course, the best model would fit the effect sizes of the causal variants, which we don't know. Adding null SNPs to the model will not bias the estimate, but it will increase the uncertainty of the estimate (its standard error); that is why we usually need a very large sample size for a GREML analysis.
Selecting candidate SNPs is one way to go but this needs to be done in an independent sample.
Post by nikolay on Jun 26, 2014 12:18:23 GMT
Hi Jian,
1) I really appreciate your extremely helpful comments! I have a question about calculating explained heritability in an independent sample using candidate SNPs. From the paper by Wray et al., I got the impression that re-estimating candidate SNP effects in an independent sample can lead to an overestimate of the adjusted R^2. To my understanding, however, this is not due to the winner's curse, because SNPs that are top associated in the discovery sample might show very poor association (large residuals) in the validation sample, so one can obtain a very small adjusted R^2 there. Why, then, should this approach be biased?
2) Regarding the way explained heritability was estimated in the "height" paper, Nature 467, 832-838 (2010), where the phenotype of each individual in the validation sample was predicted using effects from the discovery sample: what do you think this approach essentially misses? In the "height" paper, SNPs down to p-value = 5*10^(-2) were included to predict phenotypes, which is similar to GCTA in that nearly all SNPs are used, yet the "height" paper explained 16% of the variance via the adjusted R^2 while GCTA explained 45% of the variance via the GREML approach. What exactly in the methodology makes such a big difference between the two methods?
3) Another, somewhat unrelated question concerns the way you define the genetic relationship matrix in GCTA. I do not fully understand why you subtract twice the frequency of the reference allele from the number of copies of the reference allele, i.e. x - 2*p, for each individual. I have a vague feeling that this is like subtracting a mean that comes from a binomial distribution, with p somehow the probability of "success", but I do not fully understand why 2*p is the mean of x (by the way, do we have only 2 observations?), or what counts as an "observation" and a "success" in this binomial distribution. Could you please explain this to me? Thank you!
Cheers, Nikolay.
Post by Jian Yang on Jun 28, 2014 13:00:17 GMT
Re 1) The estimate of the variance explained by SNPs selected from a discovery sample is unbiased when it is obtained in an independent validation sample.
Re 2) Prediction and estimation are two distinct concepts. To achieve high prediction accuracy, the effect sizes of all associated SNPs need to be estimated with little error, whereas in an estimation analysis we estimate only a single parameter, i.e. the variance explained. Prediction accuracy therefore increases with the discovery sample size, whereas the variance explained doesn't (only its SE decreases with sample size).
Re 3) x (coded as 0, 1 or 2) ~ Binomial(2, p), so E(x) = 2p and var(x) = 2p(1-p). The two "observations" are the two allele copies each individual carries at the locus, and a "success" is that copy being the reference allele.
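[Editor's note: these moments are easy to check by simulating the genotype as the sum of its two allele draws; a purely illustrative sanity check, not part of GCTA.]

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3                               # reference-allele frequency
N = 200_000                           # number of simulated individuals

# each genotype is the sum of two independent allele draws (the two "trials"),
# where "success" = that chromosome copy carrying the reference allele
alleles = rng.binomial(1, p, size=(N, 2))
x = alleles.sum(axis=1)               # genotype coded 0, 1 or 2

print(x.mean())                       # close to 2p       = 0.6
print(x.var())                        # close to 2p(1-p)  = 0.42
```

Subtracting 2p therefore centers the genotype at zero, and dividing by sqrt(2p(1-p)) gives it unit variance, which is exactly the standardization used in building the GRM.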
Post by nikolay on Jun 2, 2015 8:10:41 GMT
Hi Jian,
first of all, I would like to express my gratitude for your nice explanations in reply to my posts, which helped me a lot to understand GCTA much better! I would like to revisit our conversation and hope you will find time to answer one more question.
I have recently started using sparse linear models (lasso, ridge and elastic net) as a way to avoid overfitting when we have more variables than observations. As you probably know, these models perform an ordinary least squares (OLS) minimization of the residuals with a penalty on the SNP effects (the sum of squared effects for ridge, the sum of absolute effects for lasso) added to the OLS objective, which puts a constraint on the SNP effects, i.e. does not allow large SNP effects.
I recall that you mentioned that another way to avoid overfitting is to model the SNP effects as random effects. I understood random-effects modelling as assuming that each individual's genotype is measured with an error, so that there is a distribution for each genotype, but I cannot see how that relates to putting a constraint on the SNP effects, which is the essence of ridge/lasso regression. In other words, could you briefly explain to me (or point me to a good paper on) why, using random effects, we need not worry about overfitting? I would greatly appreciate your reply, thank you!
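[Editor's note: the connection asked about here is the classical equivalence between ridge regression and the BLUP of normally distributed random effects, with penalty lam = sigma_e^2/sigma_u^2. It can be verified numerically; the variance values below are arbitrary.]

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 100, 300                       # fewer individuals than SNPs
W = rng.normal(size=(n, m))           # stand-in for standardized genotypes
y = rng.normal(size=n)

sigma_u2, sigma_e2 = 0.01, 1.0
lam = sigma_e2 / sigma_u2             # ridge penalty implied by the variance ratio

# ridge: minimise ||y - W b||^2 + lam ||b||^2
b_ridge = np.linalg.solve(W.T @ W + lam * np.eye(m), W.T @ y)

# BLUP of the random SNP effects in y = W u + e, u ~ N(0, sigma_u2 I):
# u_hat = sigma_u2 * W' V^{-1} y, with V = W W' sigma_u2 + I sigma_e2
V = W @ W.T * sigma_u2 + np.eye(n) * sigma_e2
u_blup = sigma_u2 * W.T @ np.linalg.solve(V, y)

print(np.allclose(b_ridge, u_blup))   # True: identical effect estimates
```

So treating SNP effects as random draws from N(0, sigma_u^2) is mathematically the same constraint as the ridge penalty, which is why the random-effects model does not overfit even when p > n.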
Best wishes, Nikolay.
Post by erika on Oct 10, 2017 0:56:15 GMT
Dear Jian, you mentioned that selecting the top associated SNPs will introduce bias. In that case, do you recommend putting all the SNPs that pass QC into GCTA? And what happens if we want to estimate the heritability of a single variant or a single locus? Thanks! Erika
Post by Jian Yang on Oct 17, 2017 22:24:04 GMT
Please see Supplementary Note 8 of Yang et al. 2017 Nat Genet (www.nature.com/ng/journal/v49/n9/abs/ng.3941.html). If the SNPs are selected from a dataset that is independent of the dataset for the GCTA-GREML analysis, then there will be no bias. However, I would not use GCTA-GREML to estimate the variance explained by a single SNP, although in principle it should work OK.