Population stratification not being conducted properly?

eoind
New Member


Post by eoind on Dec 15, 2015 16:56:18 GMT

Hi all,

I want to run an association analysis between 97,000 SNPs and a quantitative trait (a percentage) in 3,000 samples divided into 60 dog breeds. Specifically, I want to account for population stratification so that breed differences do not produce spurious SNP-trait associations.

My data set is in standard PLINK format, the species is dog, and the phenotype file looks like this:

FAMID INDID Percent
PFZ1E02 PFZ1E02 0.0190609818983
PFZ1C03 PFZ1C03 0.0316493529134
PFZ1F03 PFZ1F03 0.0316493529134
PFZ1G03 PFZ1G03 0.0316493529134
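
A quick summary of the phenotype column can be a useful sanity check before running any association test. The sketch below is only an illustration: the function name `summarize_phenotypes` is hypothetical, and it assumes a whitespace-delimited file whose third column is the trait, with any header row skipped because its third field is not numeric. It is run here on the four sample rows shown above, three of which share the same value.

```python
import statistics


def summarize_phenotypes(lines):
    """Summarize the third column of a whitespace-delimited phenotype file."""
    values = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            values.append(float(fields[2]))
        except ValueError:
            continue  # skip a header row such as "FAMID INDID Percent"
    if not values:
        raise ValueError("no numeric phenotype values found")
    return {
        "n": len(values),
        "distinct": len(set(values)),
        "mean": statistics.fmean(values),
        "variance": statistics.variance(values) if len(values) > 1 else 0.0,
    }


# The four sample rows quoted in the post, including the header line.
sample = [
    "FAMID INDID Percent",
    "PFZ1E02 PFZ1E02 0.0190609818983",
    "PFZ1C03 PFZ1C03 0.0316493529134",
    "PFZ1F03 PFZ1F03 0.0316493529134",
    "PFZ1G03 PFZ1G03 0.0316493529134",
]
summary = summarize_phenotypes(sample)
print(summary)
```

For the full file, the same function could be called on an open file handle, for example `summarize_phenotypes(open("pheno.dat2"))`. Comparing the number of distinct values with the number of samples shows how much the trait varies between individuals.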

I ran this set of commands:

1. Make a genetic relationship matrix, accounting for the fact that I'm using dog breeds:

./gcta_1/gcta64 --bfile Data --make-grm --autosome-num 38 --autosome

2. Calculate the first 20 principal component axes:

./gcta_1/gcta64 --grm gcta.grm --pca 20 --out eigen

3. Include the principal components as quantitative covariates in the association analysis, to identify associations between the trait and SNPs while accounting for breed differences.

./gcta_1/gcta64 --mlma --bfile Data --qcovar eigen.eigenvec --grm gcta --pheno pheno.dat2 --out test_gcta

The problem is that in my output, almost all of my SNPs are significant (P < 1e-10).

Sample of output file:
Chr SNP bp A1 A2 Freq b se p
1 BICF2P1383091 212740 A G 0.0208885 0.0972874 0.000891171 0
1 BICF2G630707908 273487 A G 0.161405 0.0190486 0.000217773 0
1 BICF2P41862 390563 A G 0.123784 -0.044304 0.000407207 0
1 BICF2G630707932 420036 G A 0.465413 -0.0385957 0.000249306 0

I've clearly done something wrong: I do not think it is possible for almost all of my 97,000 SNPs to have extremely low p-values for association with my trait after accounting for population stratification.

Could someone please tell me the correct command that I should have used? The command should read in a bed/bim file and a phenotype file (formatted as in the example above), correct for any potential population stratification, and run an association analysis between my quantitative trait and the set of SNPs.

Thanks.

Jian Yang
Administrator


Post by Jian Yang on Dec 15, 2015 23:11:23 GMT

1) You might check if you are using the latest version

2) You don't need to fit PCs once you have already fit the whole GRM
gcta64 --mlma --bfile Data --grm gcta --pheno pheno.dat2 --out test_gcta

3) You might compare the result with PLINK-linear
plink --linear --bfile Data --pheno pheno.dat2 --out test_plink

eoind
New Member


Post by eoind on Dec 16, 2015 11:09:05 GMT

Thank you for the reply.

1. I obtained the software from cnsgenomics.com/software/gcta/download.html and am using gcta_1.25.1.zip, which I believe is the most recent (8 Dec 2015).

2. I ran the command as suggested: gcta64 --mlma --bfile Data --grm gcta --pheno pheno.dat2 --out test_gcta (almost exactly copied and pasted, only changed the --bfile name).

In case it helps, the output printed to screen regarding calculations is:

Performing REML analysis ... (Note: may take hours depending on sample size).
3645 observations, 1 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...
Updated prior values: 0.00035625 0.000356252
logL: -0.0142184
Running AI-REML algorithm ...
Iter. logL V(G) V(e)
1 -0.01 221.82734 0.00000 (1 component(s) constrained)
2 -29099.43 221.82734 0.00006
3 -29099.43 221.82734 0.00013
Log-likelihood ratio converged.

Calculating allele frequencies ...
Recoding genotypes (individual major mode) ...
Running association tests for 97055 SNPs ...
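
The variance components in the log above can be put side by side with a little arithmetic. This is a minimal sketch using only the numbers printed in the log (the final AI-REML iteration and the two EM-REML prior values); the function name `genetic_variance_share` is hypothetical, and the calculation does not by itself identify a cause.

```python
def genetic_variance_share(v_g, v_e):
    """Fraction of the fitted variance attributed to the genetic component."""
    return v_g / (v_g + v_e)


# Values copied from the log: last AI-REML iteration and the EM-REML priors.
v_g, v_e = 221.82734, 0.00013
prior_total = 0.00035625 + 0.000356252

share = genetic_variance_share(v_g, v_e)
ratio_to_priors = v_g / prior_total

print(f"V(G) / (V(G) + V(e)) = {share:.7f}")
print(f"V(G) / (sum of prior values) = {ratio_to_priors:.0f}")
```

In this log the fitted V(G) is several orders of magnitude larger than the prior values, and V(e) is close to zero.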

3. I did test with PLINK (I ran almost exactly this command: plink --linear --bfile Data --pheno pheno.dat2 --out test_plink), and the results are different.

I've pulled out three sample SNPs to show the difference in p-values:

CHR SNP BP A1 TEST NMISS BETA STAT P
1 BICF2G630708903 5693910 A ADD 3644 -0.00116 -1.124 0.2611
1 BICF2G630708810 5648072 A ADD 3638 -0.0001593 -0.2142 0.8304
1 TIGRP2P7837_rs8513845 8371377 A ADD 3607 -0.0003678 -0.6227 0.5335

From GCTA, for the same 3 SNPs, the results are:
Chr SNP bp A1 A2 Freq b se p
1 BICF2G630708903 5693910 A G 0.0805434 0.0192038 0.000454963 0
1 BICF2G630708810 5648072 A C 0.178257 0.00212381 0.000399292 1.04377e-07
1 TIGRP2P7837_rs8513845 8371377 A G 0.267397 -0.00229019 0.000175264 5.07984e-39
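
The GCTA p-values above can be cross-checked against the b and se columns. The sketch below assumes the p-value is a two-sided test on b/se (equivalently a 1-df chi-square on (b/se)^2); that is an assumption about the output, not something stated in the thread, and the function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(b, se):
    """Two-sided normal p-value for the statistic b / se."""
    z = abs(b / se)
    return math.erfc(z / math.sqrt(2.0))


# (SNP, b, se, p as reported by GCTA) for the three SNPs quoted above.
rows = [
    ("BICF2G630708903", 0.0192038, 0.000454963, 0.0),
    ("BICF2G630708810", 0.00212381, 0.000399292, 1.04377e-07),
    ("TIGRP2P7837_rs8513845", -0.00229019, 0.000175264, 5.07984e-39),
]

recomputed = {snp: wald_p_value(b, se) for snp, b, se, _ in rows}
for snp, b, se, reported in rows:
    print(snp, f"z={b / se:.2f}", f"recomputed={recomputed[snp]:.5g}", f"reported={reported:.5g}")
```

Under that assumption, the very small p-values follow from the standard errors being tiny relative to the effect estimates, and a p-value of 0 corresponds to a statistic too large to represent in double precision.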

But the problem is that I can't use that PLINK command, as I don't believe PLINK can adequately account for population stratification in a way that is not ad hoc.

So I have two questions:

1. If you have any idea what I'm doing wrong, or why I am getting so many significant SNPs from GCTA, that would be fantastic.

2. I know other software packages print a genomic control lambda value to the screen, and if lambda is < 1.1, population stratification is considered adequately accounted for. How do I know that population stratification has been properly accounted for using this software? For example, were it not for the fact that almost all 97,000 SNPs cannot plausibly be associated with my trait, I might have believed my result if only one or two SNPs had been significantly associated. Is there a score or statistic that can be calculated to confirm that population stratification has been accounted for?
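
The genomic control lambda mentioned in question 2 can be computed from any results table as the median association chi-square divided by the median of a 1-df chi-square distribution (about 0.455). The following is a minimal sketch, not a feature of GCTA: the function names are hypothetical, and it assumes a whitespace-delimited table with a header row containing `b` and `se` columns, as in the GCTA output quoted earlier in the thread.

```python
import math
from statistics import NormalDist, median

# Median of a 1-df chi-square: the squared 75th percentile of a standard normal.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def chi2_from_table(lines, effect_col="b", se_col="se"):
    """Return (effect / se)^2 for every data row of a results table."""
    header = None
    stats = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if header is None:
            header = fields
            i_b, i_se = header.index(effect_col), header.index(se_col)
            continue
        try:
            b, se = float(fields[i_b]), float(fields[i_se])
        except (ValueError, IndexError):
            continue  # skip rows with missing or non-numeric entries
        if math.isfinite(b) and math.isfinite(se) and se > 0:
            stats.append((b / se) ** 2)
    return stats


def genomic_inflation(chi2_stats):
    """Genomic control lambda: observed median chi-square over the expected median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Four rows from the first GCTA output sample in the thread (illustration only).
table = [
    "Chr SNP bp A1 A2 Freq b se p",
    "1 BICF2P1383091 212740 A G 0.0208885 0.0972874 0.000891171 0",
    "1 BICF2G630707908 273487 A G 0.161405 0.0190486 0.000217773 0",
    "1 BICF2P41862 390563 A G 0.123784 -0.044304 0.000407207 0",
    "1 BICF2G630707932 420036 G A 0.465413 -0.0385957 0.000249306 0",
]
stats = chi2_from_table(table)
lam = genomic_inflation(stats)
print(f"lambda from {len(stats)} SNPs: {lam:.1f}")
```

Four SNPs are far too few for a meaningful estimate; the value only becomes informative when computed over the full set of SNPs. The statistic is computed from b and se rather than from the p column because p-values reported as 0 cannot be converted back to a chi-square.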

Thanks