MLMA sample size limit | Complex Trait Genetics Forum

mikem
New Member

Posts: 2

Post by mikem on Nov 19, 2015 12:03:20 GMT

I am trying to test the limits of GCTA's --mlma capacity by steadily increasing the number of samples I analyse with a linear mixed model. I am using a pre-computed GRM and have tested the MLM analysis with 12k, 14k, 25k and 40k individuals. Compute time increases non-linearly with sample size (probably to be expected). The GRM is re-computed for each new sample size.

However, when I attempt to run the MLM analysis with 50k samples, I get a memory error:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

I have generally found that, for smaller sample sizes, this error arises when the individuals in the GRM and the input genotype/phenotype files are not perfectly matched. I am running on a compute node with >300 GB of RAM, and the maximum observed memory usage is ~78 GB. One other process on that node has reserved 60 GB of virtual memory. Each run uses 20 CPUs.
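As a rough check on whether memory could be the limit, the size of a single dense n x n matrix can be worked out directly. This is only a back-of-envelope sketch: it assumes double-precision (8-byte) storage, the function name is made up for illustration, and how many such matrices GCTA holds in memory at once during --mlma is not stated anywhere in this thread.

```python
def dense_matrix_gib(n_samples: int, bytes_per_value: int = 8) -> float:
    """Size in GiB of one dense n x n matrix.

    Assumption: values are stored as 8-byte doubles unless told otherwise.
    """
    return n_samples ** 2 * bytes_per_value / 2 ** 30


if __name__ == "__main__":
    # Sample sizes mentioned in the post, plus the 150k target.
    for n in (12_000, 14_000, 25_000, 40_000, 50_000, 150_000):
        print(f"{n:>7} samples: {dense_matrix_gib(n):8.1f} GiB per dense matrix")
```

Under these assumptions a single matrix is about 18.6 GiB at 50k samples and about 167.6 GiB at 150k, so the footprint grows with the square of the sample size and a few simultaneous working copies would add up quickly.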

I'm running the latest version of GCTA (v1.25.0). Whilst I expect compute time to increase, I did not expect the run to fail outright. I previously had failures caused by duplicated IDs (v1.24.7).
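The ID problems mentioned above (individuals not matching between files, duplicated IDs) can be ruled out before starting a long run. A minimal sketch, assuming whitespace-delimited files with no header line whose first two columns are FID and IID, as in a GCTA `.grm.id` file; the function names and the example IDs are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

SampleId = Tuple[str, str]  # (FID, IID)


def parse_ids(lines: Iterable[str]) -> List[SampleId]:
    """Take (FID, IID) from the first two whitespace-delimited columns."""
    ids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            ids.append((fields[0], fields[1]))
    return ids


def duplicated(ids: Iterable[SampleId]) -> Set[SampleId]:
    """Return every ID that occurs more than once."""
    return {sid for sid, count in Counter(ids).items() if count > 1}


def compare_ids(grm_ids: List[SampleId],
                pheno_ids: List[SampleId]) -> Dict[str, Set[SampleId]]:
    """Report IDs present in only one file, plus duplicates within each file."""
    grm_set, pheno_set = set(grm_ids), set(pheno_ids)
    return {
        "only_in_grm": grm_set - pheno_set,
        "only_in_pheno": pheno_set - grm_set,
        "duplicated_in_grm": duplicated(grm_ids),
        "duplicated_in_pheno": duplicated(pheno_ids),
    }


if __name__ == "__main__":
    # In practice, pass open file handles, e.g. parse_ids(open("my.grm.id")).
    grm = parse_ids(["fam1 ind1", "fam2 ind2", "fam2 ind2"])
    pheno = parse_ids(["fam1 ind1 0.52", "fam3 ind3 1.10"])
    for name, ids in compare_ids(grm, pheno).items():
        print(name, sorted(ids))
```

If all four sets come back empty, a mismatch or duplicate is unlikely to be the cause of the failure.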

Is this a natural sample size limit for GCTA, or is there scope to expand it further? My aim is to be able to analyse ~150k individuals with this MLM. I have looked into using BOLT-LMM; however, in its current incarnation it does not take a pre-computed GRM.

Thanks
Mike

Jian Yang
Administrator

Posts: 362

Post by Jian Yang on Dec 2, 2015 3:10:30 GMT

You might try running this on individual chromosomes separately (e.g. start with chr22) to see if you get the same issue. It's likely to be a RAM issue. We are thinking of releasing a more RAM-efficient version later, but that really depends on whether we have time to do so.
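The per-chromosome test could be scripted along the following lines. This is a sketch only: it assumes the binary is called `gcta64` and that `--chr` is combined with `--mlma`, `--bfile`, `--grm`, `--pheno`, `--out` and `--thread-num`; the helper name and all file prefixes are placeholders. The commands are printed rather than executed, so they can be inspected first.

```python
import shlex
from typing import List


def mlma_command(chrom: int, bfile: str, grm: str, pheno: str,
                 out_prefix: str, threads: int = 20,
                 gcta_bin: str = "gcta64") -> List[str]:
    """Build the argument list for one single-chromosome MLMA run."""
    return [
        gcta_bin, "--mlma",
        "--bfile", bfile,          # PLINK binary genotype prefix (placeholder)
        "--grm", grm,              # pre-computed GRM prefix (placeholder)
        "--pheno", pheno,          # phenotype file (placeholder)
        "--chr", str(chrom),       # restrict the analysis to one chromosome
        "--out", f"{out_prefix}_chr{chrom}",
        "--thread-num", str(threads),
    ]


if __name__ == "__main__":
    # Start with chr22, the smallest, and work towards chr1.
    for chrom in range(22, 0, -1):
        cmd = mlma_command(chrom, "mygeno", "mygrm", "mypheno.txt", "mlma_test")
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell or handed to a job scheduler; writing a separate output prefix per chromosome keeps the results from overwriting one another.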

mikem
New Member

Posts: 2

Post by mikem on Dec 10, 2015 11:48:12 GMT

Thanks Jian,
The GRM was generated from a subset of variants, approximately 10k. I get the same issue when trying to perform a bivariate REML analysis using ~10k SNPs.

I certainly hope you decide to release a more memory-efficient version, for my own selfish reasons!