I am trying to test the limits of GCTA --mlma by steadily increasing the number of samples I analyse with a linear mixed model. I am using a pre-computed GRM, and have tested the MLM analysis with 12k, 14k, 25k and 40k individuals. Compute time increases non-linearly with sample size (probably to be expected). The GRM is re-computed for each new sample size.
However, when I attempt to run the MLM analysis with 50k samples I get a memory error:

    terminate called after throwing an instance of 'std::bad_alloc'
      what():  std::bad_alloc
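GCTA's internal memory layout isn't documented in this thread, but an MLM typically holds one or more dense n-by-n relationship/variance matrices in double precision, so a rough back-of-envelope (a sketch only, not GCTA's actual allocation) shows why 50k is qualitatively harder than 40k:

```shell
# Back-of-envelope only: size of ONE dense n x n matrix of 8-byte doubles.
# GCTA may hold several such matrices (plus genotypes) at once.
for n in 12000 25000 40000 50000 150000; do
  awk -v n="$n" 'BEGIN { printf "n = %6d -> %6.1f GB per dense matrix\n",
                         n, n * n * 8 / (1024 ^ 3) }'
done
```

A single such matrix at n = 50k is already ~18.6 GB, and several working copies would be consistent with the ~78 GB peak observed; at n = 150k one matrix alone is ~168 GB.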
I have generally found that this error arises when the individuals in the GRM and the input genotype/phenotype files are not perfectly matched, at least for smaller sample sizes. I am running on a compute node with >300 GB of RAM, and the maximum observed memory usage is ~78 GB. There is one other process on that node which has reserved 60 GB of virtual memory. Each run uses 20 CPUs.
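For ruling out the ID-mismatch cause, a toy sketch of comparing the two ID lists (the file names and IDs below are made up for illustration; point the awk commands at your real .grm.id and phenotype files, which both carry FID and IID in their first two columns):

```shell
# Toy input standing in for real files (illustration only).
cat > example.grm.id <<'EOF'
F1 I1
F2 I2
EOF
cat > example.pheno <<'EOF'
F1 I1 -0.3
F3 I3 1.2
EOF
# Extract FID/IID pairs and sort them (comm requires sorted input).
awk '{print $1, $2}' example.grm.id | sort > grm_ids.txt
awk '{print $1, $2}' example.pheno  | sort > pheno_ids.txt
# comm -3 prints only IDs NOT present in both files;
# empty output means the files are perfectly matched.
comm -3 grm_ids.txt pheno_ids.txt
```

Here it flags F2/I2 (GRM only) and F3/I3 (phenotype only); an empty result on your real files would rule this cause out.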
I'm running the latest version of GCTA (v1.25.0); whilst I expect compute time to increase with sample size, I had failure issues previously with duplicated IDs under v1.24.7.
Is this a natural sample size limit for GCTA, or is there scope to expand it further? My aim is to analyse ~150k individuals with this MLM. I have looked into using BOLT-LMM; however, in its current incarnation it does not take a pre-computed GRM.
You might try running individual chromosomes separately (e.g. start with chr22) to see whether you get the same issue. It's likely to be a RAM issue; we are thinking of releasing a more RAM-efficient version later, but it really depends on whether we have time to do so.
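A minimal sketch of that per-chromosome approach, assuming PLINK binary genotypes and a pre-computed GRM prefix ("geno", "mygrm" and "pheno.txt" are placeholder names, not from the thread); the echo prints each command for inspection before you commit to running it:

```shell
# Placeholder file prefixes throughout; drop the leading "echo" to
# actually launch GCTA once the printed commands look right.
for chr in 22 21 20; do
  echo gcta64 --mlma --bfile geno --grm mygrm --pheno pheno.txt \
       --chr "$chr" --thread-num 20 --out mlma_chr"$chr"
done
```

Restricting the association test to one chromosome with --chr shrinks the genotype data held per run, though the GRM-derived n-by-n matrices are the same size regardless, so this mainly isolates whether the failure is data-size dependent.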