
Post by Jian Yang on Sept 21, 2015 1:22:32 GMT
1. Making a GRM
This process involves genotype data in PLINK binary format, a SNP genotype matrix, the GRM itself, and an n x n matrix of the number of SNPs used in the GRM calculation, where n is the number of samples and m is the number of SNPs.
Size of the n x m genotype matrix in PLINK binary format (2 bits per genotype) = m * n / 4 bytes
Size of GRM in double precision float = n * n * 8 bytes
n x n matrix for the number of SNPs used to calculate GRM in single precision = n * n * 4 bytes
Size of SNP genotype matrix in single precision float = m * n * 4 bytes, where m is the number of SNPs
Total memory usage ~= m * n / 4 + m * n * 4 + n * n * 8 + n * n * 4 = (4.25 * m + 12 * n) * n bytes
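The per-matrix sizes above can be turned into a quick estimator. This is a minimal sketch (not part of GCTA; the function names are illustrative) that just evaluates the formulas listed:

```python
def grm_memory_bytes(n, m):
    """Estimate peak memory (bytes) for GRM calculation with
    n samples and m SNPs, per the formulas above."""
    plink_bed = m * n / 4    # PLINK binary genotypes, 2 bits each
    genotypes = m * n * 4    # SNP genotype matrix, single precision
    grm = n * n * 8          # GRM, double precision
    snp_counts = n * n * 4   # per-pair SNP-count matrix, single precision
    return plink_bed + genotypes + grm + snp_counts

# Example: 10,000 samples and 1,000,000 SNPs
# (4.25 * 1e6 + 12 * 1e4) * 1e4 = 4.37e10 bytes, roughly 41 GB
print(grm_memory_bytes(10_000, 1_000_000) / 1024**3, "GiB")
```

For 1000G-imputed data m can exceed 10 million, which is why the m * n * 4 genotype-matrix term dominates and a per-chromosome run is advisable.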
This is usually very large, particularly for 1000G-imputed data. I would recommend running the analysis per chromosome and then merging the GRMs.
2. REML analysis
The REML process is a bit more complicated. It involves a number of n x n matrices, e.g. the GRM, the variance-covariance matrix V, the projection matrix P, and temporary matrices for computing the inverse of V.
Total memory usage ~= (t + 4) * n * n * 8 bytes, where t is the number of genetic components (i.e. the number of GRMs) fitted in the model.
Note that these calculations do not account for vectors and other matrices of smaller size. Therefore, when submitting a job to a computer cluster, I would request about 20% more memory than the predicted amount.
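The REML formula and the 20% safety margin can be combined the same way. Again a minimal sketch with illustrative names, not GCTA code:

```python
def reml_memory_bytes(n, t):
    """Estimate memory (bytes) for REML with n samples and
    t genetic components (GRMs), per the formula above."""
    return (t + 4) * n * n * 8

def cluster_request_bytes(estimate, margin=0.2):
    """Pad the estimate (by 20% by default) to cover vectors
    and the smaller matrices not counted in the formula."""
    return estimate * (1 + margin)

# Example: 10,000 samples, a single GRM (t = 1)
est = reml_memory_bytes(10_000, 1)         # 4e9 bytes, about 3.7 GiB
req = cluster_request_bytes(est)           # 4.8e9 bytes requested
print(est / 1024**3, req / 1024**3)
```

Note the quadratic growth in n: doubling the sample size quadruples the REML memory requirement, while adding a GRM adds only one more n x n double-precision matrix.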

