kjkan
New Member
Posts: 3
|
Post by kjkan on Nov 26, 2014 18:00:25 GMT
Hi all,

My supervisor is interested in carrying out a (bivariate) GCTA. He has PLINK dosage data available in single-dose format (once created using dose2plink), but I believe the GCTA tool does not support this kind of data as input (but correct me if I'm wrong!). We came up with the following two solutions.

1) Convert the PLINK dosage data to MACH dosage format and run GCTA using option --dosage-mach (or --dosage-mach-gz).
2) Convert the PLINK dosage data to PLINK bed format and run GCTA using option --bfile.

Both solutions are problematic in the sense that I couldn't find any tools to convert PLINK single-dose data to MACH or PLINK bed format. The tool fcgene, for example, only accepts PLINK dosage data in the format of two probabilities (but again, please correct me if I'm wrong!).

My question to you is: what is the best way to go?

Two remarks in advance:
a) Of course, in the end we can write our own scripts to convert the data, but existing tools are preferred (e.g., because we can cite these in a paper).
b) I'm aware that the easiest way would have been to use the original MACH output files, but, alas, these are no longer available.

Hoping you can help me here, and thanks in advance,
Kees-Jan Kan
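For illustration, option 1 is essentially a transposition: PLINK single-dose files have one row per SNP, whereas a MACH .mldose file has one row per individual. Below is a minimal Python sketch under stated assumptions, not a tested converter. It assumes the header reads `SNP A1 A2` followed by one FID/IID pair per individual, that each data row holds one dosage per individual on the 0..2 scale, and that both formats count the same allele. The function name `plink_dose_to_mldose` is hypothetical.

```python
def plink_dose_to_mldose(lines):
    """Transpose PLINK single-dose rows (one per SNP) into MACH-style
    .mldose rows (one per individual).

    Assumptions (not confirmed in this thread): the header is
    'SNP A1 A2 FID1 IID1 FID2 IID2 ...', each data row carries one
    dosage per individual, and the dosage scale needs no rescaling.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    header, body = rows[0], rows[1:]

    ids = header[3:]
    if len(ids) % 2 != 0:
        raise ValueError("header should list FID/IID pairs after SNP A1 A2")
    samples = list(zip(ids[0::2], ids[1::2]))

    # Every SNP row must have exactly one dosage per individual.
    for r in body:
        if len(r) != 3 + len(samples):
            raise ValueError(f"unexpected column count for SNP {r[0]}")

    out = []
    for j, (fid, iid) in enumerate(samples):
        doses = [r[3 + j] for r in body]  # column j across all SNPs
        out.append(" ".join([f"{fid}->{iid}", "ML_DOSE"] + doses))
    return out


example = [
    "SNP A1 A2 F1 I1 F2 I2",
    "rs1 A G 0.1 1.9",
    "rs2 C T 2.0 0.5",
]
for line in plink_dose_to_mldose(example):
    print(line)
```

This sketch holds the whole matrix in memory, so a genome-wide file would likely need to be processed per chromosome or in chunks.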
|
|
|
Post by chrchang on Dec 7, 2014 5:00:28 GMT
I'm not aware of any existing tools for this; you're best off writing a short script to convert your data into an intermediate format which fcGene and/or PLINK can then convert to PLINK binary format. (Or you could write a program to perform the full conversion; that takes more work, though.) I will try to fix this state of affairs in 2015. In the meantime, you might find the memory-efficient dose2plink implementation at github.com/chrchang/plink-ng/blob/master/dose2plink.c to be handy.
|
|
Post by kjkan on Dec 7, 2014 16:25:08 GMT
Many thanks for your reply. Good to know that there are no existing tools available (yet). In the end, we did indeed write a script ourselves. We later compared it with the dose2plink script at genepi.qimr.edu.au/staff/sarahMe/mach2merlin/dose2plink.pl (I assume that's the same script as the one you are referring to).

Can I ask one more GCTA question? What information does GCTA use from .mlinfo files? I ask with filtering in mind, and because of the following: we've run a GWAS on the dosage data in PLINK in order to create a PLINK .assoc.dosage file, and we wrote a script to convert this file to a .mlinfo-style file. We assumed that PLINK's INFO column ('INFO R-squared quality metric / information content') corresponds to the .mlinfo Rsq column, and we inserted a dummy code (1) in place of the .mlinfo Quality column. MAF was calculated on the basis of the frequencies of A1 and A2.

Does GCTA read any of those columns, or does it calculate the R^2 (and/or MAF) itself?

Thanks,
Kees-Jan
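The conversion described above can be sketched roughly as follows. This is an illustrative Python sketch, not the script actually used in this thread. It assumes the .assoc.dosage file has a header containing the columns SNP, A1, A2, FRQ and INFO, that FRQ is the frequency of A1, and that the target .mlinfo layout is `SNP Al1 Al2 Freq1 MAF Quality Rsq`. The function name `assoc_dosage_to_mlinfo` is hypothetical.

```python
def assoc_dosage_to_mlinfo(lines):
    """Build .mlinfo-style rows from PLINK .assoc.dosage rows.

    Assumptions (not confirmed in this thread): the input header names
    the columns SNP, A1, A2, FRQ and INFO; FRQ is the frequency of A1;
    INFO is copied into Rsq; Quality is filled with the dummy value 1.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    col = {name: i for i, name in enumerate(rows[0])}

    out = ["\t".join(["SNP", "Al1", "Al2", "Freq1", "MAF", "Quality", "Rsq"])]
    for r in rows[1:]:
        if r[col["FRQ"]] == "NA" or r[col["INFO"]] == "NA":
            continue  # skip SNPs without usable frequency or INFO values
        frq = float(r[col["FRQ"]])
        maf = min(frq, 1.0 - frq)  # minor allele frequency from the A1 frequency
        out.append("\t".join([
            r[col["SNP"]], r[col["A1"]], r[col["A2"]],
            f"{frq:.4f}", f"{maf:.4f}",
            "1",               # dummy Quality value, as described in the post
            r[col["INFO"]],    # PLINK INFO used as a stand-in for Rsq
        ]))
    return out


example = [
    "SNP A1 A2 FRQ INFO OR SE P",
    "rs1 A G 0.8000 0.9500 1.1 0.1 0.5",
    "rs2 C T NA NA NA NA NA",
]
for line in assoc_dosage_to_mlinfo(example):
    print(line)
```

Looking columns up by header name rather than by position keeps the sketch working if extra columns such as CHR and BP are present.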
|
|
|
Post by chrchang on Dec 7, 2014 21:28:10 GMT
As far as I can tell:
* the Rsq column can be used for filtering out SNPs, with GCTA's --imput-rsq flag.
* GCTA ignores the Quality column.
|
|
Post by kjkan on Dec 7, 2014 23:37:11 GMT
Ok, thanks! That's reassuring.
Kees-Jan
|
|