GCTA - Data management | Complex Trait Genetics Forum


Post by Jian Yang on Oct 31, 2015 2:20:47 GMT

--keep test.indi.list
Specify a list of individuals to be included in the analysis.

--remove test.indi.list
Specify a list of individuals to be excluded from the analysis.

--chr 1
Include SNPs on a specific chromosome in the analysis, e.g. chromosome 1.

--autosome-num 22
Specify the number of autosomes for a species other than human. For example, if you specify the number of autosomes to be 19, then chromosomes 1 to 19 will be recognized as autosomes and chromosome 20 will be recognized as the X chromosome. The default number is 22 if this option is not specified.
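
The numbering rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not GCTA's own code; the function name classify_chromosome and the "other" label for any remaining chromosome codes are assumptions.

```python
def classify_chromosome(chrom, autosome_num=22):
    """Classify a chromosome number given the number of autosomes."""
    if 1 <= chrom <= autosome_num:
        return "autosome"
    if chrom == autosome_num + 1:
        return "X"
    # Codes beyond the X chromosome are not described in the text (assumption).
    return "other"

# With --autosome-num 19, chromosome 19 is an autosome and chromosome 20 is X.
print(classify_chromosome(19, autosome_num=19))
print(classify_chromosome(20, autosome_num=19))
```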

--autosome
Include SNPs on all of the autosomes in the analysis.

--extract test.snplist
Specify a list of SNPs to be included in the analysis.
Input file format

test.snplist
rs103645
rs175292
……

--exclude test.snplist
Specify a list of SNPs to be excluded from the analysis.
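
To make the effect of the two list options concrete, here is a minimal Python sketch that reads a SNP list in the one-ID-per-line format shown above and applies it both ways. The helper name read_snplist and the third SNP ID are hypothetical, and this is not how GCTA itself is implemented.

```python
def read_snplist(text):
    """Return the set of SNP IDs found one per line, ignoring blank lines."""
    return {line.strip() for line in text.splitlines() if line.strip()}

# SNPs present in the data (snp_extra is a made-up ID for illustration).
snps_in_data = ["rs103645", "rs175292", "snp_extra"]
listed = read_snplist("rs103645\nrs175292\n")

# --extract keeps only listed SNPs; --exclude drops them.
extracted = [s for s in snps_in_data if s in listed]
excluded = [s for s in snps_in_data if s not in listed]
print(extracted)
print(excluded)
```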

--extract-snp rs123678
Specify a single SNP to be included in the analysis.

--exclude-snp rs123678
Specify a single SNP to be excluded from the analysis.

--extract-region-snp rs123678 1000
Extract a region centred around a specified SNP, e.g. the ±1000 kb region centred around rs123678.

--exclude-region-snp rs123678 1000
Exclude a region centred around a specified SNP, e.g. the ±1000 kb region centred around rs123678.

--extract-region-bp 1 120000 1000
Extract a region centred around a specified bp, e.g. the ±1000 kb region centred around 120,000 bp on chr 1.

--exclude-region-bp 1 120000 1000
Exclude a region centred around a specified bp, e.g. the ±1000 kb region centred around 120,000 bp on chr 1. This option is particularly useful for an analysis excluding the MHC region.
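
The window arithmetic behind the region options can be sketched as follows. This Python snippet is an illustration only: it assumes 1 kb = 1000 bp and inclusive boundaries, the SNP names and positions are made up, and for the SNP-centred variants the centre would simply be that SNP's own position.

```python
def region_bounds(centre_bp, window_kb):
    """Return (lower, upper) bp bounds of a window of +/- window_kb around centre_bp."""
    half_width = window_kb * 1000  # assumes 1 kb = 1000 bp
    return centre_bp - half_width, centre_bp + half_width

def in_region(chrom, bp, region_chrom, centre_bp, window_kb):
    """True if a SNP at (chrom, bp) lies inside the window (inclusive bounds assumed)."""
    lower, upper = region_bounds(centre_bp, window_kb)
    return chrom == region_chrom and lower <= bp <= upper

# Hypothetical SNPs as name -> (chromosome, bp position).
snps = {"snpA": (1, 500_000), "snpB": (1, 2_000_000), "snpC": (2, 500_000)}

# Region: chr 1, centred on 120,000 bp, +/- 1000 kb.
extracted = [n for n, (c, bp) in snps.items() if in_region(c, bp, 1, 120_000, 1000)]
excluded = [n for n, (c, bp) in snps.items() if not in_region(c, bp, 1, 120_000, 1000)]
print(region_bounds(120_000, 1000))
print(extracted, excluded)
```

Note that the lower bound is negative in this example, which is harmless because no SNP has a negative position.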

--maf 0.01
Exclude SNPs with minor allele frequency (MAF) less than a specified value, e.g. 0.01.

--max-maf 0.1
Include SNPs with MAF less than a specified value, e.g. 0.1.
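
The two frequency filters can be illustrated with a short Python sketch. It takes the wording above literally ("less than"), so how a SNP exactly on the threshold is treated is an assumption, and the SNP named snp_rare is made up. It is not GCTA's implementation.

```python
def minor_allele_freq(p):
    """MAF is the smaller of the two allele frequencies of a biallelic SNP."""
    return min(p, 1.0 - p)

# Reference-allele frequencies; snp_rare is a hypothetical SNP.
freqs = {"rs103645": 0.312, "rs175292": 0.602, "snp_rare": 0.995}

# --maf 0.01: drop SNPs whose MAF is below 0.01.
passes_maf = [s for s, p in freqs.items() if minor_allele_freq(p) >= 0.01]

# --max-maf 0.1: keep only SNPs whose MAF is below 0.1.
passes_max_maf = [s for s, p in freqs.items() if minor_allele_freq(p) < 0.1]

print(passes_maf)
print(passes_max_maf)
```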

--update-sex test.indi.sex.list
Update sex information of the individuals from a file.
Input file format
test.indi.sex.list (no header line; columns are family ID, individual ID and sex). Sex coding: “1” or “M” for male and “2” or “F” for female.
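
As a sketch of the sex coding, the following Python snippet parses one line of such a file. The function name parse_sex_line, the whitespace-separated layout, and the rejection of any other code (including lower-case letters) are assumptions made for illustration.

```python
# Accepted codes as listed in the text.
SEX_CODES = {"1": "male", "M": "male", "2": "female", "F": "female"}

def parse_sex_line(line):
    """Parse 'FID IID SEX' and return ((FID, IID), 'male' or 'female')."""
    fid, iid, sex = line.split()
    if sex not in SEX_CODES:
        # Behaviour for unrecognised codes is an assumption.
        raise ValueError(f"unrecognised sex code: {sex!r}")
    return (fid, iid), SEX_CODES[sex]

print(parse_sex_line("011 0101 1"))
print(parse_sex_line("012 0102 F"))
```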

--update-ref-allele test_reference_allele.txt
Assign a list of alleles to be the reference alleles for the SNPs included in the analysis. By default, the first allele listed in the *.bim file (the 5th column) or *.mlinfo.gz file (the 2nd column) is assigned to be the reference allele. NOTE: This option is not valid for imputed dosage data.
Input file format
test_reference_allele.txt (no header line; columns are SNP ID and reference allele)

rs103645 A
rs175292 G
……
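
If genotypes are coded as the count of the reference allele, then switching the reference allele of a biallelic SNP amounts to replacing each count x with 2 - x and the frequency p with 1 - p. The Python sketch below illustrates that arithmetic only; it assumes biallelic SNPs and uses None for a missing genotype, and the function name is hypothetical.

```python
def switch_reference(genotypes, p):
    """Re-express genotype counts and allele frequency for the other allele.

    genotypes: list of 0/1/2 counts of the old reference allele (None = missing).
    p: frequency of the old reference allele.
    """
    flipped = [None if x is None else 2 - x for x in genotypes]
    return flipped, 1.0 - p

new_genotypes, new_p = switch_reference([0, 1, 2, None], 0.25)
print(new_genotypes, new_p)
```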

--imput-rsq 0.3
Include SNPs with imputation R2 (squared correlation between imputed and true genotypes) larger than a specified value, e.g. 0.3.

--update-imput-rsq test.imput.rsq
Update imputation R2 from a file. For the imputed dosage data, you do not have to use this option because GCTA can read the imputation R2 from the *.mlinfo.gz file, unless you want to overwrite them. For the best guess data (usually in PLINK format), if you want to use an R2 cut-off to filter SNPs, you need to use this option to read the imputation R2 values from the specified file.
Input file format
test.imput.rsq (no header line; columns are SNP ID and imputation R2)

rs103645 0.976
rs175292 1.000
……
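
A minimal Python sketch of reading such a file and applying an R2 cut-off is given below. It follows the "larger than" wording used for the filtering option, the helper name read_rsq and the SNP snp_low are hypothetical, and it is not GCTA's own code.

```python
def read_rsq(text):
    """Parse lines of 'SNP_ID R2' (no header) into a dict of floats."""
    rsq = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            rsq[fields[0]] = float(fields[1])
    return rsq

# snp_low is a made-up SNP with a poor imputation R2.
rsq = read_rsq("rs103645 0.976\nrs175292 1.000\nsnp_low 0.250\n")

threshold = 0.3
passing = [snp for snp, r2 in rsq.items() if r2 > threshold]
print(passing)
```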

--freq
Output allele frequencies of the SNPs included in the analysis (in plain text format).
Output file format
test.freq (no header line; columns are SNP ID, reference allele and its frequency)

rs103645 A 0.312
rs175292 G 0.602
……
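
For readers who want to see where such a frequency comes from, here is a hedged Python sketch that computes the reference allele frequency from 0/1/2 genotype counts and prints a line in the layout shown above. Ignoring missing genotypes (None) and rounding to three decimals are assumptions, and the genotype values are made up.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele: sum of counts over 2 * non-missing individuals."""
    observed = [x for x in genotypes if x is not None]
    if not observed:
        raise ValueError("no non-missing genotypes")
    return sum(observed) / (2 * len(observed))

# Hypothetical genotypes for one SNP; None marks a missing call.
genotypes = [1, 2, 0, None]
p = allele_frequency(genotypes)

# One output line: SNP ID, reference allele, frequency.
line = f"rs103645 A {p:.3f}"
print(line)
```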

--update-freq test.freq
Update allele frequencies of the SNPs from a file rather than calculating from the data. The format of the input file is the same as the output format for the option --freq.

--recode
Output SNP genotypes based on an additive model (i.e. x coded as 0, 1 or 2) in compressed text format, e.g. test.xmat.gz.
--recode-nomiss
Output SNP genotypes based on an additive model without missing data. Missing genotypes are replaced by their expected values, i.e. 2p, where p is the frequency of the coded allele (also called the reference allele) of a SNP.
--recode-std
Output standardised SNP genotypes without missing data. The standardised genotype is w = (x - 2p) / sqrt[2p(1-p)]. Missing genotypes are replaced by zero.
Output file format
test.xmat.gz (The first line contains family ID, individual ID and SNP IDs. The second line contains the two placeholder words “Reference Allele” followed by the reference alleles of the SNPs. A missing genotype is represented by “NA”).

FID IID rs103645 rs175292
Reference Allele A G 
011 0101 1 0
012 0102 2 NA
013 0103 0 1
……
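
The two missing-data treatments can be written out in Python as a small sketch. It applies the formulas given above to one SNP, with None standing in for "NA"; the function names are hypothetical, p is supplied by the caller, and p must lie strictly between 0 and 1 for the standardisation to be defined.

```python
import math

def recode_nomiss(genotypes, p):
    """Additive coding with missing genotypes replaced by the expected value 2p."""
    return [2 * p if x is None else float(x) for x in genotypes]

def recode_std(genotypes, p):
    """Standardised coding w = (x - 2p) / sqrt(2p(1-p)); missing genotypes become 0."""
    sd = math.sqrt(2 * p * (1 - p))
    return [0.0 if x is None else (x - 2 * p) / sd for x in genotypes]

# Hypothetical genotypes for one SNP with reference allele frequency p = 0.5.
x = [1, 2, 0, None]
p = 0.5

nomiss = recode_nomiss(x, p)
std = recode_std(x, p)
print(nomiss)
print(std)
```

Replacing a missing value by 2p in the additive coding and by 0 in the standardised coding is the same operation, since (2p - 2p) / sqrt[2p(1-p)] = 0.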

--make-bed
Save the genotype data in PLINK binary PED files (*.fam, *.bim and *.bed).

Example
# Convert MACH dosage data to PLINK binary PED format
gcta64 --dosage-mach test.mldose.gz test.mlinfo.gz --make-bed --out test
Note: the --dosage-mach option was designed to read output files from an early version of MACH, which might not be compatible with output files from the latest version of MACH or Minimac.