9.0 Input File Formats and Conversion Program
This file contains documentation of the program convertf, which converts between the 5 different file formats we support. Note that "file format" simultaneously refers to the formats of three distinct files:
- genotype file: contains genotype data for each individual at each SNP
- snp file: contains information about each SNP
- indiv file: contains information about each individual
Below, we document all 5 formats:
- ANCESTRYMAP
- EIGENSTRAT
- PED
- PACKEDPED
- PACKEDANCESTRYMAP
and we explain how to use convertf to get from one format to another. Note all the example files are in the directory:
ANCESTRYMAP Format:
- genotype file: see example.ancestrymapgeno
- snp file: see example.snp
- indiv file: see example.ind
The genotype file contains 1 line per valid genotype, and has 3 columns:
SNP_ID | Sample_ID | Number of Variant Alleles (0, 1 or 2)
Missing genotypes are encoded by the absence of an entry in the genotype file.
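As a minimal sketch of the missing-by-absence convention, the Python below reads the three columns into a dictionary keyed by (SNP_ID, Sample_ID) and reports a lookup with no entry as missing. It assumes whitespace-separated columns; the function names and sample data are illustrative and not part of convertf.

```python
def parse_ancestrymap_geno(lines):
    """Map (snp_id, sample_id) -> allele count from ANCESTRYMAP genotype lines.

    Assumes whitespace-separated columns (an assumption; the text above
    only names the three columns).
    """
    genotypes = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        snp_id, sample_id, count = fields[0], fields[1], int(fields[2])
        if count not in (0, 1, 2):
            raise ValueError(f"invalid allele count in line: {line!r}")
        genotypes[(snp_id, sample_id)] = count
    return genotypes


def lookup(genotypes, snp_id, sample_id):
    """Return the allele count, or None when the entry is absent (missing)."""
    return genotypes.get((snp_id, sample_id))


# Hypothetical data: rs0002 has no line for SAMPLE1, so that genotype is missing.
example_lines = [
    "rs0001 SAMPLE0 1",
    "rs0001 SAMPLE1 2",
    "rs0002 SAMPLE0 0",
]
table = parse_ancestrymap_geno(example_lines)
```

Here `lookup(table, "rs0002", "SAMPLE1")` returns `None`, because no line was written for that SNP and sample.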
The snp file contains 1 line per SNP. There are 4 columns:
SNP_ID | Chromosome_Num | Genetic_Position | Physical_Position
Use 23 for X chromosome. The genetic position can be in Morgans or centiMorgans, and the physical position is in bases.
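The four snp-file columns can be read with a small parser such as the sketch below. It assumes whitespace-separated columns and ignores any columns beyond the fourth; `SnpRecord` and `parse_snp_line` are hypothetical names, and the record shown is made up.

```python
from typing import NamedTuple


class SnpRecord(NamedTuple):
    snp_id: str
    chromosome: int      # 23 denotes the X chromosome in this format
    genetic_pos: float   # Morgans or centiMorgans, depending on the file
    physical_pos: int    # bases


def parse_snp_line(line):
    """Parse one line of an ANCESTRYMAP/EIGENSTRAT snp file (first 4 columns)."""
    snp_id, chrom, gpos, ppos = line.split()[:4]
    return SnpRecord(snp_id, int(chrom), float(gpos), int(ppos))


# Hypothetical record on the X chromosome.
rec = parse_snp_line("rs0005 23 0.012345 1500000")
```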
The indiv file contains 1 line per individual, and has 3 columns:
Sample_ID | Gender | Status
The gender column can be M (male), F (female) or U (unknown). The status column might refer to Case or Control status, or might be a population group label. If this entry is set to "Ignore", then that individual and all genotype data from that individual will be removed from the data set in all convertf output.
The name "ANCESTRYMAP format" is used for historical reasons only. This software is completely independent of our 2004 ANCESTRYMAP software.
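The indiv-file rules above (three columns, gender restricted to M, F or U, and removal of individuals whose status is "Ignore") can be sketched as follows. This is an illustration under the assumption of whitespace-separated columns, not convertf's own code; the sample IDs and labels are invented.

```python
def read_individuals(lines):
    """Return (sample_id, gender, status) tuples, dropping 'Ignore' rows."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        sample_id, gender, status = fields[:3]
        if gender not in ("M", "F", "U"):
            raise ValueError(f"unexpected gender code {gender!r} for {sample_id}")
        if status == "Ignore":
            continue  # this individual is removed from all output
        kept.append((sample_id, gender, status))
    return kept


# Hypothetical indiv file contents.
individuals = read_individuals([
    "SAMPLE0 F Case",
    "SAMPLE1 M Ignore",
    "SAMPLE2 U Control",
])
```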
EIGENSTRAT Format: Used by EIGENSTRAT (both in the 07/23/06 release and in the current release).
- genotype file: see example.eigenstratgeno
- snp file: see example.snp (same as above)
- indiv file: see example.ind (same as above)
The genotype file contains 1 line per SNP. Each line contains 1 character per individual:
0 means zero copies of reference allele.
1 means one copy of reference allele.
2 means two copies of reference allele.
9 means missing data.
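Decoding one line of an EIGENSTRAT genotype file is a character-by-character mapping. The sketch below turns a line into a list of allele counts, using `None` for the missing code 9; the function name and the sample line are illustrative assumptions.

```python
def decode_eigenstrat_line(line):
    """Decode one SNP line: one character per individual, '9' means missing."""
    counts = []
    for ch in line.strip():
        if ch not in ("0", "1", "2", "9"):
            raise ValueError(f"unexpected genotype character {ch!r}")
        counts.append(None if ch == "9" else int(ch))
    return counts


# Hypothetical line for one SNP typed in five individuals.
row = decode_eigenstrat_line("01291")
```

The position of each character matches the order of individuals in the indiv file, so `row[3]` being `None` means the fourth individual has no call at this SNP.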
The program ind2pheno.perl in this directory will convert from example.ind to the example.pheno file needed by the EIGENSTRAT software. To run this script type on the command line:
>> ./ind2pheno.perl example.ind example.pheno
PED Format:
- genotype file: see example.ped *** file name MUST end in .ped ***
- snp file: see example.pedsnp *** file name MUST end in .pedsnp ***
- convertf also supports .map suffix for this input file name
- indiv file: see example.pedind *** file name MUST end in .pedind ***
convertf also supports the full .ped file (example.ped) for this input file
Note that the mandatory suffixes enable our software to recognize this file format.
The indiv file contains the first 7 columns of the genotype file (see below).
The genotype file is 1 line per individual. Each line contains 7 columns of information about the individual, plus two genotype columns for each SNP in the order the SNPs are specified in the snp file.
The first 7 columns are:
- 1st column is family ID.
- 2nd column is sample ID.
- 3rd and 4th column are sample IDs of parents.
- 5th column is gender (male is 1, female is 2)
- 6th column is case/control status (1 is control, 2 is case) OR quantitative trait value OR population group label.
- 7th column (this column is optional) is always set to 1.
convertf does not support pedigree information, so 1st, 3rd, 4th columns are ignored in convertf input and set to arbitrary values in convertf output. In the two genotype columns for each SNP, missing data is represented by 0.
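A minimal sketch of reading one PED genotype line is shown below. Because the 7th column is optional, the sketch takes the number of SNPs (from the snp file) and infers how many leading information columns are present. It assumes whitespace-separated fields and treats a genotype as missing when either of its two allele columns is 0; both choices, the function name and the sample line are assumptions for illustration.

```python
def parse_ped_line(line, n_snps):
    """Split one PED line into sample info and per-SNP allele pairs.

    n_snps is the number of SNPs in the snp file; it is used to work out
    whether the optional 7th information column is present.
    """
    fields = line.split()
    n_info = len(fields) - 2 * n_snps
    if n_info not in (6, 7):
        raise ValueError(f"expected 6 or 7 information columns, found {n_info}")
    # Columns 1, 3 and 4 (family and parent IDs) are ignored, as in convertf.
    sample_id, gender, status = fields[1], fields[4], fields[5]
    alleles = fields[n_info:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        a, b = alleles[i], alleles[i + 1]
        # Missing data is represented by 0 in the genotype columns.
        genotypes.append(None if "0" in (a, b) else (a, b))
    return sample_id, gender, status, genotypes


# Hypothetical line: 7 information columns followed by 3 SNPs.
parsed = parse_ped_line("FAM1 SAMPLE0 0 0 1 2 1 A G 0 0 C C", n_snps=3)
```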
The snp file contains 1 line per SNP. There are 4 columns:
Chromosome_Num | SNP_ID | Genetic_Position | Physical_Position
Use X for X chromosome. The genetic position is in Morgans, and the physical position is in bases.
The indiv file contains the first 7 columns of the genotype file.
The PED format is used by the PLINK package of Shaun Purcell. See https://www.cog-genomics.org/plink2
PACKEDPED Format:
- genotype file: see example.bed *** file name MUST end in .bed ***
- snp file: see example.pedsnp *** file name MUST end in .pedsnp ***
- convertf also supports .map suffix for this input file name
- indiv file: see example.pedind *** file name MUST end in .pedind ***
convertf also supports a .ped file (example.ped) for this input file
Note that the mandatory suffixes enable our software to recognize this file format.
example.bed is a packed binary file (2 bits per genotype).
The PACKEDPED format is used by the PLINK package of Shaun Purcell. See https://www.cog-genomics.org/plink2
For input in PACKEDPED format, snp file MUST be in genomewide order.
For input in PACKEDPED format, genotype file MUST be in SNP-major order (the PLINK default: see PLINK documentation for details.)
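To make "2 bits per genotype" and "SNP-major order" concrete, the sketch below unpacks a .bed byte string. The byte layout (a 3-byte header whose third byte marks SNP-major order, four genotypes per byte starting at the low-order bits, and the four 2-bit codes) is taken from PLINK's published description of the .bed format, not from this document, and the sketch makes no claim about how convertf reads the file internally. The function name, labels and sample bytes are hypothetical.

```python
import math

BED_MAGIC = bytes([0x6C, 0x1B, 0x01])  # third byte 0x01 marks SNP-major order
CODES = {0b00: "hom_a1", 0b01: None, 0b10: "het", 0b11: "hom_a2"}  # None = missing


def decode_bed(data, n_indiv, n_snps):
    """Unpack SNP-major .bed bytes into one list of genotype labels per SNP."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK .bed file")
    bytes_per_snp = math.ceil(n_indiv / 4)  # each SNP is padded to whole bytes
    if len(data) != 3 + bytes_per_snp * n_snps:
        raise ValueError("file size does not match the sample and SNP counts")
    rows = []
    for s in range(n_snps):
        start = 3 + s * bytes_per_snp
        block = data[start:start + bytes_per_snp]
        row = []
        for i in range(n_indiv):
            # Individual i sits in byte i // 4, at bit offset 2 * (i % 4).
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            row.append(CODES[code])
        rows.append(row)
    return rows


# Hypothetical file: 1 SNP, 5 individuals, so 2 data bytes after the header.
decoded = decode_bed(BED_MAGIC + bytes([0x78, 0x03]), n_indiv=5, n_snps=1)
```

The individual and SNP counts are not stored in the .bed file itself, which is why they are passed in; in practice they come from the indiv and snp files.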
PACKEDANCESTRYMAP Format:
- genotype file: see example.packedancestrymapgeno
- snp file: see example.snp (same as above)
- indiv file: see example.ind (same as above)
Note that example.packedancestrymapgeno is a packed binary file (2 bits per genotype).
DOCUMENTATION OF convertf program:
To run this program type on the command line:
>> /bin/convertf -p parfile
We illustrate how parfile works via a toy example: (see example.perl in this directory)
par.ANCESTRYMAP.EIGENSTRAT converts ANCESTRYMAP to EIGENSTRAT format
par.EIGENSTRAT.PED converts EIGENSTRAT to PED format
par.PED.EIGENSTRAT converts PED to EIGENSTRAT format
par.PED.PACKEDPED converts PED to PACKEDPED format
par.PACKEDPED.PACKEDANCESTRYMAP converts PACKEDPED to PACKEDANCESTRYMAP
par.PACKEDANCESTRYMAP.ANCESTRYMAP converts PACKEDANCESTRYMAP to ANCESTRYMAP
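As a rough illustration of what such a parfile might contain, the sketch below renders a set of parameters as text. It assumes a `name: value` line syntax and an `indivname` parameter for the input indiv file; neither is spelled out in this section, so check them against the shipped par.* files before relying on this. The helper `make_parfile` is hypothetical.

```python
FORMATS = {"ANCESTRYMAP", "EIGENSTRAT", "PED", "PACKEDPED", "PACKEDANCESTRYMAP"}


def make_parfile(params):
    """Render convertf parameters as 'name: value' lines (assumed syntax)."""
    if params.get("outputformat") not in FORMATS:
        raise ValueError("outputformat must be one of: " + ", ".join(sorted(FORMATS)))
    width = max(len(name) for name in params) + 1  # +1 leaves room for the colon
    lines = []
    for name, value in params.items():
        label = name + ":"
        lines.append(f"{label:<{width}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical ANCESTRYMAP -> EIGENSTRAT conversion using the example file names.
parfile_text = make_parfile({
    "genotypename": "example.ancestrymapgeno",
    "snpname": "example.snp",
    "indivname": "example.ind",  # assumed name for the input indiv file
    "outputformat": "EIGENSTRAT",
    "genotypeoutname": "example.eigenstratgeno",
    "snpoutname": "example.snp",
    "indivoutname": "example.ind",
})
```

Writing `parfile_text` to a file and passing that file to convertf with `-p` would correspond to the first conversion in the list above, under the stated assumptions.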
Note that the choice of which allele is the reference allele may be arbitrary and thus converting to a new format and back again may change the choice of reference allele.
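The effect of a changed reference allele is simple arithmetic: a count of c copies of one allele is 2 - c copies of the other, and missing data stays missing. The sketch below applies this to EIGENSTRAT-style counts (with 9 as the missing code); `flip_reference` is a hypothetical helper and is not part of convertf.

```python
def flip_reference(counts):
    """Re-express allele counts relative to the other allele (9 stays missing)."""
    return [9 if c == 9 else 2 - c for c in counts]


original = [0, 1, 2, 9]
flipped = flip_reference(original)
```

Applying the flip twice restores the original counts, so two files that differ only in this way carry the same genotype information.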
DESCRIPTION OF EACH PARAMETER in parfile for convertf:
Parameter Name | Data type | Description | Possible and Default values
genotypename | String | input genotype file |
snpname | String | input snp file |
outputformat | String | Can be one of the following: ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED or PACKEDANCESTRYMAP |
genotypeoutname | String | output genotype file |
snpoutname | String | output snp file |
indivoutname | String | output indiv file |
OPTIONAL PARAMETERS | | |
familynames | String | Only relevant if input format is PED or PACKEDPED. |
noxdata | Boolean | If set to YES, all SNPs on X chromosome are removed from the data set. |
nomalexhet | Boolean | If set to YES, any het genotypes on X chr for males are changed to missing data |
badsnpname | String | Specifies a list of SNPs which should be removed from the data set |
outputgroup | Boolean | Only relevant if outputformat is PED or PACKEDPED | NO
- familynames : If set to YES, then family ID will be concatenated to sample ID. This supports different individuals with different family ID but same sample ID. The convertf default for this parameter is YES.
- noxdata: The convertf default for this parameter is NO.
- nomalexhet: The convertf default for this parameter is NO.
- outputgroup: This parameter specifies what the 6th column of information about each individual should be in the output. If outputgroup is set to NO (the default), the 6th column will be set to 1 for each Control and 2 for each Case, as specified in the input indiv file. [Individuals specified with some other label, such as a population group label, will be assumed to be controls and the 6th column will be set to 1.] If outputgroup is set to YES, the 6th column will be set to the exact label specified in the input indiv file. [This functionality preserves population group labels.]
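The outputgroup rule described above amounts to a small mapping from the input status label to the 6th output column. The sketch below restates it in Python; it assumes the labels are spelled exactly "Case" and "Control", and the function name is hypothetical.

```python
def ped_status_column(status, outputgroup=False):
    """Return the 6th PED column for an individual's input status label."""
    if outputgroup:
        return status  # YES: keep the exact label, e.g. a population group
    # NO (the default): 2 for a Case; a Control or any other label becomes 1.
    return "2" if status == "Case" else "1"


# Hypothetical labels, including a population group label.
default_codes = [ped_status_column(s) for s in ("Case", "Control", "CEU")]
group_codes = [ped_status_column(s, outputgroup=True) for s in ("Case", "CEU")]
```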