9.0 Input File Formats and Conversion Program

This file contains documentation of the program convertf, which converts between the 5 different file formats we support.  Note that "file format" simultaneously refers to the formats of three distinct files:

  • genotype file: contains genotype data for each individual at each SNP
  • snp file:      contains information about each SNP
  • indiv file:    contains information about each individual

Below, we document all 5 formats:

  • ANCESTRYMAP
  • EIGENSTRAT
  • PED
  • PACKEDPED
  • PACKEDANCESTRYMAP

and we explain how to use convertf to get from one format to another. Note all the example files are in the directory:

ANCESTRYMAP Format:

  • genotype file: see example.ancestrymapgeno
  • snp file:      see example.snp
  • indiv file:    see example.ind

The genotype file contains 1 line per valid genotype, and has 3 columns:

  • 1st column: SNP_ID
  • 2nd column: Sample_ID
  • 3rd column: Number of Variant Alleles (0, 1 or 2)

Missing genotypes are encoded by the absence of an entry in the genotype file.
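As a rough illustration of this layout, the sketch below reads ANCESTRYMAP genotype lines into a dictionary and treats any (SNP, sample) pair without an entry as missing. It assumes the columns are separated by whitespace; the function names and the SNP and sample IDs are hypothetical, not taken from the example files.

```python
from typing import Dict, Iterable, Optional, Tuple


def read_ancestrymap_geno(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map (snp_id, sample_id) -> number of variant alleles (0, 1 or 2)."""
    genotypes: Dict[Tuple[str, str], int] = {}
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue  # skip blank lines
        snp_id, sample_id, count = fields[0], fields[1], int(fields[2])
        if count not in (0, 1, 2):
            raise ValueError(f"unexpected allele count in line: {line!r}")
        genotypes[(snp_id, sample_id)] = count
    return genotypes


def lookup(genotypes: Dict[Tuple[str, str], int],
           snp_id: str, sample_id: str) -> Optional[int]:
    # A pair with no entry in the genotype file is a missing genotype.
    return genotypes.get((snp_id, sample_id))


# Hypothetical toy data: SAMPLE1 has no entry for rs1111, so it is missing.
example = ["rs0000 SAMPLE0 1", "rs0000 SAMPLE1 2", "rs1111 SAMPLE0 0"]
table = read_ancestrymap_geno(example)
```

Here `lookup(table, "rs1111", "SAMPLE1")` returns `None`, which is how the sketch represents a genotype that is absent from the file.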

The snp file contains 1 line per SNP.  There are 4 columns:

  • 1st column: SNP_ID
  • 2nd column: Chromosome_Num
  • 3rd column: Genetic_Position
  • 4th column: Physical_Position

Use 23 for X chromosome. The genetic position can be in Morgans or centiMorgans, and the physical position is in bases.
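The four snp-file columns can be read with a few lines of code. The sketch below assumes whitespace-separated columns; the SnpRecord class, the function name and the sample line are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    snp_id: str
    chromosome: int          # 23 denotes the X chromosome in this format
    genetic_position: float  # Morgans or centiMorgans, as supplied
    physical_position: int   # bases


def parse_snp_line(line: str) -> SnpRecord:
    # assumption: columns are separated by whitespace
    snp_id, chrom, genetic, physical = line.split()[:4]
    return SnpRecord(snp_id, int(chrom), float(genetic), int(physical))


# Hypothetical line describing a SNP on the X chromosome.
record = parse_snp_line("rs0000 23 0.001 100000")
```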

The indiv file contains 1 line per individual, and has 3 columns:

  • 1st column: Sample_ID
  • 2nd column: Gender
  • 3rd column: Status

The gender column can be M (male), F (female) or U (unknown). The status column might refer to Case or Control status, or might be a population group label.  If this entry is set to "Ignore", then that individual and all genotype data from that individual will be removed from the data set in all convertf output.

The name "ANCESTRYMAP format" is used for historical reasons only.  This software is completely independent of our 2004 ANCESTRYMAP software.
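The handling of the "Ignore" status in the indiv file can be sketched as a simple filter. This is only an illustration of the rule described above, not convertf's own code; it assumes whitespace-separated columns, and the function name and sample IDs are hypothetical.

```python
from typing import Iterable, List, Tuple


def read_indiv(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Return (sample_id, gender, status) tuples, dropping 'Ignore' rows."""
    kept = []
    for line in lines:
        fields = line.split()  # assumption: whitespace-separated columns
        if not fields:
            continue
        sample_id, gender, status = fields[:3]
        if gender not in ("M", "F", "U"):
            raise ValueError(f"unexpected gender code in line: {line!r}")
        if status == "Ignore":
            continue  # this individual is removed from all output
        kept.append((sample_id, gender, status))
    return kept


# Hypothetical toy data: SAMPLE2 is marked Ignore and is filtered out.
individuals = read_indiv([
    "SAMPLE0 F Case",
    "SAMPLE1 M Control",
    "SAMPLE2 U Ignore",
])
```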

EIGENSTRAT Format:

Used by EIGENSTRAT (both in the 07/23/06 release and in the current release).

  • genotype file: see example.eigenstratgeno
  • snp file:      see example.snp (same as above)
  • indiv file:    see example.ind (same as above)

The genotype file contains 1 line per SNP. Each line contains 1 character per individual:

  0 means zero copies of reference allele.

  1 means one copy of reference allele.

  2 means two copies of reference allele.

  9 means missing data.
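A single EIGENSTRAT genotype line can be decoded character by character. The sketch below maps each character to a reference-allele count and uses None for the missing-data code 9; the function name is hypothetical.

```python
from typing import List, Optional

# One character per individual: 0, 1 or 2 copies of the reference allele,
# or 9 for missing data.
EIGENSTRAT_CODES = {"0": 0, "1": 1, "2": 2, "9": None}


def decode_eigenstrat_line(line: str) -> List[Optional[int]]:
    """Decode one SNP line into a per-individual list of allele counts."""
    return [EIGENSTRAT_CODES[ch] for ch in line.strip()]


# Hypothetical SNP line covering four individuals.
row = decode_eigenstrat_line("0129\n")
```

Any character other than 0, 1, 2 or 9 raises a KeyError, which is a deliberate choice in this sketch so that malformed lines are not silently accepted.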

The program ind2pheno.perl in this directory will convert from example.ind to the example.pheno file needed by the EIGENSTRAT software. To run this script, type on the command line:

>> ./ind2pheno.perl example.ind example.pheno

PED Format:

  • genotype file: see example.ped    *** file name MUST end in .ped ***
  • snp file:      see example.pedsnp *** file name MUST end in .pedsnp ***
                   convertf also supports .map suffix for this input file name
  • indiv file:    see example.pedind *** file name MUST end in .pedind ***
                   convertf also supports the full .ped file (example.ped) for this input file

Note that the mandatory file name suffixes enable our software to recognize this file format.

The indiv file contains the first 7 columns of the genotype file (see below).

The genotype file contains 1 line per individual.  Each line contains 7 columns of information about the individual (the 7th is optional), plus two genotype columns for each SNP, in the order the SNPs are specified in the snp file.

The first 7 columns are:

  • 1st column is family ID.
  • 2nd column is sample ID.
  • 3rd and 4th columns are sample IDs of parents.
  • 5th column is gender (male is 1, female is 2).
  • 6th column is case/control status (1 is control, 2 is case) OR quantitative trait value OR population group label.
  • 7th column (this column is optional) is always set to 1.

convertf does not support pedigree information, so the 1st, 3rd and 4th columns are ignored in convertf input and set to arbitrary values in convertf output. In the two genotype columns for each SNP, missing data is represented by 0.
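The PED genotype line layout described above can be sketched as follows. This is an illustration only: it assumes whitespace-separated fields and single-letter allele codes, and the function name and sample line are hypothetical. Files that omit the optional 7th column would need n_header=6.

```python
from typing import List, Optional, Tuple

Genotype = Optional[Tuple[str, str]]


def parse_ped_line(line: str, n_header: int = 7):
    """Split one PED line into (sample_id, gender, status, genotypes)."""
    fields = line.split()  # assumption: whitespace-separated fields
    header, alleles = fields[:n_header], fields[n_header:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two genotype columns per SNP")
    # Columns 1, 3 and 4 (family and parent IDs) are not used here,
    # mirroring the statement that convertf ignores pedigree information.
    sample_id, gender, status = header[1], header[4], header[5]
    genotypes: List[Genotype] = []
    for i in range(0, len(alleles), 2):
        first, second = alleles[i], alleles[i + 1]
        # A 0 in the genotype columns represents missing data.
        genotypes.append(None if "0" in (first, second) else (first, second))
    return sample_id, gender, status, genotypes


# Hypothetical line: a female control with three SNPs, the second missing.
parsed = parse_ped_line("1 SAMPLE0 0 0 2 1 1 A G 0 0 C C")
```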

The snp file contains 1 line per SNP.  There are 4 columns:

  • 1st column: Chromosome_Num
  • 2nd column: SNP_ID
  • 3rd column: Genetic_Position
  • 4th column: Physical_Position

Use X for X chromosome. The genetic position is in Morgans, and the physical position is in bases.
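The PED snp file uses a different column order from the ANCESTRYMAP/EIGENSTRAT snp file (chromosome first, and X rather than 23). The sketch below rewrites one PED-style line into the other layout. It is a hand-written illustration under the assumption of whitespace-separated columns, not a description of what convertf does internally; the function name and sample line are hypothetical.

```python
def pedsnp_to_snp_line(line: str) -> str:
    """Reorder a PED snp line into SNP_ID, chromosome, genetic, physical."""
    chrom, snp_id, genetic, physical = line.split()[:4]
    if chrom.upper() == "X":
        chrom = "23"  # the ANCESTRYMAP/EIGENSTRAT snp file uses 23 for X
    return f"{snp_id} {chrom} {genetic} {physical}"


# Hypothetical PED snp line for a SNP on the X chromosome.
converted = pedsnp_to_snp_line("X rs0000 0.001 100000")
```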

The indiv file contains the first 7 columns of the genotype file.

The PED format is used by the PLINK package of Shaun Purcell. See https://www.cog-genomics.org/plink2

PACKEDPED Format:

  • genotype file: see example.bed    *** file name MUST end in .bed ***
  • snp file:      see example.pedsnp *** file name MUST end in .pedsnp ***
                   convertf also supports .map suffix for this input file name
  • indiv file:    see example.pedind *** file name MUST end in .pedind ***
                   convertf also supports a .ped file (example.ped) for this input file

Note that the mandatory file name suffixes enable our software to recognize this file format.

example.bed is a packed binary file (2 bits per genotype).

The PACKEDPED format is used by the PLINK package of Shaun Purcell. See https://www.cog-genomics.org/plink2

For input in PACKEDPED format, the snp file MUST be in genomewide order.

For input in PACKEDPED format, the genotype file MUST be in SNP-major order (the PLINK default: see PLINK documentation for details).
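To make "2 bits per genotype" and "SNP-major order" concrete, the sketch below decodes a SNP-major .bed byte string. The details it relies on (a 3-byte header of 0x6C 0x1B 0x01, four genotypes per byte starting from the low-order bits, and the meaning of each 2-bit code) come from PLINK's published .bed layout rather than from this document, so they should be checked against the PLINK documentation. The function name and the toy bytes are hypothetical.

```python
from typing import List, Optional

PLINK_MAGIC = bytes([0x6C, 0x1B])  # assumption: PLINK .bed magic number
SNP_MAJOR = 0x01                   # assumption: mode byte for SNP-major order

# 2-bit code -> copies of the first (A1) allele; None marks missing data.
CODE_TO_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed(data: bytes, n_individuals: int) -> List[List[Optional[int]]]:
    """Return one list of per-individual allele counts for each SNP."""
    if data[:2] != PLINK_MAGIC:
        raise ValueError("not a PLINK .bed file")
    if data[2] != SNP_MAJOR:
        raise ValueError("genotype file is not in SNP-major order")
    bytes_per_snp = (n_individuals + 3) // 4  # each SNP is padded to a byte
    body = data[3:]
    if len(body) % bytes_per_snp != 0:
        raise ValueError("file size does not match the number of individuals")
    rows = []
    for start in range(0, len(body), bytes_per_snp):
        block = body[start:start + bytes_per_snp]
        row = []
        for i in range(n_individuals):
            # Individual i occupies bits 2*(i % 4) and up of byte i // 4.
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            row.append(CODE_TO_COUNT[code])
        rows.append(row)
    return rows


# Toy file: one SNP, three individuals (hom A1, het, missing), one pad slot.
toy = PLINK_MAGIC + bytes([SNP_MAJOR, 0b00011000])
decoded = decode_bed(toy, n_individuals=3)
```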

PACKEDANCESTRYMAP Format:

  • genotype file: see example.packedancestrymapgeno
  • snp file:      see example.snp (same as above)
  • indiv file:    see example.ind (same as above)

Note that example.packedancestrymapgeno is a packed binary file (2 bits per genotype).

DOCUMENTATION OF convertf program:

To run this program, type on the command line:

>> /bin/convertf -p parfile

We illustrate how parfile works via a toy example (see example.perl in this directory):

  • par.ANCESTRYMAP.EIGENSTRAT        converts ANCESTRYMAP to EIGENSTRAT format
  • par.EIGENSTRAT.PED                converts EIGENSTRAT to PED format
  • par.PED.EIGENSTRAT                converts PED to EIGENSTRAT format
  • par.PED.PACKEDPED                 converts PED to PACKEDPED format
  • par.PACKEDPED.PACKEDANCESTRYMAP   converts PACKEDPED to PACKEDANCESTRYMAP format
  • par.PACKEDANCESTRYMAP.ANCESTRYMAP converts PACKEDANCESTRYMAP to ANCESTRYMAP format
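For readers who generate parameter files from a script, the sketch below renders one as text. Several things here are assumptions rather than statements from this document: the "name: value" line syntax, the indivname parameter for the input indiv file, and the output file names (out.*), which are hypothetical. The resulting text would be saved to a file and passed to convertf with the -p option.

```python
from typing import Dict


def render_parfile(params: Dict[str, str]) -> str:
    """Render parameters as one 'name: value' line each (assumed syntax)."""
    return "".join(f"{name}: {value}\n" for name, value in params.items())


# An ANCESTRYMAP -> EIGENSTRAT conversion using the example input files.
params = {
    "genotypename": "example.ancestrymapgeno",
    "snpname": "example.snp",
    "indivname": "example.ind",           # assumption: input indiv parameter
    "outputformat": "EIGENSTRAT",
    "genotypeoutname": "out.eigenstratgeno",  # hypothetical output names
    "snpoutname": "out.snp",
    "indivoutname": "out.ind",
}
parfile_text = render_parfile(params)
```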

Note that the choice of which allele is the reference allele may be arbitrary and thus converting to a new format and back again may change the choice of reference allele.
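The effect of a changed reference allele is easy to see on an EIGENSTRAT genotype line: every count g becomes 2 - g, while missing data stays missing. The sketch below shows that mapping; the function name is hypothetical, and it illustrates the consequence of a swap rather than when convertf performs one.

```python
# Swapping the reference allele turns g copies into 2 - g copies;
# the missing-data code 9 is unchanged.
SWAP = {"0": "2", "1": "1", "2": "0", "9": "9"}


def flip_reference(row: str) -> str:
    """Re-express one EIGENSTRAT SNP line against the other allele."""
    return "".join(SWAP[ch] for ch in row)


flipped = flip_reference("0129")
```

Flipping twice restores the original line, so two files that differ only by such a swap at some SNPs carry the same genotype information.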

DESCRIPTION OF EACH PARAMETER in parfile for convertf:

Parameter Name    Data type   Description                                                                       Possible and Default values
genotypename      String      input genotype file
snpname           String      input snp file
outputformat      String      Can be one of the following: ANCESTRYMAP, EIGENSTRAT, PED,
                              PACKEDPED or PACKEDANCESTRYMAP
genotypeoutname   String      output genotype file
snpoutname        String      output snp file
indivoutname      String      output indiv file

OPTIONAL PARAMETERS

familynames       String      Only relevant if input format is PED or PACKEDPED.
noxdata           Boolean     If set to YES, all SNPs on X chromosome are removed from the data set.
nomalexhet        Boolean     If set to YES, any het genotypes on X chr for males are changed to missing data
badsnpname        String      Specifies a list of SNPs which should be removed from the data set
outputgroup       Boolean     Only relevant if outputformat is PED or PACKEDPED                                 NO

  • familynames: If set to YES, then family ID will be concatenated to sample ID. This supports different individuals with different family IDs but the same sample ID.  The convertf default for this parameter is YES.
  • noxdata: The convertf default for this parameter is NO.
  • nomalexhet: The convertf default for this parameter is NO.
  • outputgroup: This parameter specifies what the 6th column of information about each individual should be in the output. If outputgroup is set to NO (the default), the 6th column will be set to 1 for each Control and 2 for each Case, as specified in the input indiv file. [Individuals specified with some other label, such as a population group label, will be assumed to be controls and the 6th column will be set to 1.] If outputgroup is set to YES, the 6th column will be set to the exact label specified in the input indiv file. [This functionality preserves population group labels.]
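The outputgroup rule can be summarised in a few lines. The sketch below follows the description above and assumes that the labels in the input indiv file are spelled exactly "Case" and "Control"; the function name and the population label are hypothetical.

```python
def sixth_column(status: str, outputgroup: bool = False) -> str:
    """Value written to the 6th PED column for one individual."""
    if outputgroup:
        # outputgroup: YES keeps the exact label from the input indiv file.
        return status
    # outputgroup: NO (the default): 2 for a Case, 1 for a Control and
    # for any other label, such as a population group label.
    return "2" if status == "Case" else "1"


default_case = sixth_column("Case")
default_other = sixth_column("POP_A")             # hypothetical group label
kept_label = sixth_column("POP_A", outputgroup=True)
```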