4. Format of the input data files

This section describes the format of the input files that the executable needs in order to run. The file names are specified in the parameter file used by the program. The fields in all the input files should be separated by white space (spaces or tabs).

  • Marker file (snpname):

This contains information about the markers being used for the analysis. The format and an example of the file are as follows:

 

SNP_ID     Chr  Gen_Pos   Phys_Pos  PopA_vart_cnt  PopA_ref_cnt  PopB_vart_cnt  PopB_ref_cnt
rs897634   1    0.031621  2618675   21             189           242            84
rs905135   1    0.035690  2982467   35             71            281            19
CV1944294  1    0.067986  4380773   42             64            277            21

 

Here Chr is the chromosome number, and Gen_Pos and Phys_Pos are the genetic and physical positions of the marker. PopA_vart_cnt and PopA_ref_cnt are the variant and reference allele counts in the parental samples of population A, and the last two columns give the same counts for the parental samples of population B.

The genetic position can be in Morgans or centiMorgans, and valid values for the chromosome number are 1 to 23, or 1 to 22 and X. The markers can appear in any order in this file; they do not have to be sorted by chromosome number or any other field. Currently the algorithm does not support the Y chromosome, mitochondrial DNA, or the pseudoautosomal region of the X chromosome.
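As an illustration of the marker format, the following Python sketch parses one marker line and checks the chromosome field. The function name parse_marker_line and the dictionary keys are hypothetical and are not part of ANCESTRYMAP. Treating the physical position as an integer is an assumption based on the example above. The sketch also accepts the four-column variant without parental counts.

```python
# Chromosome labels accepted by this sketch: 1 to 23, or X.
VALID_CHROMS = {str(n) for n in range(1, 24)} | {"X"}


def parse_marker_line(line):
    """Parse one whitespace-separated marker line into a dict.

    Accepts either 8 columns (with parental counts) or 4 columns (without).
    """
    fields = line.split()
    if len(fields) not in (4, 8):
        raise ValueError(f"expected 4 or 8 columns, got {len(fields)}")
    snp_id, chrom, gen_pos, phys_pos = fields[:4]
    if chrom not in VALID_CHROMS:
        raise ValueError(f"unsupported chromosome: {chrom}")
    marker = {
        "snp_id": snp_id,
        "chrom": chrom,
        "gen_pos": float(gen_pos),  # Morgans or centiMorgans
        "phys_pos": int(phys_pos),  # assumption: integer position
    }
    if len(fields) == 8:
        # Order: PopA variant, PopA reference, PopB variant, PopB reference.
        marker["counts"] = tuple(int(x) for x in fields[4:])
    return marker


marker = parse_marker_line("rs905135 1 0.035690 2982467 35 71 281 19")
print(marker["chrom"], marker["counts"])
```

A line naming an unsupported chromosome such as Y raises ValueError in this sketch; how the program itself reports such a line is not specified here.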

Alternatively, one can use a file that has only the first four columns, that is, with no parental counts. This will probably give reasonable results, but with lower statistical power, so the user should interpret the results cautiously in this case. If the user has a marker file with just the first four columns and a genotype file for the parental populations, the file in the above format can be generated using the program cntmono, which is described in detail in Section 11. Before using the output file created by cntmono as the input marker file for ancestrymap, remove from it the blank lines, the lines with comments, and the header line. The file should contain only details about the markers; otherwise ancestrymap will give a fatal error.
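The cleanup of the cntmono output can be sketched in Python as follows. The exact layout of that output is not described in this section, so two assumptions are made and should be checked against a real file: comment lines start with "#", and the header line has SNP_ID as its first field. The function name clean_marker_lines is hypothetical.

```python
def clean_marker_lines(lines, comment_prefix="#", header_first_field="SNP_ID"):
    """Keep only marker lines: drop blank lines, comments, and the header.

    comment_prefix and header_first_field are assumptions about the
    cntmono output, not documented facts.
    """
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(comment_prefix):
            continue  # comment line (assumed prefix)
        if stripped.split()[0] == header_first_field:
            continue  # header line (assumed first field)
        cleaned.append(stripped)
    return cleaned


raw = [
    "# example comment",
    "",
    "SNP_ID Chr Gen_Pos Phys_Pos PopA_vart_cnt PopA_ref_cnt PopB_vart_cnt PopB_ref_cnt",
    "rs905135 1 0.035690 2982467 35 71 281 19",
    "   ",
]
cleaned = clean_marker_lines(raw)
print(cleaned)
```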

  • Badsnps file (badsnpname):

This is a list of markers to exclude from the analysis. These could be markers that fail any of the tests described in Section 5, which ANCESTRYMAP performs during its initial phases when the checkit field is set to YES in the input parameter file. In addition, for each pair of markers that are in strong linkage disequilibrium with each other, one marker of the pair should be excluded.

 

SNP_ID
rs578459
CV2800274
rs73494

 

  • Individual file (indivname):     

This contains information about the individuals to be used in the analysis.

 

Indiv_ID  Gender  Status
I1        M       Control
I2        M       Case
I3        M       Ignore

 

The gender field can be M (male), F (female), or U for samples with unknown gender. The status field can be Case, Control, or Ignore; samples whose status is Ignore are excluded from the analysis. This field makes it possible to analyze the same sample set under different hypotheses without creating a new individual file each time. For example, suppose we have data from case and control samples for multiple diseases (e.g. multiple sclerosis and prostate cancer) and want to run ANCESTRYMAP only for MS. We might then use the controls for both diseases as controls, the MS cases as cases, and set the prostate cancer cases to Ignore. Similarly, if during the analysis of a data set we find a problem with a particular sample (e.g. contaminated DNA), we can set its Status field to Ignore to remove that sample from the analysis.
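The status assignment in the example above can be sketched in Python. Everything named here is hypothetical: the function individual_lines, the disease labels "MS" and "PC", and the sample tuples are illustrative only. The sketch writes no header line, which is an assumption about what the individual file should contain.

```python
def individual_lines(samples, target_disease, excluded=()):
    """Build individual-file lines for one hypothesis.

    samples: iterable of (indiv_id, gender, disease), where disease is
             None for a control sample.
    excluded: IDs of problem samples (e.g. contaminated DNA) to ignore.
    """
    lines = []
    for indiv_id, gender, disease in samples:
        if indiv_id in excluded:
            status = "Ignore"   # problem sample
        elif disease is None:
            status = "Control"  # controls from every study are pooled
        elif disease == target_disease:
            status = "Case"
        else:
            status = "Ignore"   # case for a different disease
        lines.append(f"{indiv_id} {gender} {status}")
    return lines


samples = [
    ("I1", "M", None),  # control
    ("I2", "M", "MS"),  # multiple sclerosis case
    ("I3", "M", "PC"),  # prostate cancer case
    ("I4", "F", "MS"),  # MS case with a known sample problem
]
lines = individual_lines(samples, "MS", excluded={"I4"})
print("\n".join(lines))
```

Rerunning the same function with a different target disease produces the individual file for the other hypothesis from the same sample list.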

 

  • Genotype file (genotypename):

This contains the genotypes for the individuals and markers that are listed in the marker and individual files.

 

SNP_ID     Indiv_ID  Vart_allele_cnt
rs1865056  I1        0
rs1865056  I2        1
rs1865056  I3        0

 

Note that it is a fatal error for the genotype file to mention markers or individuals that have not been specified in the marker and individual files, respectively. The possible values for the variant allele count are 0, 1, or 2. For males, the variant allele count on the X chromosome can only be 0 or 1; 2 is an invalid value in this case. Missing data can be specified by -1, or the genotype can simply be left out of the file. An individual with a large amount of missing data will cause ANCESTRYMAP to behave badly, so it may be a good idea to exclude such individuals from the analysis by setting their Status field to Ignore.
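The rules above can be checked before running the program with a sketch like the following. The function names and data structures are hypothetical, and the sketch assumes that chromosome 23 denotes the X chromosome. It is a pre-check written for illustration, not the validation that ANCESTRYMAP itself performs.

```python
X_LABELS = {"23", "X"}  # assumption: 23 is an alternative label for X


def validate_genotypes(rows, marker_chrom, indiv_gender):
    """Return a list of problems found in (snp_id, indiv_id, count) rows.

    marker_chrom: dict mapping SNP_ID to chromosome label.
    indiv_gender: dict mapping Indiv_ID to "M", "F" or "U".
    """
    problems = []
    for snp_id, indiv_id, count in rows:
        if snp_id not in marker_chrom:
            problems.append(f"{snp_id}: not in marker file")
        elif indiv_id not in indiv_gender:
            problems.append(f"{indiv_id}: not in individual file")
        elif count not in (-1, 0, 1, 2):
            problems.append(f"{snp_id} {indiv_id}: invalid count {count}")
        elif (count == 2 and indiv_gender[indiv_id] == "M"
              and marker_chrom[snp_id] in X_LABELS):
            problems.append(f"{snp_id} {indiv_id}: count 2 for a male on X")
    return problems


def missing_fraction(rows, marker_chrom, indiv_gender):
    """Fraction of markers with no observed genotype, per individual.

    A genotype counts as missing if it is -1 or absent from the rows.
    """
    observed = {indiv_id: set() for indiv_id in indiv_gender}
    for snp_id, indiv_id, count in rows:
        if count != -1 and indiv_id in observed and snp_id in marker_chrom:
            observed[indiv_id].add(snp_id)
    total = len(marker_chrom)
    return {i: 1 - len(seen) / total for i, seen in observed.items()}


marker_chrom = {"rs1865056": "1", "rsX": "X"}  # rsX is a made-up marker ID
indiv_gender = {"I1": "M", "I2": "F"}
rows = [
    ("rs1865056", "I1", 0),
    ("rsX", "I1", 2),         # invalid: male with count 2 on X
    ("rs1865056", "I9", 1),   # invalid: I9 is not in the individual file
    ("rs1865056", "I2", -1),  # missing genotype
]
problems = validate_genotypes(rows, marker_chrom, indiv_gender)
fractions = missing_fraction(rows, marker_chrom, indiv_gender)
print(problems)
print(fractions)
```

Individuals whose missing fraction is high are candidates for the Ignore status; the threshold to use is left to the analyst, since this section does not give one.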

 

The genotype file can also be given as a gzip-compressed (.gz) file, which the program will decompress and use.
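When preparing or checking a genotype file outside the program, the same convention can be followed with Python's standard gzip module. This is a minimal sketch: the function names open_text and read_genotypes are hypothetical, the choice to detect compression by the .gz suffix is an assumption, and the file is assumed to have no header line.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decodes the bytes to text
    return open(path, "r")


def read_genotypes(path):
    """Yield (snp_id, indiv_id, count) tuples from a genotype file."""
    with open_text(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) != 3:
                continue  # this sketch skips blank or malformed lines
            snp_id, indiv_id, count = fields
            yield snp_id, indiv_id, int(count)
```

For example, list(read_genotypes(path)) collects every genotype into a list, where path is the location of either a plain or a compressed genotype file.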