4. Format of the input data files

This section describes the format of the input files that the executable needs in order to run. The file names are specified in the parameter file used by the program. Fields in all the input files must be separated by white space (spaces or tabs).

Marker file (snpname):

This contains information about the markers being used for the analysis. The format and an example of the file are as follows:

SNP_ID	Chr	Gen_Pos	Phys_Pos	PopA_vart_cnt	PopA_ref_cnt	PopB_vart_cnt	PopB_ref_cnt
rs897634	1	0.031621	2618675	21	189	242	84
rs905135	1	0.035690	2982467	35	71	281	19
CV1944294	1	0.067986	4380773	42	64	277	21

Here Chr is the chromosome number, and Gen_Pos and Phys_Pos are the genetic and physical positions. PopA_vart_cnt and PopA_ref_cnt are the variant and reference allele counts in the parental samples of population A, and the last two columns give the same counts for the parental samples of population B.

The genetic position can be in Morgans or centiMorgans. Valid values for the chromosome number are 1 to 23, or 1 to 22 and X. The markers can be listed in any order in this file; they do not have to be sorted by chromosome number or any other field. Currently the algorithm does not support the Y chromosome, mitochondrial DNA, or the pseudoautosomal region of the X chromosome.
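The column layout and chromosome rules described above can be checked before a run. The following Python sketch is only an illustration and is not part of ANCESTRYMAP; the names parse_marker_line, Marker and VALID_CHROMOSOMES are hypothetical, and the sketch assumes a line has either 4 columns (no parental counts) or all 8.

```python
from typing import NamedTuple, Optional, Tuple

# Chromosome values described in the text: 1 to 23, or 1 to 22 and X.
VALID_CHROMOSOMES = {str(n) for n in range(1, 24)} | {"X"}


class Marker(NamedTuple):
    snp_id: str
    chrom: str
    gen_pos: float   # Morgans or centiMorgans; the unit is not checked here
    phys_pos: int
    # (PopA_vart_cnt, PopA_ref_cnt, PopB_vart_cnt, PopB_ref_cnt), or None
    counts: Optional[Tuple[int, ...]]


def parse_marker_line(line: str) -> Marker:
    """Parse one marker line (hypothetical helper, not an ANCESTRYMAP function)."""
    fields = line.split()  # splits on any run of spaces or tabs
    if len(fields) not in (4, 8):
        raise ValueError(f"expected 4 or 8 columns, got {len(fields)}: {line!r}")
    snp_id, chrom = fields[0], fields[1]
    if chrom not in VALID_CHROMOSOMES:
        raise ValueError(f"unsupported chromosome {chrom!r} for marker {snp_id}")
    counts = tuple(int(f) for f in fields[4:]) if len(fields) == 8 else None
    return Marker(snp_id, chrom, float(fields[2]), int(fields[3]), counts)


marker = parse_marker_line("rs897634\t1\t0.031621\t2618675\t21\t189\t242\t84")
print(marker.snp_id, marker.chrom, marker.counts)
```

A line naming the Y chromosome, for example, would raise ValueError in this sketch, which mirrors the restriction stated above.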

One could alternatively use a file that has only the first four columns, that is, with no parental counts. This will probably give reasonable results, but with lower statistical power, so the user should be cautious about the results in this case. If the user has a marker file with just the first four columns and a genotype file for the parental populations, the file in the above format can be generated with the program cntmono, which is described in detail in Section 11. Before using the output file created by cntmono as the input marker file for ancestrymap, remove the blank lines, the comment lines, and the header line from it. The file should contain only marker details; otherwise ancestrymap will give a fatal error.
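The clean-up of the cntmono output can be scripted. The sketch below is a minimal illustration under two assumptions that are not stated in this document: that comment lines begin with "#", and that the header line begins with "SNP_ID". Both are exposed as parameters of the hypothetical function clean_marker_lines so they can be adjusted to the actual output.

```python
def clean_marker_lines(lines, comment_prefix="#", header_prefix="SNP_ID"):
    """Drop blank lines, comment lines and the header line.

    comment_prefix and header_prefix are assumptions about the cntmono
    output format; change them if the real file differs.
    """
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(comment_prefix):
            continue  # comment line (assumed marker: "#")
        if stripped.startswith(header_prefix):
            continue  # header line (assumed to start with "SNP_ID")
        kept.append(stripped)
    return kept


raw = [
    "# hypothetical comment line",
    "",
    "SNP_ID\tChr\tGen_Pos\tPhys_Pos",
    "rs897634\t1\t0.031621\t2618675\t21\t189\t242\t84",
]
cleaned = clean_marker_lines(raw)
print(len(cleaned))
```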

Badsnps file (badsnpname):

This is a list of markers to exclude from the analysis. These could be markers that fail any of the tests described in Section 5, which are performed during the initial phases of running ANCESTRYMAP when the checkit field is set to YES in the input parameter file. In addition, for each pair of markers that are in strong linkage disequilibrium with each other, one marker of the pair should be excluded.

SNP_ID
rs578459
CV2800274
rs73494

Individual file (indivname):

This contains information about the individuals used in the analysis.

Indiv_ID	Gender	Status
I1	M	Control
I2	M	Case
I3	M	Ignore

The gender field can be M (male), F (female), or U for samples with unknown gender. The status field can be Case, Control, or Ignore; samples whose status is Ignore are excluded from the analysis. This field makes it possible to analyze the same sample set for different hypotheses without creating a new set of individuals each time. For example, suppose we have data from case and control samples for multiple diseases (e.g., Multiple Sclerosis and Prostate Cancer) and want to run ANCESTRYMAP only for MS. We might then use the controls for both diseases as controls, the MS cases as cases, and set the Prostate Cancer cases to Ignore. Also, if during the analysis of a data set we find a problem with a particular sample (e.g., contaminated DNA), we can set its Status field to Ignore, which removes the sample from the analysis.
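The gender and status rules above can be illustrated with a short Python sketch. It is not part of ANCESTRYMAP; the function names are hypothetical, and the sketch assumes the field values are case-sensitive and that the file has exactly three columns per line.

```python
VALID_GENDERS = {"M", "F", "U"}
VALID_STATUSES = {"Case", "Control", "Ignore"}


def parse_individuals(lines):
    """Return {Indiv_ID: (gender, status)} for the given data lines."""
    individuals = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns: {line!r}")
        indiv_id, gender, status = fields
        if gender not in VALID_GENDERS:
            raise ValueError(f"bad gender {gender!r} for {indiv_id}")
        if status not in VALID_STATUSES:
            raise ValueError(f"bad status {status!r} for {indiv_id}")
        individuals[indiv_id] = (gender, status)
    return individuals


def analyzed_individuals(individuals):
    """Keep only the samples that take part in the analysis (not Ignore)."""
    return {i: gs for i, gs in individuals.items() if gs[1] != "Ignore"}


sample = ["I1\tM\tControl", "I2\tM\tCase", "I3\tM\tIgnore"]
individuals = parse_individuals(sample)
print(sorted(analyzed_individuals(individuals)))  # ['I1', 'I2']
```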

Genotype file (genotypename):

This contains the genotypes for the individuals and markers that are listed in the individual and marker files.

SNP_ID	Indiv_ID	Vart_allele_cnt
rs1865056	I1	0
rs1865056	I2	1
rs1865056	I3	0

Note that it is a fatal error for the genotype file to mention a marker that is not specified in the marker file, or an individual that is not specified in the individual file. The possible values for the variant allele count are 0, 1, or 2. For men, the variant allele count on the X chromosome can only be 0 or 1; a value of 2 is invalid in this case. Missing data can be indicated by a count of -1, or by leaving the genotype out of the file altogether. An individual with a large amount of missing data will cause ANCESTRYMAP to behave badly, so it may be a good idea to exclude such individuals from the analysis by setting their Status field to Ignore.
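The genotype rules above can be expressed as a small check. This Python sketch is illustrative only and does not reproduce ANCESTRYMAP's own validation; check_genotype and the marker name rsX are hypothetical, and treating chromosome 23 as the numeric code for X is an assumption drawn from the chromosome ranges given for the marker file.

```python
X_CHROMOSOMES = {"X", "23"}  # assumption: 23 is the numeric code for X


def check_genotype(snp_id, indiv_id, count, marker_chrom, indiv_gender):
    """Validate one genotype record; return the count, or None if missing.

    marker_chrom maps SNP_ID to chromosome (from the marker file);
    indiv_gender maps Indiv_ID to gender (from the individual file).
    """
    if snp_id not in marker_chrom:
        raise ValueError(f"marker {snp_id} is not in the marker file")
    if indiv_id not in indiv_gender:
        raise ValueError(f"individual {indiv_id} is not in the individual file")
    if count == -1:
        return None  # missing data
    if count not in (0, 1, 2):
        raise ValueError(f"invalid variant allele count {count}")
    is_male_x = (marker_chrom[snp_id] in X_CHROMOSOMES
                 and indiv_gender[indiv_id] == "M")
    if is_male_x and count == 2:
        raise ValueError(f"count 2 is invalid for male {indiv_id} on X")
    return count


marker_chrom = {"rs1865056": "1", "rsX": "X"}  # rsX is a made-up marker
indiv_gender = {"I1": "M", "I2": "F"}
print(check_genotype("rs1865056", "I1", 2, marker_chrom, indiv_gender))  # 2
print(check_genotype("rsX", "I2", 2, marker_chrom, indiv_gender))        # 2
print(check_genotype("rsX", "I1", -1, marker_chrom, indiv_gender))       # None
```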

The genotype file can also be supplied as a gzip-compressed .gz file, which the program will decompress and use.
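For users who want to inspect a genotype file themselves, the following Python sketch reads either a plain or a gzip-compressed file. It is not how ANCESTRYMAP reads its input; the function names are hypothetical, and the sketch assumes the file contains only three-column data lines with no header.

```python
import gzip


def open_genotype_file(path):
    """Open a genotype file as text, whether plain or gzip-compressed."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" gives decoded text lines
    return open(path, "r")


def read_genotypes(path):
    """Yield (SNP_ID, Indiv_ID, variant_allele_count) tuples.

    Assumes every non-blank line has exactly three fields and that
    there is no header line.
    """
    with open_genotype_file(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) != 3:
                raise ValueError(f"expected 3 columns: {line!r}")
            yield fields[0], fields[1], int(fields[2])
```

Calling list(read_genotypes(path)) with the path of a genotype file, compressed or not, would return the same records in either case.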