In this section we will guide the user through ANCESTRYMAP, using a set of example files provided as part of the download. We shall also go through the steps needed to run user-specific data on the software.
The first step is to download the software:
The software is currently available for the following platforms: UNIX, Linux and Mac.
Click on the download link for UNIX, Linux or Mac.
Make sure to rename the downloaded file to ancestrymap.tar.gz in case the download process has changed the name.
Unzip the file ancestrymap.tar.gz by typing on the command line:
>> gzip -d ancestrymap.tar.gz
You should now see the archive ancestrymap.tar in your directory.
Next, unarchive ancestrymap.tar by typing on the command line:
>> tar -xvf ancestrymap.tar
This should give you the following directory structure under a directory called ancestrymap/:
- README file
In the examplefiles directory we have the following files:
- paramfile: In the format of parameter file for ancestrymap, with additional parameters that are new in Version 2.0
- parmono: In the format of parameter file for cntmono
- param0, param1, param2: Parameter files for ancestrymap, discussed later in this section
- parsim: Parameter file for ancestrymap when running simulations
- parsim2d: Parameter file for ancestrymap when running fine-mapping simulations
- paramped: Parameter file for ancestrymap when using PED files
- par:8001-par:8009 : Parameter files for fine-mapping runs as generated by running the script mkfine.
- Example files used by the convertf executable; see the relevant documentation for details
Input Data Files:
- indiv.dat: individual input file for ancestrymap
- indiv1.dat: individual file for ancestrymap with one or more samples set to Ignore
- geno.dat: genotype input file for ancestrymap
- snpcnts: marker input file for ancestrymap
- badsnps: input file for ancestrymap with markers that need to be removed from the analysis
- snps: marker input file for cntmono or ancestrymap
- aflist, eurlist: Ancestry files for cntmono
The genotype and individual files in this directory were generated by running simulations, and the marker files correspond to data reported in the Smith et al. paper.
- out2.dat: Output file generated by running ancestrymap using paramfile
- out0.dat: Output file generated by running ancestrymap using param0
- out1.dat: Output file generated by running ancestrymap using param1
- outsim2d.dat : Output file generated by running ancestrymap using parsim2d
- admckout.dat: Output file generated by running admcheck on out1.dat
- indjunk: Output file with detailed individual data created by running ancestrymap using param0
- snpjunk: Output file with detailed marker data created by running ancestrymap using param0
- Fine-mapping Run Output Files:
- badlist1, framelist1: Output files generated by mkfine script
- badlist:8001-badlist:8009: Bad marker files generated by mkfine for each fine-mapping run
- xx:8001-xx:8009: Output file for each fine-mapping run
- outfiles/: This is a directory which contains the output files mentioned in Section 6, and ancestry estimates for various markers considered in the fine-mapping run (gams:8001-gams:8009).
bin/ has the following executables:
- admcheck: A Perl script which is used to extract the top “bad” markers
- mkfine: A Perl script which is used to kick off the fine-mapping runs
- parfine.temp: An accompanying template parameter file needed by the mkfine script
- addcol, uniqit: Helper Perl scripts needed by mkfine
src/ has the C source code for making the ancestrymap, cntmono, baseprog and convertf executables and the library nicklib.a, together with a makefile called Makefile. The makefile can be used to make just the individual executables, just the library, or everything together. src/ has the following directories under it:
- smartinclude/ has header files which are needed by the source code; users should not delete these files, so that the code compiles properly.
- smarttables/ is needed by the source code.
- nicksrc/: nicklib source code
After the user has successfully downloaded all the files, the next step is to run the example files included. In the next few sections we will discuss the steps involved, where at each step we will focus on a particular parameter file and its corresponding output file.
The first step makes sure that the input files are in the right format. In this step we look at the parameter file parbaseprog and its corresponding output file outbaseprog.dat. To run this, type on the command line in the examples directory:
>> ./ancestrymap -p parbaseprog > outbaseprog.dat &
If there is any problem with any of the input files, one will see an appropriate message in the output file.
The next step performs a couple of data checks. In this step we look at the parameter file param0, and its corresponding output file out0.dat. The key parameter values in this file are numburn = 0, numiters = 0, checkit = YES and details = YES.
To run this parameter file, type on the command line in the examples directory:
>> ./ancestrymap -p param0 > outp0 &
Compare the output files out0.dat (in the examples directory) and outp0 to make sure that you can understand the output generated. Note that the use of the random number generator makes it impossible for the results to be exactly the same for two runs unless the parameter seed has the same value.
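One way to make this comparison is a line-by-line diff. The sketch below uses Python's standard difflib module. The helper name diff_outputs is ours and is not part of ANCESTRYMAP, and the sample lines are placeholders made up for illustration, not real ANCESTRYMAP output. Some differing lines are expected whenever the two runs used different seeds.

```python
import difflib


def diff_outputs(reference_lines, new_lines,
                 ref_name="out0.dat", new_name="outp0"):
    """Return the unified-diff lines between a reference output and a new run.

    An empty list means the two outputs are identical.
    """
    return list(difflib.unified_diff(
        reference_lines, new_lines,
        fromfile=ref_name, tofile=new_name, lineterm=""))


if __name__ == "__main__":
    # Placeholder text only; real ANCESTRYMAP output looks different.
    reference = ["section A", "value 1.00", "section B"]
    new_run = ["section A", "value 1.02", "section B"]
    for line in diff_outputs(reference, new_run):
        print(line)
```

To compare real files, read each one with open(path).read().splitlines() and pass the two lists to diff_outputs.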
Next, look at the output file indjunk generated by this run. Sorting this file by the Num_valid_genotypes column gives a list of individuals with very few genotypes. We set the Status field to Ignore for some of these individuals in indiv1.dat, a copy of the original individual file, since a lot of missing data will cause ANCESTRYMAP to behave badly. One should also discard markers which have low parental genotype counts; these can be identified from the fields PopA_vart, PopA_ref, PopB_vart and PopB_ref in the file snpjunk. The discarded markers can be put in the “badsnpname” file. Next, look at the output file out0.dat, where the checkdup and fastdup programs have flagged a number of duplicate individuals. We also set the Status field to Ignore for one individual of each such pair in the indiv1.dat file.
Thus the key focus in this step is to ensure that ANCESTRYMAP can successfully process the input files, and to identify individuals which are duplicates or have very few genotypes, and markers with low parental genotype counts.
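The sorting of indjunk described in this step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it assumes the file is a whitespace-delimited table whose first non-blank line is a header containing the column name Num_valid_genotypes, which should be checked against the real file. The ID column name Indiv_ID, the sample rows and the threshold of 100 are hypothetical and chosen only for illustration.

```python
def rows_sorted_by_column(lines, column="Num_valid_genotypes"):
    """Parse a whitespace-delimited table with a header row and return the
    data rows as dicts, sorted by ascending integer value of `column`."""
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows:
        raise ValueError("no header row found")
    header, data = rows[0], rows[1:]
    if column not in header:
        raise ValueError(f"column {column!r} not found in header {header}")
    records = [dict(zip(header, fields)) for fields in data]
    return sorted(records, key=lambda r: int(r[column]))


def low_count_ids(lines, threshold, id_column,
                  column="Num_valid_genotypes"):
    """Return IDs of rows whose `column` value is below `threshold`."""
    return [r[id_column]
            for r in rows_sorted_by_column(lines, column)
            if int(r[column]) < threshold]


if __name__ == "__main__":
    # Hypothetical rows; the real indjunk layout may differ.
    sample = [
        "Indiv_ID Num_valid_genotypes",
        "S1 950",
        "S2 12",
        "S3 400",
    ]
    print(low_count_ids(sample, threshold=100, id_column="Indiv_ID"))
```

The same helper could be pointed at snpjunk by passing one of PopA_vart, PopA_ref, PopB_vart or PopB_ref as the column, again assuming that file has a header row with those names.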
The next step involves running a lot of data checking programs. In this step we will look at the parameter file param1, and its corresponding output file out1.dat. The key parameter values in this file are numburn = 5, numiters = 5, checkit = YES and details = YES. This corresponds to having very few burn-in or follow-on iterations and sets up ANCESTRYMAP in the mode to run the various data checking programs.
To run this parameter file, type on the command line in the examplefiles directory:
>> ./ancestrymap -p param1 > outp1 &
Compare the output files out1.dat (in the examples directory) and outp1 to make sure you can understand the various output sections. Note that the use of the random number generator makes it impossible for the output to be exactly the same for two runs unless the parameter seed has the same value.
Note that in the output file there are results from a large number of data checking programs. To extract the top markers that have failed the various checks run the perl script admcheck by typing on the command line:
>>admcheck out1.dat > ancsycheck.dat&
Compare the file ancsycheck.dat with the file admchkout.dat in the examples directory.
Here is an example of the output generated by admcheck and pointers on how to extract the bad markers.
From the ancsycheck.dat file we will pick the markers that are outliers for the various checks, and add them to our badsnpname file, which tells the software to ignore these markers for the rest of the analysis. In addition, for each pair of markers that are in strong linkage disequilibrium with each other, the user must add one marker of the pair to this file. It is necessary to remove these markers since otherwise one will see spurious results. Note that since we don’t really have any bad markers, the badsnps file in the examples directory is just a sample file.
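The bookkeeping for the bad-marker list can be sketched as below. This is an illustration only: it assumes the bad-marker file holds one marker name per line, which should be checked against the sample badsnps file, and the helper name merge_marker_lists and the marker names are hypothetical.

```python
def merge_marker_lists(*marker_lists):
    """Combine several lists of marker names into one list, dropping blank
    entries and duplicates while keeping first-seen order."""
    seen = set()
    merged = []
    for markers in marker_lists:
        for name in markers:
            name = name.strip()
            if name and name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


if __name__ == "__main__":
    # Hypothetical marker names, for illustration only.
    check_outliers = ["rs0001", "rs0002"]
    ld_pair_members = ["rs0002", "rs0003"]
    for name in merge_marker_lists(check_outliers, ld_pair_members):
        print(name)
```

Keeping first-seen order makes it easy to see which check first flagged a marker when the merged list is reviewed by hand.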
Next we will look at the parameter file param2, and its corresponding output file out2.dat. This file specifies 50 burn-in and 100 follow-on iterations, with checkit = NO and details = YES, and uses the badsnps file that we created in the previous step.
To do this, type on the command line in the examples directory:
>> ./ancestrymap -p paramfile > outf &
Compare the output files outf and out2.dat to make sure you can understand the output generated. Note that the use of the random number generator makes it impossible for the results to be exactly the same for two runs unless the parameter seed has the same value. The important things to focus on in this run are the t(Afr) and t(Eur) values, scores for the various chromosomes and the genome log factor value.
In addition to the standard output, this parameter file will also create a number of output files in the outfiles directory. These files are as follows, and have been discussed in detail in the documentation.
Running your own data files with ANCESTRYMAP
Before going through these steps the user should make sure they are able to follow the steps outlined below using the example files.
- Go to the ancestrymap/bin directory:
- Create the following input files:
- snps (list of markers, using the format for snps or snpcnts). Note that if you have the marker data in the snps format, and the genotype data for the parental populations, you can run the auxiliary program cntmono to obtain a file with marker data in the snpcnts format. [An example parameter file for running cntmono is included: ancestrymap/examples/parmono. A separate tutorial describes how to run cntmono.]
- indiv (list of samples, using the format for indiv)
- genodata (genotype data for all samples and markers, using genodata’s format). Missing data can be input as -1 or not included at all. Including genotypes for samples or markers that are not listed in their respective files causes a fatal error.
- Type on the command line:
>> ./ancestrymap -p parbprog > outbaseprog.dat
Here parbprog should be made using the file parbaseprog as a sample. This step ensures that the input files can be read properly.
- Type on the command line:
>> ./ancestrymap -p parc0
Here parc0 should be made using the file param0 as a sample, and corresponds to setting checkit, details = YES, and numburn, numiters = 0. From the file corresponding to the parameter indoutfilename, get a list of individuals which have very few genotypes and set their Status field to Ignore in the individual file. From the file corresponding to the parameter snpoutfilename, get a list of markers which have low parental genotype counts, and put them in the file corresponding to the parameter badsnpname. Next, from the output file generated by this run, extract any pairs of duplicate individuals, and for each pair set the Status field to Ignore in the individual file for one of the two individuals.
- Type on the command line
>> ./ancestrymap -p parc1 > ancsy.out
Use the parameter file ancestrymap/examples/param1 as an example. This will run the various data checking programs.
- Type on command line
The script admcheck will extract the list of the top 10 markers with the highest scores for hwcheck, mapcheck and freqcheck, and the individuals with the highest and lowest scores for checkindiv (see Section 5). Use the guidelines in the documentation, and the example, to choose the bad markers. One should set the Status field to Ignore in the individual file for individuals which fail the checkindiv test, using the guidelines in Section 5.
- Create a file called badmarkers and put the markers (look for outliers) that failed various checks in this file
- Add to the badmarkerlist file one marker from each pair of markers that are in linkage disequilibrium with each other.
- Create a parameter file param using ancestrymap/examples/paramfile as an example and type on the command line:
>> ./ancestrymap -p param
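Because genotypes that refer to unlisted samples or markers cause a fatal error, it can help to pre-check the genotype file before the first run. The sketch below is not part of ANCESTRYMAP. It assumes each genotype record is a whitespace-separated line of marker name, sample ID and genotype value, in that order; verify this against the genodata format description before relying on it. The function name and all IDs in the example are hypothetical.

```python
def find_unknown_genotypes(geno_lines, known_markers, known_samples):
    """Return (line_number, marker, sample) for every genotype record that
    names a marker or sample missing from the known sets."""
    problems = []
    for lineno, line in enumerate(geno_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(
                f"line {lineno}: expected marker, sample and genotype")
        marker, sample = fields[0], fields[1]
        if marker not in known_markers or sample not in known_samples:
            problems.append((lineno, marker, sample))
    return problems


if __name__ == "__main__":
    # Hypothetical IDs; -1 marks a missing genotype, as in the text above.
    markers = {"rs1", "rs2"}
    samples = {"S1", "S2"}
    geno = ["rs1 S1 0", "rs2 S9 1", "rs7 S1 -1"]
    for lineno, marker, sample in find_unknown_genotypes(geno, markers, samples):
        print(f"line {lineno}: marker {marker} / sample {sample} not listed")
```

The marker and sample sets would be built from the first column of the snps and indiv files, assuming the identifier is the first field on each line of those files.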