6.3 Output Files | David Reich Lab

6.3 Output Files

This section describes the format of the output files, whose names can be specified in the parameter file.

indoutfilename specifies the following information for all the samples analyzed:
- Indiv_Id
- Gender
- Status
- Num_valid_genotypes

snpoutfilename specifies the following information for all the markers analyzed:
- Snp_Id
- Chromosome_num
- Genetic_pos
- Physical_pos
- Pop_A_variant_allele_count
- Pop_A_ref_allele_count
- Pop_B_variant_allele_count
- Pop_B_ref_allele_count
- Case_genotype_count
- Control_genotype_count
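
The allele counts in this file can be turned into simple per-population frequencies of the variant allele. The following is a minimal sketch, assuming that each row has been split into the ten fields listed above, in that order. The whitespace delimiter, the sample row and the function name variant_allele_freqs are assumptions made for illustration; they are not taken from the program.

```python
def variant_allele_freqs(fields):
    """Return (freq_A, freq_B) of the variant allele for one marker row.

    `fields` holds the ten columns in the order listed above (assumed layout).
    A frequency is None when a population has no observed alleles.
    """
    # Columns 5-8: Pop_A variant, Pop_A ref, Pop_B variant, Pop_B ref counts.
    a_var, a_ref, b_var, b_ref = (float(x) for x in fields[4:8])

    def freq(variant, ref):
        total = variant + ref
        return variant / total if total else None

    return freq(a_var, a_ref), freq(b_var, b_ref)


# Hypothetical row, assumed whitespace-delimited (all values are made up).
row = "rs0001 1 0.0123 1234567 30 70 80 20 500 480".split()
freq_a, freq_b = variant_allele_freqs(row)
print(freq_a, freq_b)
```

These are raw count ratios only; they are not the same quantity as the iteration-averaged frequencies reported in freqfilename.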

thetafilename specifies the following information for all the analyzed samples:
- Indiv_index
- Indiv_id
- θ_true: “true” value of θ or M, printed only in the simulation mode
- θ_mean: population A ancestry for the autosomes, averaged over all iterations for a particular individual
- θ_sdev: standard deviation of θ_mean
- θX_true: “true” values of θX or MX, printed only in the simulation mode
- θX_mean: population A ancestry for the X chromosome averaged over all the iterations for a particular individual
- θX_sdev: standard deviation of θX_mean
- Status

lambdafilename specifies the following information for all the analyzed samples:
- Indiv_index
- Indiv_Id
- λ_true: “true” value of λ, printed only in the simulation mode
- λ_mean: λ for the autosomes, averaged over all iterations for a particular individual
- λ_sdev: standard deviation associated with λ_mean
- λX_true: “true” value of λX, printed only in the simulation mode
- λX_mean: λ for the X chromosome averaged over all the iterations for a particular individual
- λX_sdev: standard deviation associated with λX_mean

freqfilename specifies the following information for all the markers analyzed:
- SNP_Index: index internal to the program for the snp
- SNP_ID
- chromosome_num
- atrue: “true” reference allele frequency in population A, valid only in simulation mode
- anaive: naïve frequency of the reference allele in population A using the ancestral genotype data
- amean: calculated frequency of the reference allele in population A averaged over all the iterations
- asdev: standard deviation associated with amean
- btrue: “true” reference allele frequency in population B, valid only in simulation mode
- bnaive: naïve frequency of the reference allele in population B using the ancestral genotype data
- bmean: calculated frequency of the reference allele in population B averaged over all the iterations
- bsdev: standard deviation associated with bmean

ethnicfilename specifies the following information for all the markers:
- SNP_Index
- chromosome_num
- SNP_ID
- Avg_ethnicity: Average θ or M over all iterations, and over all individuals at a particular marker.

pubxfile: contains ancestry estimates for either a single marker or a single individual, depending on the usage. In either case, the columns G[0], G[1] and G[2] give the probability of having 0, 1 or 2 PopB chromosomes.
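
The three probabilities can be collapsed into a single expected number of PopB chromosomes, 0·G[0] + 1·G[1] + 2·G[2]. The sketch below assumes the three values have already been read from one row of the file; the function name expected_popb_copies and the renormalisation step (to absorb rounding in the printed values) are illustrative assumptions, not part of the program.

```python
def expected_popb_copies(g0, g1, g2):
    """Expected number of PopB chromosomes (between 0 and 2).

    g0, g1, g2 are the probabilities of carrying 0, 1 or 2 PopB chromosomes.
    They are renormalised by their sum in case the printed values are rounded.
    """
    total = g0 + g1 + g2
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return (g1 + 2.0 * g2) / total


# Hypothetical probabilities (made up for illustration).
copies = expected_popb_copies(0.25, 0.5, 0.25)
print(copies)        # expected PopB copies
print(copies / 2.0)  # corresponding PopB ancestry proportion
```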

localoutfilename: contains the scores for all the markers:
- SNP_Index
- Chromosome_Num
- Physical_Pos
- Genetic_Pos
- Log Genome Score
- Case Control Score
- G(Case): average ancestry for all cases at that marker
- G(control): average ancestry for all controls at that marker
- rpower: Information content

output: the output file, which contains the following information for each Markov chain Monte Carlo iteration:
- Iteration_Num
- θ_mean
- θx_mean
- θ_corr
- λ_mean
- λx_mean
- λ_corr
- t(popA)
- t(popB)
- log score
- log score averaged over iterations

Note that if this file name is not specified in the parameter file, the above is written to the standard output.

If the program is run with checkit = YES, then the results of the data check programs mentioned in Section 5 are directed to the standard output.

As detailed in the paper, we feel that 100 burn-in iterations and 200 follow-on iterations should be sufficient for most analyses. These are the suggested numbers of iterations for most exploratory runs, and the user can increase them in order to confirm the results. One can plot the genome-wide score as a function of iteration number to see how well the score converges.
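
One simple way to inspect convergence without a plotting tool is to compute the running average of the per-iteration score once the burn-in iterations have been discarded. The sketch below assumes the log score column has already been extracted from the output file into a list of numbers. The function name running_mean_after_burn_in is hypothetical, and whether the program's own averaged column excludes the burn-in iterations is not stated here, so that choice is an assumption of this example.

```python
def running_mean_after_burn_in(scores, burn_in):
    """Running average of per-iteration scores, ignoring the first `burn_in`.

    Returns one value per retained iteration; a flat tail suggests that the
    averaged score has stabilised.
    """
    means = []
    total = 0.0
    for n, score in enumerate(scores[burn_in:], start=1):
        total += score
        means.append(total / n)
    return means


# Made-up scores: one burn-in iteration followed by three retained ones.
example = running_mean_after_burn_in([5.0, 1.0, 2.0, 3.0], burn_in=1)
print(example)
```

For a real run with the suggested settings, burn_in would be 100 and the list would hold 300 values; if the tail of the running average is still drifting, increasing the number of follow-on iterations is the natural next step.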

The following two files are written when the program is run in simulation mode:

Genotoyoutfilename: specifies genotype data for all the markers and simulated individuals in simulation mode:
- SNP_ID
- Indiv_ID
- Vart_allele_count

Indtoyoutfilename: specifies the following information for the simulated individuals in simulation mode:
- Indiv_ID
- Gender
- Population