Built-in data error checks | David Reich Lab

5. Built-in data error checks

Next we shall focus on the built-in data checking programs. Like most other methods for whole genome scans, admixture analysis is very sensitive to data problems, and the software incorporates a number of tools to check for the more common kinds of errors. The user is strongly advised to run these tests, because our experience suggests that most data sets, even when carefully curated, contain some problems that can lead to spurious associations with disease. To run these checks on the data, run the ANCESTRYMAP program with the checkit parameter set to YES in the parameter file. The tests and their output are described below.

Hetxcheck:

Checks whether there are any heterozygous genotypes on the X chromosome for the male samples. The program disregards any heterozygous genotype values it finds for male samples on the X chromosome. The output from this check, printed for every marker, is as follows:

SNP_ID NUM_HET NUM_HOMOZY

>> hetxcheck rs211644 0 310

Here NUM_HET and NUM_HOMOZY are the number of heterozygous and homozygous counts respectively on the X chromosome for the male samples.
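The counting behind this check can be sketched in a few lines of Python. This is only an illustration, not the ANCESTRYMAP implementation: the function name, the dictionary layout and the coding of genotypes as 0, 1 or 2 variant alleles are assumptions made for the example.

```python
from collections import Counter

def hetxcheck(genotypes, sex, x_snps):
    """Count heterozygous and homozygous X-chromosome genotypes in males.

    genotypes: {(sample_id, snp_id): number of variant alleles (0, 1 or 2)}
    sex:       {sample_id: "M" or "F"}
    x_snps:    set of SNP ids located on the X chromosome
    (All three structures are hypothetical, chosen for this sketch.)
    """
    het, hom = Counter(), Counter()
    for (sample, snp), g in genotypes.items():
        if snp not in x_snps or sex.get(sample) != "M":
            continue  # only male samples on the X chromosome are checked
        if g == 1:
            het[snp] += 1
        elif g in (0, 2):
            hom[snp] += 1
    return {snp: (het[snp], hom[snp]) for snp in x_snps}

genotypes = {("s1", "rs211644"): 0, ("s2", "rs211644"): 1, ("s3", "rs211644"): 1}
sex = {"s1": "M", "s2": "M", "s3": "F"}
counts = hetxcheck(genotypes, sex, {"rs211644"})
```

Any marker with a non-zero heterozygous count in males deserves a closer look, since males carry a single X chromosome.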

checkgeno

Checks whether there are any genotype values > 2. For each one found, the program prints a warning and ignores that genotype for the rest of the analysis.

>>bad genotype: rs897634 1 4

This test also outputs the total number of good and bad genotypes. Ex:

>>Num good genotypes: 4711298 Num bad genotypes: 0
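As a rough sketch of the same filter in Python, assuming genotypes are stored as integers in a dictionary (the function name and data layout are hypothetical, not taken from the program):

```python
def checkgeno(genotypes):
    """Split genotypes into kept values and bad values (> 2).

    genotypes: {(sample_id, snp_id): integer genotype value}
    Returns (kept, bad), where bad is a list of ((sample_id, snp_id), value).
    """
    kept, bad = {}, []
    for key, g in genotypes.items():
        if g > 2:
            bad.append((key, g))   # reported and excluded from analysis
        else:
            kept[key] = g
    return kept, bad

kept, bad = checkgeno({("s1", "rs897634"): 4, ("s2", "rs897634"): 1})
summary = f"Num good genotypes: {len(kept)} Num bad genotypes: {len(bad)}"
```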

physcheck

This check finds markers whose genetic and physical positions are in conflicting order. Output is printed only for pairs of markers where the physical and genetic positions have been mixed up. This check gives a warning only, since the analysis uses genetic distance and not physical distance.

SNP1_Id SNP2_Id SNP1_Gen_Pos SNP2_Gen_pos SNP1_Phys_Pos SNP2_Phys_Pos

>> physcheck rs11231098 rs435582 0.628 0.649 61996439 41813770
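One way to picture this check is the sketch below. It assumes that the comparison is made between neighbouring markers in genetic order on a single chromosome; the source does not say exactly which pairs the program compares, and the function name and tuple layout are invented for the example.

```python
def physcheck(markers):
    """Flag adjacent marker pairs whose physical order contradicts genetic order.

    markers: list of (snp_id, genetic_pos, physical_pos) for ONE chromosome.
    Returns tuples in the same column order as the physcheck output line.
    """
    ordered = sorted(markers, key=lambda m: m[1])  # sort by genetic position
    flagged = []
    for a, b in zip(ordered, ordered[1:]):
        # genetic position increases but physical position decreases
        if a[1] < b[1] and a[2] > b[2]:
            flagged.append((a[0], b[0], a[1], b[1], a[2], b[2]))
    return flagged

# The pair from the sample output above
pairs = physcheck([("rs11231098", 0.628, 61996439), ("rs435582", 0.649, 41813770)])
```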

Hardy-Weinberg test:

Performs the Hardy-Weinberg equilibrium test for each marker and prints out:

SNP_Id Chr_Num SNP_Index HW_score

>> hwcheck rs897634 1 0 -1.526

A positive HW_score is indicative of too many heterozygous counts, and a negative score is indicative of too many homozygous counts. For markers that are highly differentiated in frequency, a deficit of heterozygotes is often observed in a population such as African Americans (this is called the Wahlund effect in population genetics). Thus a hwcheck result showing an excess of heterozygotes should be a greater cause for worry than one showing a deficit. One should look for outliers in this test.
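The exact statistic used by hwcheck is not given here. As an illustration of a score with the same sign convention, the sketch below uses a standard signed statistic, sqrt(N) * (H_obs / H_exp - 1), whose square equals the usual one-degree-of-freedom chi-square for a biallelic marker. This is an assumption made for the example and may differ from what the program computes.

```python
import math

def hw_score(n_hom_ref, n_het, n_hom_var):
    """Signed Hardy-Weinberg score: positive = excess heterozygotes.

    Illustrative statistic only, not necessarily the one hwcheck prints.
    """
    n = n_hom_ref + n_het + n_hom_var
    if n == 0:
        return 0.0
    p = (2 * n_hom_var + n_het) / (2 * n)   # variant allele frequency
    exp_het = 2 * p * (1 - p)               # expected heterozygote fraction
    if exp_het == 0:
        return 0.0                          # monomorphic marker
    return math.sqrt(n) * ((n_het / n) / exp_het - 1.0)

excess = hw_score(20, 60, 20)    # more heterozygotes than expected
deficit = hw_score(40, 20, 40)   # fewer heterozygotes, as under the Wahlund effect
```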

checkdup

This check looks for duplicate individuals, based on how closely their genotypes match. If more than 75% of the genotypes match for two samples, this test prints out:

>>##Num of genotypes matched: Num of genotypes mismatched

>>##If the status of the two individuals does not match Status1: Status2

>>dup? Indiv_1 Indiv_2

>>match: 1400 mismatch: 2

>>status_1: Case status_2: Control

The above example indicates that these two individuals are probably the same, since their genotypes match almost exactly (1400 matches against 2 mismatches). Note that the test also prints out the status of the two individuals if their statuses do not match.

However, the next example shows that even when more than 80% of the genotypes match between two individuals, they might not be duplicates, since the number of genotypes compared is not very large. The user has to look at the results carefully and decide which pairs of individuals are duplicates and which are not. The user should set their own cutoff for what defines a duplicate pair of individuals.

>>dup? Indiv_2 Indiv_35

>> match: 100 mismatch: 20
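The pairwise comparison can be sketched as follows. The 75% threshold comes from the description above; the function name, the use of None for missing genotypes and the choice to skip markers missing in either sample are assumptions made for the example.

```python
from itertools import combinations

def checkdup(samples, min_match_frac=0.75):
    """Report sample pairs whose genotypes match on more than min_match_frac.

    samples: {sample_id: list of genotypes, None where missing}
    Returns a list of (id_1, id_2, n_match, n_mismatch).
    """
    results = []
    for a, b in combinations(sorted(samples), 2):
        match = mismatch = 0
        for ga, gb in zip(samples[a], samples[b]):
            if ga is None or gb is None:
                continue            # compare only markers typed in both samples
            if ga == gb:
                match += 1
            else:
                mismatch += 1
        total = match + mismatch
        if total and match / total > min_match_frac:
            results.append((a, b, match, mismatch))
    return results

samples = {
    "Indiv_1": [0, 1, 2, 1, 0, 2, 1, 1],
    "Indiv_2": [0, 1, 2, 1, 0, 2, 1, None],
    "Indiv_3": [2, 0, 0, 2, 1, 0, 0, 2],
}
dups = checkdup(samples)
```

Because the loop visits every pair of samples, the running time grows with the square of the number of samples. Reporting the raw match and mismatch counts, rather than only the fraction, lets the user judge whether a high match rate rests on enough genotypes to be convincing.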

New in Version 2.0: fastdup parameter

We allow for a very fast, but far from comprehensive, check for duplicate samples. The basic algorithm chooses 15 markers and looks at the genotypes on these 15 markers. Pairs of individuals with an exact match on the 15 markers are then checked for near duplicates with a slow algorithm that counts matches and mismatches for every marker. We iterate this check fastdupnum times (default for fastdupnum: 10). This check is very fast and has a reasonable chance of finding duplicates, but can be defeated by missing genotypes or genotype errors. To run this check set

fastdup = YES
Set the dupmode parameter to YES for a careful check for duplicates, with running time proportional to the square of the number of samples. The output for this check prints out the duplicate pair IDs, # matches, # mismatches and # valid genotypes for each individual in the pair, and the program then automatically “ignores” one of the samples. The user will still have to set one of the samples to ‘Ignore’ in the sample file on their own.

>>dup? Indiv_1 Indiv_2

>>match: 665 mismatch: 0 1450 1495

>>dup. Indiv_1 ignored
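The fast screening step described above can be sketched as a bucketing pass: pick 15 markers, group samples by their genotypes on those markers, and keep pairs that land in the same group as candidates for the slow comparison. The function name, the random marker choice, the seed argument and the decision to skip samples with a missing genotype on a chosen marker are all assumptions made for this illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

def fastdup_candidates(samples, n_pick=15, n_iter=10, seed=0):
    """Return candidate duplicate pairs that match exactly on n_pick markers.

    samples: {sample_id: list of genotypes, None where missing}
    n_iter plays the role of fastdupnum in the text.
    """
    if not samples:
        return set()
    n_markers = min(len(g) for g in samples.values())
    rng = random.Random(seed)
    candidates = set()
    for _ in range(n_iter):
        picked = rng.sample(range(n_markers), min(n_pick, n_markers))
        buckets = defaultdict(list)
        for sid, genos in samples.items():
            key = tuple(genos[i] for i in picked)
            if None in key:
                continue   # a missing genotype defeats this round for the sample
            buckets[key].append(sid)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates

a = [0, 1, 2] * 10
samples = {"a": a, "b": list(a), "c": [(g + 1) % 3 for g in a]}
pairs = fastdup_candidates(samples)
```

Each candidate pair would then go through a full match and mismatch count over all markers. Repeating the draw several times gives a pair with a few missing or erroneous genotypes more than one chance to be caught.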

mapcheck:

Compares the ancestry estimate obtained from each marker by itself to that predicted by adjacent markers (leaving out the marker of interest). A discrepancy indicates a misspecification of a marker’s genomic position. A negative difference is not worrisome; a positive difference, however, should be investigated more carefully, especially if it is higher than 3 or 4. Note that for this test it is more important to look for outliers than at absolute values alone.

SNP_ID SNP_Index Ancestry Difference

>> mapcheck rs897634 0 -23.679

Here SNP_Index is the marker’s index number internal to the program.

freqcheck

Freqcheck compares the estimated frequencies of an allele from the MCMC (Markov Chain Monte Carlo) with a maximum likelihood fit. S() is a likelihood ratio statistic, approximately chi-square with 1 d.o.f. if the frequencies look fine. This is really a check that the parental frequencies are plausible. S scores above 10 are highly dubious, and scores above 20 indicate a problem. A common reason for this error is an interchange of alleles (flipped marker). As with mapcheck, it is more important to look for outliers.

SNP_ID Chr_num S(All) S(Controls) F(A) F(E) G(A) G(E)

>> freqcheck rs897634 1 1.133 2.563 0.086 0.765 0.062 0.847

Here S(All) is on all the data, and S(Controls) is on controls only, as a very strong disease effect in cases can distort the true frequency. F(A), F(E) are estimated frequencies for the African and European parental samples using the MCMC, G(A), G(E) are the corresponding maximum likelihood fits.
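To put the thresholds of 10 and 20 in context, the tail probability of a chi-square variable with 1 d.o.f. can be computed with the standard library alone, using the identity P(X > s) = erfc(sqrt(s / 2)). The helper names below are made up for the example, and the labels simply restate the thresholds given in the text.

```python
import math

def chi2_1dof_pvalue(s):
    """Upper tail probability of a chi-square variable with 1 d.o.f."""
    return math.erfc(math.sqrt(s / 2.0))

def freqcheck_flag(s):
    """Label an S score using the thresholds quoted in the text."""
    if s > 20:
        return "problem"
    if s > 10:
        return "highly dubious"
    return "ok"

# S(All) and S(Controls) from the sample output line above
for s in (1.133, 2.563):
    print(s, freqcheck_flag(s), chi2_1dof_pvalue(s))
```

Since many markers are tested at once, some moderately large S values are expected by chance, which is one reason to look for outliers rather than at absolute values.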

leave1out:

Removes the marker contributing the most to any association and assesses whether the signal of association persists. If it remains even after leaving out the best marker, it is less likely to be an artifact due to a single marker. This is a computationally expensive check to run, and needs a large amount of disk space and might crash if that is not available.

>> scores for each fake

>> chrom SNP_ID base min max

>> 1 1 fake-1:0 -6.912 -7.063 -5.286

>> 3 1 fake-1:1 -6.965 -7.169 -5.389

>> 4 1 fake-1:2 -6.752 -7.050 -5.425

Here base is the score that we get without using the leave1out algorithm, and min and max are the minimum and maximum scores obtained after leaving out each marker in turn. The max score is not relevant; however, it is a cause for worry if the min score is very much lower than the base score.

This test gives the following output as well for all the chromosomes:

>>chrom base min max

>>best score (chrom) 1 16.761 14.245 16.875

>>best score (chrom) 2 -2.101 -2.829 -1.140

>>best score (chrom) 3 -3.416 -3.645 -1.207

>>best score (chrom) 4 -3.910 -4.308 -2.679

>>best score (chrom) 5 -1.444 -1.687 -0.438

>>best score (chrom) 22 -2.006 -6.231 -1.065

>>best score (chrom) 23 2.736 -4.751 17.266

global score (leave1): 13.208
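Since the quantity of interest is how far min falls below base, it can help to pull the per-chromosome lines out of the log and sort them by that drop. The sketch below assumes the line layout shown in the sample output above; the function name is hypothetical and no cutoff for a worrying drop is implied.

```python
import re

# Matches lines such as ">>best score (chrom) 22 -2.006 -6.231 -1.065"
LINE = re.compile(
    r"best score \(chrom\)\s+(\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)"
)

def leave1out_drops(lines):
    """Return (chrom, base, min, base - min), largest drop first."""
    rows = []
    for line in lines:
        m = LINE.search(line)
        if m:
            chrom = int(m.group(1))
            base, lo, _hi = (float(x) for x in m.group(2, 3, 4))
            rows.append((chrom, base, lo, base - lo))
    return sorted(rows, key=lambda r: r[3], reverse=True)

log = [
    ">>best score (chrom) 1 16.761 14.245 16.875",
    ">>best score (chrom) 22 -2.006 -6.231 -1.065",
    ">>best score (chrom) 23 2.736 -4.751 17.266",
]
drops = leave1out_drops(log)
```

In the sample output, chromosomes 23 and 22 show the largest drops from base to min, so those are the ones a user would inspect first.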

checkindiv

New in Version 2.0

This implements a crude check on whether an individual should be included in the scan, using the idea of estimating global ancestry (the proportion of European ancestry for African Americans). Given the variant allele frequency conditional on ancestry for marker k, we can compute the probability distribution of 0, 1, 2 variant alleles and hence a log-likelihood score L(k). We can also compute the mean and variance of L(k). Accumulating the statistic L(k) over all markers k, we get a statistic whose mean and variance are known, and from it we can compute a Z-score. Individuals with large negative scores (say < -6) should be discarded. In practice we also find large positive scores. These individuals usually have parents with very divergent ancestries, so that the children have marker-by-marker ancestry close to the mean. We recommend that such individuals also not be used in the scan, though this is a minor issue as they will not contribute much to the admixture score. Here is some output from samples that we would not use in a scan. Note that the top 3 individuals have ancestry proportions near 50%.

>>### ID P(E) --- Z-score
>>checkindiv Hi1 0.504 156.756 10.464 1328 888.585 0.669
>>checkindiv Hi2 0.514 143.433 10.451 1149 730.872 0.636
>>checkindiv Hi3 0.506 146.305 10.286 1225 824.703 0.673
>>checkindiv Lo1 0.360 -103.759 -6.970 1128 1604.437 1.422
>>checkindiv Lo2 0.364 -131.407 -9.242 1026 1639.528 1.598
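The Z-score construction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the program's code: genotype probabilities are taken to be binomial in the variant allele frequency, the frequency passed in for each marker is assumed to be already conditioned on the individual's ancestry, and the function name and None coding for missing genotypes are invented for the example.

```python
import math

def checkindiv_z(genotypes, freqs):
    """Z-score of an individual's accumulated log-likelihood.

    genotypes: list of 0/1/2 variant allele counts, None where missing
    freqs:     variant allele frequency at each marker, conditional on ancestry
    """
    total = mean = var = 0.0
    for g, p in zip(genotypes, freqs):
        if g is None or not 0.0 < p < 1.0:
            continue
        probs = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]  # P(0), P(1), P(2)
        logs = [math.log(q) for q in probs]
        mu = sum(q * l for q, l in zip(probs, logs))          # E[L(k)]
        v = sum(q * l * l for q, l in zip(probs, logs)) - mu * mu  # Var[L(k)]
        total += logs[g]   # observed L(k)
        mean += mu
        var += v
    return (total - mean) / math.sqrt(var) if var > 0 else 0.0

unlikely = checkindiv_z([2] * 50, [0.1] * 50)  # genotypes that fit the frequencies badly
typical = checkindiv_z([0] * 50, [0.1] * 50)   # most probable genotype at every marker
```

Genotypes that fit the frequencies poorly push the score strongly negative, while genotypes that fit better than expected push it positive, matching the two kinds of outlier discussed above.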