Next Generation Sequencing Statistics

Mean NGS Callable Coverage by Test Type
Columns: Test Type; Callable Loci (Avg, CV); Poznik Loci (Avg, CV); Poor Alignment (Avg, CV); Total Loci (Avg, CV); Read Info (Length, Insert Size, Median Depth, Aligned Reads); Samples (n); Est. Years/SNP; Histogram

Explanation of Metrics

The coverage table is produced from BAMs aligned with BWA-MEM to the GRCh38 reference genome. The BAMs are processed by GATK's CallableLoci tool with default settings. For a location to be considered callable, at least four reads must overlap the site, and no more than ten percent of those reads may have a PHRED-scaled alignment quality below 10. A PHRED score of 10 corresponds to a 10% chance of misalignment, so four such reads yield a heuristic combined quality indicating less than a 0.01% chance that all the reads are aligned incorrectly to the site.
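
The rule described above can be sketched in a few lines of Python. This is a simplified illustration of the thresholds as stated in this text, not GATK's actual implementation; the function names, parameter names, and status labels are hypothetical and chosen only to mirror the table columns.

```python
def mapq_to_error_prob(mapq):
    """Convert a PHRED-scaled alignment quality to a misalignment probability."""
    return 10 ** (-mapq / 10)


def combined_misalignment_prob(mapq, reads):
    """Heuristic chance that ALL reads are misaligned, treating reads as independent."""
    return mapq_to_error_prob(mapq) ** reads


def classify_site(mapqs, min_depth=4, min_mapq=10, max_low_fraction=0.10):
    """Classify one site from the alignment qualities of the reads covering it.

    Thresholds follow the description in the text (assumed, simplified):
    at least four reads, and at most 10% of them below quality 10.
    """
    depth = len(mapqs)
    if depth == 0:
        return "NO_COVERAGE"
    if depth < min_depth:
        return "LOW_COVERAGE"
    low_quality = sum(1 for q in mapqs if q < min_mapq)
    if low_quality / depth > max_low_fraction:
        return "POOR_ALIGNMENT"
    return "CALLABLE"


if __name__ == "__main__":
    print(classify_site([60, 60, 60, 60]))    # four well-aligned reads
    print(classify_site([60, 60, 60, 0]))     # one of four reads is ambiguous
    print(combined_misalignment_prob(10, 4))  # roughly 1e-4, i.e. 0.01%
```

Treating the reads as independent is what makes the combined figure a heuristic rather than a true probability.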

Callable Loci - The total number of sites covered by four or more reads, where each read has less than a 10% likelihood of being misaligned.

Poznik Loci - The callable sites occurring in Poznik et al. (2013)'s call mask. These regions are a superset of the combBED used by Adamov's group. The sites include areas outside of Big Y's targeted sequencing regions.

Poor Alignment - The sites having four or more reads where the aligner indicates a high likelihood that the placement is not correct.

Length - The number of bases in reads delivered by the lab. Longer lengths allow more accurate alignment, but increase the likelihood of sequencing errors.

Insert Size - The estimated size of the DNA fragment in the library. Aligners use the insert length to find the best fit for the pair on the reference genome. This number should be more than twice as long as the read length for paired-end reads to minimize overlap.
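
As a rough illustration of the "more than twice the read length" guideline, the sketch below estimates how many bases the two mates of a pair would share. It assumes the insert size is the full fragment length covered by both reads; the function name is hypothetical.

```python
def paired_end_overlap(read_length, insert_size):
    """Bases covered by both mates, assuming insert_size spans the whole fragment."""
    return max(0, 2 * read_length - insert_size)


if __name__ == "__main__":
    print(paired_end_overlap(150, 400))  # insert longer than two reads: no overlap
    print(paired_end_overlap(150, 250))  # short insert: the mates overlap
```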

Median Depth - The median depth of reads aligned to GRCh38 chrY. For a WGS test, the number should be approximately 50% of the advertised sequencing depth because there is only one copy of the Y chromosome.
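
The 50% rule of thumb is simple copy-number arithmetic. A minimal sketch, assuming the advertised depth refers to the two-copy autosomes (the function and parameter names are illustrative only):

```python
def expected_chry_depth(advertised_depth, y_copies=1, autosome_copies=2):
    """Expected chrY depth when the advertised depth is quoted for diploid regions."""
    return advertised_depth * y_copies / autosome_copies


if __name__ == "__main__":
    print(expected_chry_depth(30))  # a 30x WGS test
    print(expected_chry_depth(15))  # a 15x WGS test
```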

Aligned Reads - The number of reads aligned to the Y chromosome.

Samples (n) - The size of the population who have submitted BAMs for this type of test. The numbers are self-selected and likely bear little resemblance to actual market size.

Est. Years/SNP - This figure applies a mutation rate of 8.2e-10 per year per bp to the Callable Loci statistic to estimate the average number of years between SNPs. The constant is sourced from Poznik et al. (2013).
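
One plausible reading of this calculation is that the expected number of SNPs per year is the mutation rate multiplied by the number of callable sites, and Est. Years/SNP is the reciprocal. The sketch below works under that assumption; the function name and the 15,000,000 example value are hypothetical and are not taken from the table.

```python
MUTATION_RATE = 8.2e-10  # mutations per bp per year, as quoted in the text


def est_years_per_snp(callable_loci, rate=MUTATION_RATE):
    """Average years between SNPs across the callable sites (assumed formula)."""
    if callable_loci <= 0:
        raise ValueError("callable_loci must be positive")
    return 1.0 / (rate * callable_loci)


if __name__ == "__main__":
    # Hypothetical sample with 15 million callable sites.
    print(round(est_years_per_snp(15_000_000), 1))
```

Under this formula, more callable sites always means fewer years per SNP, which is why the statistic tracks coverage so closely.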

Total Loci - The sites having at least one read assigned. This metric is less useful when looking at a sample in isolation, but can be useful when looking at the population as a whole. The PAR regions are not included in these statistics as they are treated as part of the X chromosome due to recombination.

Histogram - The histogram is a proportional representation of coverage segments in the population for each test type. The green stacks correspond to the Callable Loci statistic. The red stacks correspond to the Poor Alignment loci. In most cases, tests with longer read fragments will have more green and less red. The black regions are assigned as N (any) bases in the GRCh38 reference sequence.

Special Discussion for WGS Tests

When comparing the coverage statistics, keep in mind that WGS tests contain unassigned Y chromosome reads. The current human reference genome contains 33,591,060 bases that have not been assigned an allele value. The WGS reads that belong in this expanse cannot be placed today. However, as new references are made available or the hardware requirements for de novo assembly become more affordable, the additional data will become more useful.

This is in contrast to targeted tests like Big Y and Y Elite. The targeting process not only makes the tests more economical, it also results in untargeted Y DNA fragments being washed away prior to sequencing. When the reference is updated, the tests need to be redesigned and run again, with all the physical costs of new sequencing. This is the reason Big Y (500) customers are required to pay an upgrade fee for Big Y 700.

Another related issue with the Callable Loci statistic in WGS tests with lower average depth is the nature of read alignment. On average, WGS tests cover the Y chromosome at fifty percent of the test's rated depth, e.g. a 15x test has about 7 reads spanning each location. The similarity of many regions makes the edit distance scoring used to place reads erratic. You will see locations with zero assigned reads and others with more than a thousand. Increased read depth helps to span the gaps around such regions and makes it possible for automated callers to do their work. However, a skilled analyst can use their knowledge of how reads are aligned to make inferences about these gaps by looking at the alternate sites. These issues will be reduced as newer third-generation sequencing techniques are introduced to allow hybridized alignments, but this will also require new testing.