Analysis Workflow
The YDNA-Warehouse builds its phylogentic tree using a pipeline of bio-informatics packages. This page outlines the full process.
Step 1: Align Sample To CHM13v2.0 using BWA-MEM
To ensure consistency all incoming submissions are realigned with BWA-MEM to CHM13v2.0 (the first complete gap-less human reference.) Upon completion basic alignment quality checks are performed and the results are made available on test kit's Coverage Analysis tab. CHM13v2.0's chrY corresponds to the CP086569.2 Accession. Y Reads indicates the total number of reads with primary alignment to chrY. Callable notes the number of base positions on chrY where at least 4 reads for short-read tests (or 2 reads for long-read tests**) have mapping qualities indicating a confident mapping position. SNPs found in the Callable regions are eligible to included in the tree later in the process.
A sample of the pipeline used to realign BAM submissions can be accessed from GitHub. The script collates the read pairs in a previously aligned BAM.
Then pipes the results into samtools fastq
to feed the reads into BWA-MEM. The pipeline then fixes the mates, sorts the resulting BAM, and marks the duplicate reads into a CRAM file.
** Long-read tests using PacBio's HiFi reads or Oxford Nanopore Technology reads are aligned using Minimap2 instead of BWA-MEM.
Step 2: Prepare a gVCF file
Upon successful QC of the newly aligned CRAM file, a Dragstr parameter model is created. The sample is then processed with GATK4's HaplotypeCaller to produce a gVCF of raw, unfiltered SNP and indel calls. The gVCF is then genotyped with forced output for every known site in the tree. The resulting VCF is then ingested in the reporting database to find the sample's current position in the phylogentic tree. If the assigned branch is aged in the genealogical era, the sample becomes eligible for SNP Matching.
Step 3: Add gVCF to GenomicsDB for Batch Processing
The sample's gVCF is then loaded into one or more GenomicsDBs for joint genotyping a batch of samples. The batches are partitioned by major YDNA subclade, and a special group of short-read WGS and long-read HiFi samples to maintain the root of the tree.
Step 4: Batch Joint Genotyping and Filtering
When sufficient new samples have been added to a Batch, the corresponding GenomicsDB is genotyped to create a VCF of all the chrY SNPs found in the cohort of samples in the batch. The resulting VCF is then filtered to remove indels, SNP site that are not called in more than 10% of the cohort or have low quality scores. The resulting file is then converted to PHYLIP format and SNP site map.
Step 5: Construct Maximum Likelihood Tree
The combined PHYLIP file is then run through IQ-TREE to find the maximum likelihood tree for the cohort using
iqtree2 -s chrY.min4.phy -B 1000 -alrt 1000 -T auto
. If the tree fails to converge, the results are analyzed for SNPs causing problems and additional
filtering is applied to the joint genotyped VCF from Step 4. The process repeats until convergence of the tree.
The ancestral sequence is reconstructed for each internal node of the final consensus tree. Using the ancestral states and the site map from Step 4, the binary tree is processed to eliminate interior nodes and join their children to its parent. This forms the basis of the phylogentic tree. The subclade trees are stitched into the backbone tree at the correct node.
Finally, the taxa nodes from the individual testers may have private SNPs unique to their branch. These branches are not visible in the public tree, but are reported to the sample's owner. They can be viewed on the Private SNPs tab, if there are any sites remaining.