Overcoming H37Rv bias: a k-mer based approach to isolate-specific masking

D Whiley(1) C J Meehan(1)

1:School of Science and Technology, Nottingham Trent University, Nottingham, UK

The Mycobacterium tuberculosis complex (MTBC) is monomorphic, with low genetic diversity. Genome masking is routinely adopted in MTBC isolate analysis to reduce false positive variant calls in highly variable regions (e.g. PE/PPE genes). Recent studies have shown that H37Rv is likely to be over-masked, leading to the removal of true positive SNPs from analyses. Additionally, such masking is H37Rv-specific, making analyses in other lineages difficult. The result is a range of differing masking approaches and a lack of masking schemes suitable for other MTBC lineages, which confounds comparative analyses. This also has implications for the widely accepted practice of inferring transmission links between isolates using 5- and 12-SNP thresholds, as different masking schemes produce different SNP counts.
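As a rough illustration of why the masking scheme matters for threshold-based linkage, the sketch below counts SNPs between two aligned sequences while skipping masked positions. The function name, the half-open interval format and the toy sequences are all assumptions made for this example; they are not taken from the pipeline described here.

```python
def masked_snp_distance(seq_a, seq_b, masked_intervals):
    """Count differing A/C/G/T positions between two aligned sequences,
    ignoring any position inside a masked half-open [start, end) interval.

    Hypothetical helper for illustration only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    # Expand the intervals into a set of masked 0-based positions.
    masked = set()
    for start, end in masked_intervals:
        masked.update(range(start, end))

    valid = set("ACGT")
    distance = 0
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if pos in masked:
            continue  # position excluded by the masking scheme
        if a in valid and b in valid and a != b:
            distance += 1  # ambiguous bases and gaps are not counted
    return distance


# Toy aligned sequences differing at positions 2, 7 and 9.
iso_a = "ACGTACGTAC"
iso_b = "ACCTACGAAG"

unmasked = masked_snp_distance(iso_a, iso_b, [])
scheme_1 = masked_snp_distance(iso_a, iso_b, [(0, 4)])
scheme_2 = masked_snp_distance(iso_a, iso_b, [(7, 10)])
```

In this toy case the same pair of sequences yields three different distances depending on the mask, so a pair close to a 5- or 12-SNP cut-off could be classified as linked under one scheme and unlinked under another.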

To address these issues, we have developed an automated pipeline that applies consistent masking to any genome using a k-mer based approach. By identifying regions of each genome where self 50-mer mappability is poor, we generated isolate-specific masking regions. We calculated the true pairwise SNP distances within a set of 300 complete, closed genomes from across the diversity of the MTBC. We then calculated false- and true-positive SNP calls between masked paired isolates, followed by mapping Illumina sequencing reads to these pre-masked genomes.
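The idea of masking by self k-mer mappability can be sketched as follows. This is a deliberately simplified stand-in and not the pipeline itself: it assumes exact matching on the forward strand only and flags every position covered by a k-mer that occurs more than once in the same genome, whereas real mappability tools typically also consider the reverse complement and allow mismatches. The function name and the interval output format are assumptions made for this example.

```python
from collections import Counter


def low_mappability_mask(genome, k=50):
    """Return merged half-open [start, end) intervals covered by any k-mer
    that occurs more than once in the genome (exact, forward-strand only).

    Hypothetical helper for illustration only; holding every k-mer in
    memory is fine for a toy input but wasteful for a full genome.
    """
    genome = genome.upper()
    n_kmers = len(genome) - k + 1
    if n_kmers <= 0:
        return []

    # Count how often each k-mer appears in the genome itself.
    counts = Counter(genome[i:i + k] for i in range(n_kmers))

    intervals = []
    for i in range(n_kmers):
        if counts[genome[i:i + k]] > 1:
            start, end = i, i + k
            # Merge with the previous interval if they overlap or touch.
            if intervals and start <= intervals[-1][1]:
                prev_start, prev_end = intervals[-1]
                intervals[-1] = (prev_start, max(prev_end, end))
            else:
                intervals.append((start, end))
    return intervals


# Toy example with k=4: "ACGT" occurs at positions 0 and 4.
toy_mask = low_mappability_mask("ACGTACGTTTGCA", k=4)
```

Intervals in this format could be written out as a BED-style file, one per isolate, so that each genome carries its own mask rather than inheriting one defined on H37Rv.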

We show that this pipeline can be applied to any MTBC isolate to create strain-specific masking files. The approach yields a minimum SNP distance between isolates, prioritising the minimisation of false positive SNP calls. This allows for consistent comparisons of MTBC isolates and paves the way for the use of non-H37Rv MTBC reference genomes.
