Genomic epidemiology of tuberculosis using long-read sequencing to increase resolution of transmission clusters

AM García-Marín(1,2) M Torres-Puente(1) M Moreno-Molina(1) M Hunt(3,5) Z Iqbal(3,4) F González-Candelas(2,6) J Alonso-del-Real(1) I Comas(1,6)

1:Tuberculosis Genomics Unit, Instituto de Biomedicina de Valencia (IBV-CSIC), Valencia, 46010, Spain; 2:Joint Research Unit ‘Infección y Salud Pública’ FISABIO-University of Valencia, Institute for Integrative Systems Biology (I2SysBio), Valencia, 46010, Spain; 3:European Bioinformatics Institute, Hinxton, CB10 1SD, UK; 4:Milner Centre for Evolution, University of Bath, Bath, BA2 7AZ, UK; 5:Nuffield Department of Medicine, University of Oxford, Oxford, OX3 7BN, UK; 6:CIBER of Epidemiology and Public Health (CIBERESP), Madrid, 28029, Spain

Genomic epidemiological studies of Mycobacterium tuberculosis (Mtb) often use short-read sequencing, which relies mainly on reference-mapping-based analysis. However, this approach examines only 90% of the Mtb genome because short reads map inaccurately to repetitive regions. To overcome this limitation, we used long-read whole-genome sequencing to accurately reconstruct the whole genomes of isolates from Mycobacterium tuberculosis complex (MTBC) culture-positive cases collected over one year (2016) in the Valencia Region (Spain). We aimed to: i) evaluate the added value of long-read sequencing for gaining epidemiological resolution; and ii) reveal the genetic diversity in repetitive regions at an epidemiological scale.

We sequenced 216 of 266 MTBC clinical isolates using the PacBio Sequel II platform, obtaining 212 high-quality complete genomes via de novo assembly. Short-read sequencing data, obtained on the Illumina MiSeq platform and analyzed by reference mapping, were available for comparison. Pairwise distances between assemblies were consistently higher than those obtained from short-read sequencing. At a global level, a mean of 410 SNPs was gained. When focusing on closely related samples (20 SNPs according to Illumina data), we gained a median of 1 SNP. In 60% of pairs, we detected more SNPs with PacBio data, rising to 70% when indels were also considered. Most of the SNPs were located in regions that are masked in the Illumina analysis. Additionally, one of these samples presented high genetic diversity in PE-PGRS28, possibly linked to a gene conversion event. In conclusion, our findings indicate greater genetic diversity between samples than short-read data reveal, which can provide valuable insights into transmission within clusters.
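
The comparison described above (pairwise distances from two data sources, then the gain among closely related pairs) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it assumes each data source has already been reduced to equal-length aligned sequences per sample, the function names are hypothetical, and reading the 20-SNP criterion as "at most 20" is an assumption, since the abstract does not state the exact comparison.

```python
from itertools import combinations
from statistics import median


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count aligned positions that differ, skipping gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


def pairwise_distances(alignment):
    """Map each unordered sample pair to its SNP distance.

    `alignment` is a dict of sample name -> aligned sequence.
    """
    return {
        (s1, s2): snp_distance(alignment[s1], alignment[s2])
        for s1, s2 in combinations(sorted(alignment), 2)
    }


def summarize_gain(short_read, long_read, threshold=20):
    """Summarize SNPs gained by long reads among closely related pairs.

    A pair counts as closely related when its short-read distance is
    <= threshold (assumed interpretation of the 20-SNP criterion).
    """
    close_pairs = [pair for pair, d in short_read.items() if d <= threshold]
    gains = [long_read[pair] - short_read[pair] for pair in close_pairs]
    if not gains:
        return None
    return {
        "pairs": len(gains),
        "median_gain": median(gains),
        "fraction_with_gain": sum(g > 0 for g in gains) / len(gains),
    }


if __name__ == "__main__":
    # Toy data: 'N' marks positions masked in the short-read analysis.
    short_aln = {"s1": "ACGTNN", "s2": "ACGANN", "s3": "ACGTNN"}
    long_aln = {"s1": "ACGTAC", "s2": "ACGAAT", "s3": "ACGTAC"}
    print(summarize_gain(pairwise_distances(short_aln),
                         pairwise_distances(long_aln)))
```

In this toy example, the masked positions contribute nothing to the short-read distances, so any differences that fall inside them appear only in the long-read distances, which mirrors the pattern reported for masked regions.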
