The Mycobacterium tuberculosis complex pangenome is small and driven by (sub-)lineage specific regions of difference
M Behruznia(1) M Marin(2) M Farhat(2,3) C J Meehan(1,4)
1:Department of Biosciences, Nottingham Trent University, Nottingham, UK; 2:Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA; 3:Pulmonary and Critical Care Medicine, Massachusetts General Hospital, Boston, MA 02114, USA; 4:Unit of Mycobacteriology, Institute of Tropical Medicine, Antwerp, Belgium
The size and functional content of the Mycobacterium tuberculosis complex (MTBC) pangenome have been much debated in recent years, particularly with respect to within-lineage differences in gene content. We used a curated dataset of 356 complete genomes spanning the human- and animal-associated MTBC lineages to investigate this gene content variation and to identify differences between strains that may affect virulence, metabolism, and evolutionary characteristics.
TB-Profiler was used for lineage assignment and BUSCO for genome quality control. Genome annotation was performed using PGAP, and Panaroo was used for pangenome analysis. Gene clusters were assigned COG functional categories using eggNOG-mapper. Statistical analysis was used to test whether the distribution of accessory genes was significantly associated with (sub-)lineage. Multiple whole-genome alignment was performed using SibeliaZ to identify sub-lineage-specific regions of difference (i.e., deletions > 10 kb in length).
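The abstract does not name the statistical test used for the gene-lineage association. As one possible illustration, the minimal sketch below applies a two-sided Fisher's exact test to a single accessory gene, using a 2x2 table of gene presence/absence against membership of one lineage. The function names, the toy genome labels, and the choice of Fisher's exact test are all assumptions made for illustration, not the authors' pipeline.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def gene_lineage_table(genomes_with_gene, lineage_of, lineage):
    """Count genomes by gene presence (rows) and lineage membership (columns)."""
    a = b = c = d = 0
    for genome, lin in lineage_of.items():
        present = genome in genomes_with_gene
        inside = lin == lineage
        if present and inside:
            a += 1  # gene present, genome in the lineage
        elif present:
            b += 1  # gene present, genome outside the lineage
        elif inside:
            c += 1  # gene absent, genome in the lineage
        else:
            d += 1  # gene absent, genome outside the lineage
    return a, b, c, d


# Hypothetical toy data: 8 genomes, one gene found only in the three L1 genomes
lineage_of = {"g1": "L1", "g2": "L1", "g3": "L1", "g4": "L2",
              "g5": "L2", "g6": "L2", "g7": "L2", "g8": "L2"}
genomes_with_gene = {"g1", "g2", "g3"}

table = gene_lineage_table(genomes_with_gene, lineage_of, "L1")
p_value = fisher_exact_two_sided(*table)
print(table, p_value)
```

In a real analysis, a test like this would be repeated for every accessory gene and (sub-)lineage, so the resulting p-values would need a multiple-testing correction.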
We found a pangenome of 4,115 genes, comprising 3,636 core genes and a small accessory genome of 479 genes, consistent with the clonal evolution of the species. The accessory genome was enriched in transcription, metabolism, and virulence genes, whereas the core genome was enriched in genes linked to lipid metabolism and transport, which are essential in host-pathogen interactions. Despite the compact accessory genome, we identified 296 lineage-specific genes that could help explain differences in metabolism and virulence potential. Regions of difference specific to certain sub-lineages were found throughout the MTBC except in lineage 3, indicating that genomic differences exist both between and within the primary lineages.
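To make the core/accessory split and the notion of a lineage-specific gene concrete, the following minimal sketch partitions a gene presence/absence map and flags accessory genes confined to a single lineage. The 99% core threshold, the definition of "lineage-specific" (present in genomes of exactly one lineage), and all names and toy data are assumptions for illustration; the abstract does not state the criteria the authors used.

```python
def partition_pangenome(presence, n_genomes, core_fraction=0.99):
    """Split genes into core and accessory sets by the fraction of genomes carrying them."""
    core, accessory = set(), set()
    for gene, genomes in presence.items():
        if len(genomes) >= core_fraction * n_genomes:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


def lineage_specific(presence, lineage_of, genes):
    """Map each gene found in genomes of exactly one lineage to that lineage."""
    result = {}
    for gene in genes:
        lineages = {lineage_of[genome] for genome in presence[gene]}
        if len(lineages) == 1:
            result[gene] = next(iter(lineages))
    return result


# Hypothetical toy data: gene -> set of genomes in which it was detected
presence = {
    "geneA": {"g1", "g2", "g3", "g4"},  # present everywhere
    "geneB": {"g1", "g2"},              # only in L1 genomes
    "geneC": {"g2", "g3"},              # spans both lineages
    "geneD": {"g4"},                    # only in one L2 genome
}
lineage_of = {"g1": "L1", "g2": "L1", "g3": "L2", "g4": "L2"}

core, accessory = partition_pangenome(presence, n_genomes=len(lineage_of))
specific = lineage_specific(presence, lineage_of, accessory)
print(sorted(core), sorted(accessory), specific)
```

With the counts reported above, the same bookkeeping gives 3,636 core + 479 accessory = 4,115 genes in total.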