Non-redundant pangenome construction of Mycobacterium tuberculosis
Y Zhou(1,2), R Anthony(3), D Soolingen(2,3), Y Zhao(1)
1: Chinese Center for Disease Control and Prevention; 2: Radboud University MC; 3: RIVM
A pangenome provides a more complete overview of the genetic content of a clade than an individual strain’s genome. However, pangenome analysis is time-consuming and resource-intensive, and the lack of a standardized pipeline compromises reproducibility and reliability. We investigated the pangenome of 421 epidemic Mycobacterium tuberculosis strains in China. Assemblies were produced with SPAdes from Illumina data, excluding short contigs (<200 bp) and contamination. CDSs were annotated using Prokka and homologues were clustered using get_homologues. Redundant genes were merged for clusters with >90% identity and >90% alignment, and genes overlapping by >60 bp were filtered using in-house scripts. The COG and OMCL algorithms identified 7702 and 7995 homologous clusters, respectively. After merging redundant genes, 5071 clusters remained, and >98% of the merged clusters contained fewer than 450 entries. Of these 5071 clusters, 1165 did not cluster with annotated H37Rv (Rv) genes. We checked the overlap of these clusters with Rv genes and excluded 575 clusters at this step, leaving 590 clusters as new genes. Ignoring any mutations in genes, the final pangenome for the 421 genomes consists of 4574 genes/ORFs: 2980 core genes, 1117 soft-core genes, 198 shell genes, and 279 cloud genes. The 590 new genes comprise 106 core genes, 80 soft-core genes, 132 shell genes, and 272 cloud genes. Of these “new genes”, 184 are present in the H37Rv genome but are not considered CDSs in the latest available annotation data. The Heaps’ law index is 1.000013, indicating that this sample set of 421 genomes has a closed pangenome.
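The merging and compartment steps described above can be illustrated with a minimal Python sketch. It is not the in-house script used in this work. It assumes single-linkage merging of clusters whose pairwise hits exceed both 90% cutoffs, and it assumes illustrative compartment thresholds (soft-core at 95% of genomes or more, cloud at two genomes or fewer) that are not stated in the abstract. The function names `merge_redundant` and `classify`, the input layout, and the toy data are hypothetical.

```python
from collections import Counter


def merge_redundant(cluster_ids, hits, min_identity=90.0, min_coverage=90.0):
    """Single-linkage merge of clusters whose hits pass both cutoffs.

    hits: iterable of (cluster_a, cluster_b, percent_identity, percent_coverage).
    Returns a dict mapping each cluster id to the id of its merged group.
    """
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Walk to the root, compressing the path along the way.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b, identity, coverage in hits:
        # Both cutoffs are strict (>), following the wording of the abstract.
        if identity > min_identity and coverage > min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller id as the group label.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    return {c: find(c) for c in cluster_ids}


def classify(n_present, n_genomes, soft_core_fraction=0.95, cloud_max=2):
    """Assign a compartment from the number of genomes carrying a gene.

    The thresholds are assumptions made for illustration only.
    """
    if n_present == n_genomes:
        return "core"
    if n_present >= soft_core_fraction * n_genomes:
        return "soft-core"
    if n_present <= cloud_max:
        return "cloud"
    return "shell"


# Toy data: c1 and c2 are redundant; the c2-c3 hit fails the identity cutoff.
groups = merge_redundant(
    ["c1", "c2", "c3"],
    [("c1", "c2", 97.5, 99.0), ("c2", "c3", 85.0, 95.0)],
)
print(groups)
print(Counter(classify(n, 421) for n in (421, 410, 150, 1)))
```

Single-linkage merging is transitive, so two clusters can end up in one group without sharing a direct hit above the cutoffs. A stricter criterion would need a different grouping rule.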