Analysis of the limited Mtb pan-genome reveals potential pitfalls of pan-genome analysis approaches
M G Marin(1), M Behruznia(2), C Meehan(2,3), M R Farhat(1,4)
1:Department of Biomedical Informatics, Harvard Medical School; 2:Department of Biosciences, Nottingham Trent University; 3:Unit of Mycobacteriology, Institute of Tropical Medicine; 4:Pulmonary and Critical Care Medicine, Massachusetts General Hospital
Benchmarking the accuracy of pan-genome analysis methods is difficult given the high diversity of many bacterial species and the common lack of ground-truth datasets with complete assemblies. Mycobacterium tuberculosis (Mtb) is a highly clonal bacterium with no horizontal gene transfer, which makes it well suited for evaluating the performance and tradeoffs of existing pan-genome analysis approaches. In this work we used a diverse set of 158 Mtb isolates with complete genome assemblies generated using hybrid long- and short-read sequencing. Across the 22 parameter configurations of pan-genome analysis approaches that we tested, we observed a surprisingly large range of accessory genome size predictions, from 314 to 2951 genes. To complement this analysis, we built an Mtb pan-genome graph to detect accessory regions at the nucleotide level. From our pan-genome graph approach we found that only 5.5% (70 kb) of the variation represented novel sequence content relative to the H37Rv reference genome. We demonstrate that pan-genome analysis predictions depend heavily on the choice of software and on the quality of the dataset used. If pan-genome analysis methods are not used properly, they can greatly inflate the predicted accessory genome. We find that pan-genome graphs are better than common coding-sequence-centric approaches for identifying the loss or gain of sequences. Our nucleotide-level analysis of the Mtb pan-genome graph further supports that the Mtb genome is evolving in a clonal manner and has a limited accessory genome.
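As a purely illustrative aside, the accessory genome size reported by gene-centric tools is typically a count derived from a gene presence/absence matrix. The sketch below shows one way such a count could be computed. The function name `split_core_accessory`, the 99% core threshold, and the toy matrix are assumptions made for illustration and are not taken from this work.

```python
from typing import Dict, List, Tuple


def split_core_accessory(
    presence: Dict[str, List[int]],
    core_threshold: float = 0.99,  # assumed cutoff, not specified in the abstract
) -> Tuple[List[str], List[str]]:
    """Split gene clusters into core and accessory by isolate frequency.

    presence maps a gene cluster name to a list of 0/1 calls, one per isolate.
    """
    core: List[str] = []
    accessory: List[str] = []
    for gene, calls in presence.items():
        if not calls:
            raise ValueError(f"no isolate calls for gene cluster {gene!r}")
        frequency = sum(calls) / len(calls)
        if frequency >= core_threshold:
            core.append(gene)
        else:
            accessory.append(gene)
    return core, accessory


# Hypothetical toy matrix: four gene clusters across four isolates.
toy_presence = {
    "geneA": [1, 1, 1, 1],
    "geneB": [1, 1, 0, 1],
    "geneC": [0, 0, 1, 0],
    "geneD": [1, 1, 1, 1],
}

core, accessory = split_core_accessory(toy_presence)
print(f"core: {len(core)}, accessory: {len(accessory)}")
```

Under this kind of definition, any annotation or clustering error that makes a gene appear absent in even one isolate moves it from the core to the accessory set, which is one way the predicted accessory genome can become inflated.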