P116
Building a Large Dataset of Genome Mutations Associated with Antibiotic Resistance in Mycobacterium tuberculosis.
J Al Akl(4,5) C Sola(1,2,3) C Guyeux(4) D Laiymani(4) C Abou Jaoude(5) Z Al Chami(5)
1: INSERM, Université Paris-Saclay; 2: IAME, Université Paris-Cité; 3: UMR1137, Université Sorbonne Paris-Nord; 4: FEMTO-ST Institute, CNRS, University of Marie & Louis Pasteur, Belfort; 5: Ticket Lab, Antonine University, Lebanon
Antibiotic-resistant Mycobacterium tuberculosis (Mtb) is a growing threat, and clinicians need reliable genetic markers to detect it early. We collected and curated a large set of Mtb genomes with associated drug-susceptibility results (over 15,000 isolates) by merging three sources: (i) public datasets such as CRyPTIC and a recent study published in Nature, (ii) a lightweight AI text-mining system that screened more than 20,000 papers to extract additional Sequence Read Archive (SRA) accession numbers linked to resistance, and (iii) the PATRIC database, which lists thousands of strains with tested drug responses. After matching strain names to SRA records at the National Center for Biotechnology Information, we built a single standardized resource that covers first-line, second-line, and newer drugs and captures both well-known and newly reported resistance mutations. Preliminary machine-learning runs already confirm classic genotype-phenotype links and point to emerging patterns, such as rare combinations of rpoB mutations that raise rifampicin minimum inhibitory concentrations (MICs). By sharing this dataset and code, we give mycobacteriologists a practical tool to validate suspected resistance mutations, track multidrug-resistant TB, and guide new treatment strategies, while retaining the speed that AI brings to the workflow.
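The source-merging step described above can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps the AI text-mining system for a plain regular expression over run accessions (prefixes SRR, ERR, DRR), and the function names `extract_run_accessions` and `merge_sources`, the record layout `(accession, drug, phenotype)`, and the sample accessions are all hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Run accessions issued by NCBI SRA, ENA and DDBJ start with SRR, ERR or DRR.
RUN_ACCESSION = re.compile(r"\b[SED]RR\d{6,}\b")


def extract_run_accessions(text):
    """Return the unique run accessions found in text, in order of appearance."""
    seen = []
    for match in RUN_ACCESSION.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def merge_sources(*sources):
    """Merge (accession, drug, phenotype) records from several sources.

    Returns (merged, conflicts). merged maps (accession, drug) to the
    phenotype when every source agrees; conflicts lists the keys for
    which sources report different phenotypes (hypothetical policy:
    conflicting calls are set aside rather than resolved).
    """
    calls = defaultdict(set)
    for records in sources:
        for accession, drug, phenotype in records:
            calls[(accession, drug)].add(phenotype)
    merged = {key: next(iter(vals)) for key, vals in calls.items() if len(vals) == 1}
    conflicts = sorted(key for key, vals in calls.items() if len(vals) > 1)
    return merged, conflicts


if __name__ == "__main__":
    # Made-up sentence and records, for illustration only.
    sentence = "Reads were deposited as ERR1234567 and SRR7654321 (see also ERR1234567)."
    print(extract_run_accessions(sentence))

    public = [("ERR1234567", "rifampicin", "R")]
    mined = [("ERR1234567", "rifampicin", "R"), ("SRR7654321", "isoniazid", "S")]
    patric = [("SRR7654321", "isoniazid", "R")]
    print(merge_sources(public, mined, patric))
```

Keying records on the run accession rather than the strain name is one way to avoid duplicates when the same isolate appears in several sources under different names; setting disagreeing phenotype calls aside for manual review is an assumption of this sketch, not a policy stated in the abstract.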
