OR15

A standard and framework to facilitate the comparison and benchmarking of methods for resistance prediction from genomic data

J Libiseller-Egger(1), J Phelan(1), T G Clark(1)

1: London School of Hygiene and Tropical Medicine (LSHTM)

With the ever-increasing wealth of genomic data and rapid advances in statistical learning methodology, many machine learning models for predicting drug resistance from genomic data have been published in recent years, including for tuberculosis. However, many publications release neither the source code used to generate the model nor an application for running the final model, which makes it difficult to reproduce results and to compare models and methodologies. We present a standardised framework for packaging prediction pipelines into Docker containers to facilitate systematic comparisons of prediction performance among machine learning models and tools based on direct association. The prediction pipeline is split into two steps: one Docker container for quality control and pre-processing of the sequencing data and one container for the actual prediction. This split allows comparison of both (i) completely disparate approaches (e.g. variant-based vs k-mer-based), by using different containers for both steps, and (ii) more similar pipelines (e.g. a random forest model and an elastic net using the same variants as input), by swapping out a single container. The approach was designed for maximum flexibility, so that the Docker containers can be used conveniently with workflow management systems or in stand-alone scripts. We provide example implementations for both cases and have created containers for several popular resistance prediction methods for Mycobacterium tuberculosis.
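
As an illustration of the two-step design in a stand-alone script, the following is a minimal sketch and not the framework's actual interface. The image names, the `/data` mount point, the file names, and the positional-argument convention of the containers are all hypothetical. Only the `docker run --rm -v` invocation itself is standard Docker usage.

```python
import os
import shlex

MOUNT = "/data"  # assumed mount point inside both containers


def docker_cmd(image, host_dir, args):
    """Return a `docker run` argument list with host_dir mounted at MOUNT."""
    volume = f"{os.path.abspath(host_dir)}:{MOUNT}"
    return ["docker", "run", "--rm", "-v", volume, image, *args]


def pipeline_cmds(preprocess_image, predict_image, host_dir, reads):
    """Build the commands for the two steps: pre-processing, then prediction."""
    features = f"{MOUNT}/features.tsv"   # hypothetical intermediate file
    result = f"{MOUNT}/prediction.tsv"   # hypothetical output file
    # Step 1: quality control and pre-processing of the sequencing data.
    step1 = docker_cmd(preprocess_image, host_dir, [f"{MOUNT}/{reads}", features])
    # Step 2: resistance prediction from the pre-processed features.
    step2 = docker_cmd(predict_image, host_dir, [features, result])
    return [step1, step2]


if __name__ == "__main__":
    # Two hypothetical prediction containers sharing one pre-processing container.
    for predictor in ("example/rf-predict", "example/enet-predict"):
        for cmd in pipeline_cmds("example/variant-qc", predictor, "run1", "sample.fastq.gz"):
            # To execute instead of print: subprocess.run(cmd, check=True)
            print(shlex.join(cmd))
```

Because the two steps communicate only through files in the shared directory, comparing two similar models amounts to changing the `predict_image` argument, while comparing disparate approaches means changing both image arguments.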