Comparison of 13 software tools to detect structural variation in Mycobacterium tuberculosis
Y Zhou(1,2) D Soolingen(2,3) R Anthony(3) Y Zhao(1)
1:China CDC; 2:Radboud University MC; 3:RIVM
Studying structural variation using whole genome sequencing data is challenging due to the lack of reliable tools. To detect deletions, we applied 13 software tools to 420 Mycobacterium tuberculosis strains sequenced on the Illumina platform. The intersection between the calls of different methods was used as a measure of quality. Based on the length distribution of the deletions they detected, the methods could be divided into 4 groups. Group 1 has 5 methods (breakdancer, cnvnator, delly, lumpy-smoove, tiddit), which detected mostly (61-95%) deletions with a length of 100-1000 bp. Group 2 has 4 methods (assemblytics, unimap, pindel, svaba), which detected mostly (68-88%) deletions shorter than 100 bp. Group 3 has 2 methods (manta, softsv), which detected a large proportion (88% and 30%, respectively) of large deletions (> 100 kbp). Group 4 consists only of bcftools, which detected only deletions shorter than 100 bp. Deletions longer than 100 kbp were excluded from subsequent analysis. Per sample, group 1 methods detected 10 to 50 deletions in most samples, group 2 methods 110 to 180, and group 3 methods 50 to 100. The per-sample deletion count for bcftools shows two peaks, one at about 10 deletions and the other at 60 deletions. The latter peak contains mostly (104/107) Beijing family strains. Group 2 methods show the highest concordance, as indicated by the intersection proportion (77-95%). Our study suggests that detection of deletions shorter than 100 bp is mostly reliable using bcftools, pindel, unimap, or svaba. For deletions longer than 10 kbp, these 13 methods are not reliable with Illumina data.
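The two measurements used above, deletion length class and the intersection proportion between two methods, can be illustrated with a minimal Python sketch. The abstract does not state how two calls were judged to be the same deletion, so the 50% reciprocal-overlap rule here is an assumption, as are the function names, the intermediate 1-100 kbp length class, and the example coordinates, which are hypothetical and not taken from the study.

```python
from typing import List, Tuple

# A deletion as (start, end) on a single reference genome, half-open coordinates.
Deletion = Tuple[int, int]


def length_bin(deletion: Deletion) -> str:
    """Assign a deletion to a length class.

    The <100 bp, 100-1000 bp and >100 kbp classes follow the abstract;
    the 1-100 kbp class is an assumed filler for everything in between.
    """
    length = deletion[1] - deletion[0]
    if length < 100:
        return "<100 bp"
    if length <= 1000:
        return "100-1000 bp"
    if length <= 100_000:
        return "1-100 kbp"
    return ">100 kbp"


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Return the smaller of the two overlap fractions (0.0 if disjoint)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[1] - a[0]), overlap / (b[1] - b[0]))


def intersection_proportion(
    calls_a: List[Deletion],
    calls_b: List[Deletion],
    min_overlap: float = 0.5,  # assumed matching threshold, not from the study
) -> float:
    """Fraction of calls in calls_a that are matched by some call in calls_b."""
    if not calls_a:
        return 0.0
    matched = sum(
        1
        for a in calls_a
        if any(reciprocal_overlap(a, b) >= min_overlap for b in calls_b)
    )
    return matched / len(calls_a)


# Hypothetical calls from two tools on one sample.
tool_a = [(1000, 1050), (5000, 5600), (20000, 20030)]
tool_b = [(1002, 1049), (5300, 6500), (90000, 90040)]

print([length_bin(d) for d in tool_a])
print(round(intersection_proportion(tool_a, tool_b), 3))
```

In this toy input only the first call of tool_a has a sufficiently overlapping partner in tool_b, so one of three calls is counted as shared. The measure is asymmetric: the proportion computed from the side of tool_b can differ, so a pairwise comparison of 13 tools would report both directions or normalise by the union of calls.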