A fast, reliable and easy method to detect within-species DNA contamination


Tiziano Dallavilla
Giuseppe Marceddu
Arianna Casadei
Luca De Antoni
Matteo Bertelli


NGS, contamination, diagnostics, bioinformatics, quality


Background and aim: Next-generation sequencing (NGS) is becoming the standard for clinical diagnosis. Several steps of NGS, such as DNA extraction, fragmentation, library preparation and amplification, require handling of samples, which makes the process susceptible to contamination. In diagnostic settings, contamination of a sample with DNA from the same species can lead to diagnostic errors. Here we propose a simple method to detect within-sample contamination based on analysis of the allele ratio (AR) of heterozygous single nucleotide polymorphisms (SNPs). Methods: A dataset of 38,000 heterozygous SNPs was used to estimate the AR distribution. The parameters of this reference distribution were then used to estimate the contamination probability of a sample. Validation was performed using 12 samples contaminated to different levels. Results: The method easily detects contamination of 20% or more. Its limit of detection is about 10%; below this threshold the number of false positives increases significantly. Conclusions: The method can be applied to any type of NGS analysis and is useful for quality control. Because it is fast and easy to implement, it is well suited for inclusion in NGS pipelines to improve data quality control and make results more robust.
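
The abstract does not say which distribution or statistical test the authors use, so the following Python sketch only illustrates the general idea under explicit assumptions. It assumes that the reference AR distribution is summarized by a mean and a standard deviation, and that a sample is scored by how many of its heterozygous SNPs fall far from the reference mean. All function names (`fit_reference`, `outlier_fraction`, `contamination_score`, `simulate_ars`), the 2-sigma cutoff and the toy read simulation are hypothetical choices made for this example, not the published method.

```python
import math
import random
import statistics


def fit_reference(reference_ars):
    """Return (mean, standard deviation) of allele ratios from clean samples."""
    return statistics.fmean(reference_ars), statistics.stdev(reference_ars)


def outlier_fraction(sample_ars, mu, sigma, z=2.0):
    """Fraction of heterozygous SNPs whose AR lies more than z sigmas from mu."""
    outside = sum(1 for ar in sample_ars if abs(ar - mu) > z * sigma)
    return outside / len(sample_ars)


def contamination_score(sample_ars, mu, sigma, z=2.0):
    """Z-score of the observed outlier fraction against the fraction expected
    if the sample followed a normal reference distribution (an assumption)."""
    n = len(sample_ars)
    expected = math.erfc(z / math.sqrt(2))  # two-sided normal tail mass
    observed = outlier_fraction(sample_ars, mu, sigma, z)
    std_err = math.sqrt(expected * (1 - expected) / n)
    return (observed - expected) / std_err


def simulate_ars(n_snps, depth, contamination, rng):
    """Toy model: at each heterozygous site the contaminating donor is
    homozygous reference, heterozygous or homozygous alternate at random."""
    ars = []
    for _ in range(n_snps):
        donor_alt = rng.choice([0.0, 0.5, 1.0])
        p_alt = (1 - contamination) * 0.5 + contamination * donor_alt
        alt_reads = sum(rng.random() < p_alt for _ in range(depth))
        ars.append(alt_reads / depth)
    return ars


if __name__ == "__main__":
    rng = random.Random(0)
    mu, sigma = fit_reference(simulate_ars(2000, 100, 0.0, rng))
    for level in (0.0, 0.1, 0.2):
        sample = simulate_ars(500, 100, level, rng)
        score = contamination_score(sample, mu, sigma)
        print(f"contamination={level:.0%} score={score:.1f}")
```

In this toy setting a clean sample should score near zero, and the score should grow as contamination shifts allele ratios away from 0.5. With real data, the allele ratios would come from per-site read counts in the variant calls, and the cutoff would need to be calibrated on validated samples.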



