A New String Edit Distance and Applications
Abstract
:1. Introduction
1.1. Background
1.2. Short Tandem Repeats
2. Restricted Forensic Levenshtein Distance
2.1. Levenshtein Distance Overview
2.2. Forensic Distance Overview
2.3. Details of the RFL Algorithm
2.3.1. A Perspective on Dynamic Programming
2.3.2. A Note on Weighted Edit Distance
2.3.3. Restricted Forensic Levenshtein Distance
2.3.4. How Restricted to Be?
2.3.5. Multiple Motifs and Time Complexity
2.4. Implementation of the RFL Distance
2.4.1. Building the First Row
- The insert cost of the character G
- The cost of stuttering forward GGC plus the cost of the weighted Levenshtein distance from GGC to G.
- The cost of stuttering forward ATTC plus the cost of the weighted Levenshtein distance from ATTC to G.
- + the cost of inserting the second character via pure insertion (i.e., the cost of inserting , using 0-based indexing)
- + the cost of inserting the second character via the stutter dictionary (for each stutter dictionary)
- + the cost of inserting the first two characters via the stutter dictionary (for each stutter dictionary)
Listing 1. Building the first row of the dynamic programming matrix in the RFL algorithm. |
2.4.2. Filling Out the Matrix
Listing 2. Building the dynamic programming matrix in the RFL algorithm. |
2.4.3. Practical Implementation
2.5. Validation of the Algorithm
3. Results and Discussion
3.1. Selecting Motifs for Each Locus
4. Conclusions
5. Disclaimer
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
STR | Short Tandem Repeat |
RFL | Restricted Forensic Levenshtein |
PCR | Polymerase Chain Reaction |
CE | Capillary Electrophoresis |
MPS | Massively Parallel Sequencing |
DoC | Depth of Coverage |
ACR | Allele Coverage Ratio |
Appendix A
Appendix A.1. Complexities
Appendix A.2. A Worked Example
A | C | G | T | C | G | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
A | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
C | 2 | 1 | 0 | 1 | 2 | 3 | 4 |
G | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
A | C | G | T | C | G | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 1 | 2 | 3 | 3 | |
A | 1 | 0 | 1 | 2 | 2 | 3 | 4 |
C | 2 | 1 | 0 | 1 | 2 | 2 | 3 |
G | 1 | 2 | 1 | 0 | 1 | 2 | 2 |
A | C | G | T | C | G | ||
---|---|---|---|---|---|---|---|
0 | 1 | 3 | 2.5 | 4.5 | 6.5 | 6.5 | |
A | 1 | 0 | 2 | 3 | 4 | 6 | 7 |
C | 1.5 | 1 | 0 | 1 | 3 | 4 | 5 |
G | 0.5 | 1.5 | 1 | 0 | 2 | 4 | 4 |
Appendix A.3. An Algorithmic Limitation
A | C | G | T | C | G | ||
---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | |
A | 1 | 0 | 1 | 2 | 3 | 4 | 5 |
C | 1 | 1 | 0 | 1 | 2 | 3 | 4 |
G | 3 | 2 | 1 | 0 | 1 | 2 | 3 |
- Cost to insert a C, T, or G: 10
- Cost to insert an A: 1
- Cost of T → C, A → C, or A → G: 10
- Cost of G → C, A → T, or T → G: 1
References
- Rinartha, K.; Suryasa, W.; Kartika, L.G.S. Comparative Analysis of String Similarity on Dynamic Query Suggestions. In Proceedings of the 2018 Electrical Power, Electronics, Communications, Controls and Informatics Seminar (EECCIS), Batu, Indonesia, 9–11 October 2018; pp. 399–404. [Google Scholar] [CrossRef]
- Alberga, C.N. String Similarity and Misspellings. Commun. Acm 1967, 10, 302–313. [Google Scholar] [CrossRef]
- Cheatham, M.; Hitzler, P. String Similarity Metrics for Ontology Alignment. In Proceedings of the The Semantic Web—ISWC 2013; Alani, H., Kagal, L., Fokoue, A., Groth, P., Biemann, C., Parreira, J.X., Aroyo, L., Noy, N., Welty, C., Janowicz, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 294–309. [Google Scholar]
- Chang, W.I.; Lawler, E.L. Sublinear approximate string matching and biological applications. Algorithmica 1994, 12, 327–344. [Google Scholar] [CrossRef]
- Alsmadi, I.; Nuser, M. String Matching Evaluation Methods for DNA Comparison. Int. J. Adv. Sci. Technol. 2012, 47, 13–32. [Google Scholar]
- Qi, X.; Wu, Q.; Zhang, Y.; Fuller, E.; Zhang, C.Q. A Novel Model for DNA Sequence Similarity Analysis Based on Graph Theory. Evol. Bioinform. 2011, 7, EBO.S7364. [Google Scholar] [CrossRef] [PubMed]
- Butler, J.M. The future of forensic DNA analysis. Philos. Trans. R. Soc. 2015, 370, 20140252. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Clayton, T.; Whitaker, J.; Maguire, C. Identification of bodies from the scene of a mass disaster using DNA amplification of short tandem repeat (STR) loci. Forensic Sci. Int. 1995, 76, 7–15. [Google Scholar] [CrossRef]
- Andelinović, S.; Martín, P.; Sutlović, D.; Erceg, I.; Huffine, E.; de Simón, L.F.; Albarrán, C.; Definis-Gojanović, M.; Fernández-Rodriguez, A.; García, P.; et al. DNA typing from skeletal remains: Evaluation of multiplex and megaplex STR systems. Croat. Med. J. 2001, 42, 260–266. [Google Scholar]
- Budowle, B.; Schmedes, S.E.; Wendt, F.R. Increasing the reach of forensic genetics with massively parallel sequencing. Forensic Sci. Med. Pathol. 2017, 13, 342–349. [Google Scholar] [CrossRef]
- Urquhart, A.; Kimpton, C.P.; Downes, T.J.; Gill, P. Variation in Short Tandem Repeat sequences—A survey of twelve microsatellite loci for use as forensic identification markers. Int. J. Leg. Med. 1994, 107, 13–20. [Google Scholar] [CrossRef]
- Alford, R.L. Rapid and efficient resolution of parentage by amplification of short tandem repeats. Am. J. Hum. Genet. 1994, 55, 190–195. [Google Scholar]
- Frégeau, C.; Fourney, R. DNA typing with fluorescently tagged short tandem repeats: A sensitive and accurate approach to human identification. BioTechniques 1993, 15, 100–119. [Google Scholar]
- Gettings, K.B.; Kiesler, K.M.; Faith, S.A.; Montano, E.; Baker, C.H.; Young, B.A.; Guerrieri, R.A.; Vallone, P.M. Sequence variation of 22 autosomal str loci detected by next generation sequencing. Forensic Sci. Int. Genet. 2016, 21, 15–21. [Google Scholar] [CrossRef] [Green Version]
- Brookes, C.; Bright, J.A.; Harbison, S.; Buckleton, J. Characterising stutter in forensic STR multiplexes. Forensic Sci. Int. Genet. 2012, 6, 58–63. [Google Scholar] [CrossRef]
- Raz, O.; Biezuner, T.; Spiro, A.; Amir, S.; Milo, L.; Titelman, A.; Onn, A.; Chapal-Ilani, N.; Tao, L.; Marx, T.; et al. Short tandem repeat stutter model inferred from direct measurement of in vitro stutter noise. Nucleic Acids Res. 2019, 47, 2436–2445. [Google Scholar] [CrossRef] [Green Version]
- Daunay, A.; Duval, A.; Baudrin, L.G.; Buhard, O.; Renault, V.; Deleuze, J.F.; How-Kit, A. Low temperature isothermal amplification of microsatellites drastically reduces stutter artifact formation and improves microsatellite instability detection in cancer. Nucleic Acids Res. 2019, 47, e141. [Google Scholar] [CrossRef]
- Brill, E.; Moore, R.C. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, China, 3–6 October 2000; pp. 286–293. [Google Scholar] [CrossRef] [Green Version]
- Boytsov, L. Indexing methods for approximate dictionary searching. ACM J. Exp. Algorithmics 2011, 16, A8–A9. [Google Scholar] [CrossRef]
- Ukkonen, E. Algorithms for approximate string matching. Inf. Control 1985, 64, 100–118. [Google Scholar] [CrossRef] [Green Version]
- Gao, X.; Xiao, B.; Tao, D.; Li, X. A survey of graph edit distance. Pattern Anal. Appl. 2009, 13, 113–129. [Google Scholar] [CrossRef]
- Fischer, A.; Suen, C.Y.; Frinken, V.; Riesen, K.; Bunke, H. Approximation of graph edit distance based on Hausdorff matching. Pattern Recognit. 2015, 48, 331–343. [Google Scholar] [CrossRef]
- Neuhaus, M.; Bunke, H. Automatic learning of cost functions for graph edit distance. Inf. Sci. 2007, 177, 239–247. [Google Scholar] [CrossRef]
- Darwiche, M.; Conte, D.; Raveaux, R.; T’Kindt, V. Graph edit distance: Accuracy of local branching from an application point of view. Pattern Recognit. Lett. 2020, 134, 20–28. [Google Scholar] [CrossRef] [Green Version]
- Petty, T. restricted-forensic-levenshtein. GitHub, 2021. Available online: https://github.com/taylorpetty/restricted-forensic-levenshtein (accessed on 1 May 2022).
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
- Zhao, C.; Sahni, S. String correction using the Damerau-Levenshtein distance. BMC Bioinform. 2019, 20, 277. [Google Scholar] [CrossRef]
- Hirschberg, D.S. Algorithms for the longest common subsequence problem. JACM 1977, 24, 664–675. [Google Scholar] [CrossRef] [Green Version]
- Wagner, R.A.; Lowrance, R. An Extension of the String-to-String Correction Problem. JACM 1975, 22, 177–183. [Google Scholar] [CrossRef]
- Rane, S.; Sun, W. Privacy preserving string comparisons based on Levenshtein distance. In Proceedings of the 2010 IEEE International Workshop on Information Forensics and Security, Seattle, WA, USA, 12–15 December 2010; pp. 1–6. [Google Scholar]
- Woerner, A.E.; King, J.L.; Budowle, B. Fast STR allele identification with strait razor 3.0. Forensic Sci. Int. Genet. 2017, 30, 18–23. [Google Scholar] [CrossRef]
- Su, D. weighted-levenshtein. Python Software Foundation, 2018. Available online: https://pypi.org/project/weighted-levenshtein/ (accessed on 1 May 2022).
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2016, arXiv:1603.00278. [Google Scholar]
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. [Google Scholar]
- Van der Loo, M. The stringdist package for approximate string matching. R J. 2014, 6, 111–122. [Google Scholar]
Operations Accounted for | Classical Levenshtein | Naive Forensic | RFL |
---|---|---|---|
Single-character edits | ✓ | ✓ | ✓ |
Pure motif stutter | ✓ | ✓ | |
Distinguishing stutter and single-character edits | ✓ | ||
Multi- and single-character edit interactions | ✓ |
Locus | Motif | Median LUS | Min LUS | Max LUS | Mean on-LUS Back Stutter Rate | Total Parents at Locus with Motif | Unique Parents at Locus with Motif |
---|---|---|---|---|---|---|---|
CSF1PO | TCTA | 11 | 4 | 15 | 0.0219 (163) | 1159 | 17 |
D10S1248 | GGAA | 14 | 8 | 19 | 0.0291 (190) | 1182 | 13 |
D12S391 | TAGA | 12 | 7 | 19 | 0.0279 (115) | 1287 | 84 |
D12S391 | CAGA | 6 | 3 | 10 | 0.0085 (67) | 1287 | 84 |
D13S317 | TATC | 12 | 7 | 16 | 0.0183 (105) | 1219 | 30 |
D13S317 | AATC | 2 | 2 | 3 | 0.0021 (56) | 803 | 10 |
D16S539 | GATA | 11 | 5 | 14 | 0.0214 (123) | 1207 | 17 |
D18S51 | AGAA | 15 | 9 | 24 | 0.0249 (125) | 1242 | 27 |
D19S433 | TCCT | 13 | 7 | 18 | 0.0218 (133) | 1194 | 22 |
D1S1656 | TATC | 13 | 9 | 17 | 0.0332 (141) | 1270 | 30 |
D1S1656 | AC | 6 | 5 | 6 | 0.0059 (26) | 1270 | 30 |
D21S11 | TATC | 12 | 7 | 15 | 0.0224 (96) | 1264 | 79 |
D21S11 | TGTC | 6 | 4 | 8 | 0.0021 (21) | 1264 | 79 |
D22S1045 | ATT | 12 | 5 | 16 | 0.0316 (233) | 1179 | 13 |
D2S1338 | GGAA | 13 | 8 | 17 | 0.0221 (80) | 1269 | 69 |
D2S1338 | GGCA | 7 | 3 | 9 | 0.0033 (34) | 1269 | 69 |
D2S441 | CTAT | 11 | 7 | 14 | 0.0188 (114) | 1205 | 24 |
D3S1358 | CTAT | 13 | 5 | 17 | 0.0309 (143) | 1238 | 26 |
D3S1358 | CTGT | 2 | 2 | 4 | 0.0035 (85) | 987 | 19 |
D5S818 | ATCT | 12 | 7 | 15 | 0.0257 (119) | 1237 | 31 |
D7S820 | CTAT | 10 | 6 | 13 | 0.0163 (96) | 1229 | 30 |
D8S1179 | CTAT | 12 | 8 | 15 | 0.0251 (115) | 1237 | 34 |
D8S1179 | CTGT | 2 | 2 | 3 | 0.0013 (51) | 20 | 6 |
FGA | GAAA | 14 | 9 | 19 | 0.0211 (109) | 1247 | 32 |
Penta D | GAAAA | 11 | 5 | 17 | 0.0051 (47) | 1236 | 18 |
Penta E | TTTTC | 12 | 5 | 25 | 0.0114 (89) | 1240 | 25 |
TH01 | AATG | 7 | 5 | 11 | 0.008 (63) | 1165 | 12 |
TPOX | AATG | 9 | 5 | 13 | 0.0113 (90) | 1140 | 14 |
VWA | ATAG | 12 | 3 | 16 | 0.0247 (146) | 1242 | 32 |
VWA | ACAG | 4 | 3 | 6 | 0.0023 (50) | 1242 | 32 |
VWA | GATG | 3 | 3 | 4 | 0.0019 (8) | 81 | 2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Petty, T.; Hannig, J.; Huszar, T.I.; Iyer, H. A New String Edit Distance and Applications. Algorithms 2022, 15, 242. https://doi.org/10.3390/a15070242
Petty T, Hannig J, Huszar TI, Iyer H. A New String Edit Distance and Applications. Algorithms. 2022; 15(7):242. https://doi.org/10.3390/a15070242
Chicago/Turabian StylePetty, Taylor, Jan Hannig, Tunde I. Huszar, and Hari Iyer. 2022. "A New String Edit Distance and Applications" Algorithms 15, no. 7: 242. https://doi.org/10.3390/a15070242
APA StylePetty, T., Hannig, J., Huszar, T. I., & Iyer, H. (2022). A New String Edit Distance and Applications. Algorithms, 15(7), 242. https://doi.org/10.3390/a15070242