TREASURE: Text Mining Algorithm Based on Affinity Analysis and Set Intersection to Find the Action of Tuberculosis Drugs against Other Pathogens
Abstract
1. Introduction
2. Literature Review
3. Materials and Methods
Algorithm 1 Outline of TREASURE model

Input: Abstracts from PubMed Database

Output: A filtered resultant set containing frequently occurring word patterns

1. Data preprocessing
    1.1 Tokenizing
    1.2 Remove punctuation and stop words
    1.3 Stemming
    1.4 TF-IDF calculation
2. Generate the relation records via affinity analysis
3. Based on the co-occurrence value of each element and the set intersection property, filter the resultant set out of the relation records
3.1. Data Preprocessing
Algorithm 2 TREASURE Data Preprocessing

Input: PubMed abstracts in a csv file

Output: Preprocessed data in a csv file

1. Loop through the entire csv file
    1.1 Perform tokenization on the document
    1.2 Remove punctuation
    1.3 Remove stop words
    1.4 Stem the tokenized words to get the root words
2. Calculate term frequency for the words in the document as given in Equation (1)
3. Calculate inverse document frequency as given in Equation (2)
4. Compute tf-idf values for the words as given in Equation (3)
5. Set a minimum threshold value for tf-idf
6. Open a new csv file
    6.1 For row_i, write each word of document_i whose tf-idf > threshold
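The preprocessing steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Equations (1) to (3) are not reproduced here, so the sketch assumes the common textbook forms tf = count / document length and idf = log(N / document frequency). The stop-word list, the `crude_stem` helper and the function names are hypothetical stand-ins, and csv reading and writing is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; the source does not list one).
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "is", "to", "against", "with"}


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps 1.1-1.4: tokenize, drop punctuation and stop words, then stem."""
    # The pattern keeps hyphenated terms such as "gram-negative" as one token.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(docs):
    """Steps 2-4: one {word: tf-idf} dict per document (assumed formulas)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def filter_by_threshold(docs, threshold):
    """Steps 5-6: keep, per document, the words whose tf-idf > threshold."""
    return [sorted(w for w, s in row.items() if s > threshold)
            for row in tf_idf(docs)]


abstracts = [
    "Isoniazid resistance and katG mutations.",
    "Linezolid activity against Gram-positive pathogens.",
]
rows = filter_by_threshold([preprocess(a) for a in abstracts], threshold=0.1)
print(rows)
```

Each inner list of `rows` corresponds to one row of the preprocessed csv file. A word that appears in every document has idf = 0 under the assumed formula, so it never passes a positive threshold.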
3.2. Generation of Relation Records Using Affinity Analysis
Algorithm 3 TREASURE Relation Records Generation via Affinity Analysis

Input: Preprocessed csv file; minimum number of items in a set (min_length); minimum co-occurrence value (min_support); minimum conditional probability (min_confidence)

Output: A JSON file containing a list of relation records with corresponding confidence, support and lift values

1. Read each item in the file
2. Calculate support for every item as given in Equation (4)
3. Insert every item whose support ≥ min_support into a frequent dataset
4. For each item in the frequent dataset, calculate confidence and lift values as given in Equations (5) and (6)
5. Insert into a JSON file every rule whose confidence and item count are greater than the corresponding thresholds
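A compact sketch of the rule generation in Algorithm 3 is given below. Equations (4) to (6) are not reproduced here, so the code assumes the standard definitions: support(X) is the fraction of documents containing X, confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). For brevity it only considers item pairs, so `min_length` is effectively fixed at 2, and it treats the confidence threshold as inclusive. The function names and record keys are illustrative, not taken from the paper.

```python
import json
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def relation_records(rows, min_support, min_confidence):
    """Mine pair rules base -> add (simplified: item pairs only)."""
    transactions = [set(row) for row in rows]
    items = sorted({word for t in transactions for word in t})
    # Step 3: keep only the single items that are frequent enough.
    frequent = [w for w in items if support([w], transactions) >= min_support]
    records = []
    for a, b in combinations(frequent, 2):
        pair_support = support([a, b], transactions)
        if pair_support < min_support:
            continue
        # Step 4: evaluate both rule directions for the pair.
        for base, add in ((a, b), (b, a)):
            confidence = pair_support / support([base], transactions)
            lift = confidence / support([add], transactions)
            if confidence >= min_confidence:  # inclusive threshold (assumption)
                records.append({
                    "items": sorted([a, b]),
                    "base": base,
                    "add": add,
                    "support": pair_support,
                    "confidence": confidence,
                    "lift": lift,
                })
    return records


rows = [
    ["hiv", "antiretroviral"],
    ["hiv", "antiretroviral", "cough"],
    ["hiv", "fever"],
    ["cough", "fever"],
]
records = relation_records(rows, min_support=0.5, min_confidence=0.7)
print(json.dumps(records, indent=2))  # step 5: serialize the records as JSON
```

A lift above 1 indicates that the two words co-occur more often than they would if they were independent, which is why lift is reported alongside support and confidence.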
3.3. Filtering Relation Records Based on Maximum Co-Occurrence Value and Set Intersection Property
Algorithm 4 TREASURE Relation Records Filtration based on Maximum Co-occurrence Value and Set Intersection

Input: A list of relation records from the JSON file as D

Output: A filtered resultant set S containing frequently occurring word patterns

1. Initialize an empty dictionary ED and an empty set ES
2. For each i in range (length (D))
    2.1 For each j in range (i + 1, length (D))
        2.1.1 Initialize an empty set IS
        2.1.2 Find D[i] ∩ D[j] and store the result in IS
        2.1.3 If length (IS) > 0 then
            • Store D[i] and D[j] as a key/value pair in ED, where the value associated with each key is a collection that can hold several items
            • Put D[i] and D[j] as individual elements in ES
3. Initialize an empty set S
4. While ES is not empty
    4.1 Find the element s with the maximum co-occurrence value in ES
    4.2 Add the element s to S
    4.3 Using ED, find all elements whose set intersection with s is non-empty
    4.4 Remove those elements and the element s from the set ES
5. Display the resultant set S
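The greedy filtering of Algorithm 4 can be illustrated with the short sketch below. It rests on a few assumptions: the "co-occurrence value" of a relation record is taken to be its support, ED is stored symmetrically (each of two overlapping records is listed under the other), and ties are broken alphabetically so that the result is deterministic. The function name and the input layout (a dict from item set to value) are hypothetical.

```python
def filter_records(records):
    """Greedy filter; `records` maps frozenset(items) -> co-occurrence value."""
    item_sets = list(records)
    overlaps = {}       # ED: item set -> item sets sharing at least one word
    candidates = set()  # ES: item sets that overlap with at least one other
    for i in range(len(item_sets)):
        for j in range(i + 1, len(item_sets)):
            a, b = item_sets[i], item_sets[j]
            if a & b:  # IS = D[i] ∩ D[j] is non-empty
                overlaps.setdefault(a, set()).add(b)
                overlaps.setdefault(b, set()).add(a)
                candidates.update((a, b))
    result = []  # S, kept in order of selection
    while candidates:
        # Step 4.1: highest co-occurrence value; ties broken by sorted items.
        best = max(candidates, key=lambda s: (records[s], sorted(s)))
        result.append(best)
        # Steps 4.3-4.4: drop the winner and everything that overlaps it.
        candidates -= overlaps[best] | {best}
    return result


records = {
    frozenset({"antiretroviral", "hiv"}): 0.013,
    frozenset({"hiv", "virus"}): 0.0125,
    frozenset({"co-infection", "hiv"}): 0.007,
    frozenset({"genes", "mutations"}): 0.019,
    frozenset({"katg", "mutations"}): 0.017,
    frozenset({"lymph", "nodes"}): 0.016,
}
selected = filter_records(records)
print([sorted(s) for s in selected])
```

Note that, following the steps as written, a record that shares no word with any other record (here the `lymph`/`nodes` pair) never enters ES and therefore never reaches S in this sketch.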
4. Results
4.1. Data Preprocessing
4.2. Generation of Relation Records Using Affinity Analysis
4.3. Filtering Relation Records Based on Maximum Co-Occurrence Value and Set Intersection Property
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Name of the TB Drug | Number of Document Abstracts |
---|---|
Pyrazinamide | 1566 |
Moxifloxacin | 1947 |
Ethambutol | 1841 |
Isoniazid | 1896 |
Rifampicin | 2209 |
Linezolid | 1919 |
Streptomycin | 1954 |
Amikacin | 1909 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“activity”, “bactericidal”] | 0.026708 | 0.626506 | 5.892789 |
[“crossover”, “subjects”] | 0.010272 | 0.526316 | 11.778584 |
[“aeruginosa”, “p”] | 0.014895 | 0.547170 | 6.018868 |
[“genitalium”, “mycoplasma”] | 0.010786 | 0.875 | 32.143868 |
[“Gram-negative”, “Gram-positive”] | 0.017976 | 0.538462 | 14.167360 |
[“intraocular”, “surgery”] | 0.011299 | 0.564103 | 15.690110 |
[“isoniazid”, “pyrazinamide”] | 0.018490 | 0.590164 | 14.363115 |
[“methicillin-resistant”, “mrsa”] | 0.010786 | 0.777778 | 33.651852 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“acinetobacter”, “baumannii”] | 0.0230487 | 0.619718 | 11.485847 |
[“aeruginosa”, “pseudomonas”] | 0.023573 | 0.75 | 10.527574 |
[“avium”, “clarithromycin”] | 0.007333 | 0.518518 | 13.747942 |
[“breakpoints”, “isolates”] | 0.009953 | 0.575758 | 2.818259 |
[“cancer”, “patients”] | 0.008905 | 0.586207 | 3.806357 |
[“cerebrospinal”, “fluid”] | 0.009429 | 0.9 | 44.053846 |
[“adverse”, “events”] | 0.008381 | 0.727273 | 30.181818 |
[“coli”, “mirabilis”] | 0.007333 | 0.518518 | 5.209746 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“abdominal”, “pain”] | 0.016603 | 0.619047 | 15.892271 |
[“activity”, “antimycobacterial”] | 0.014049 | 0.611112 | 5.800001 |
[“antiretroviral”, “hiv”] | 0.012771 | 0.625 | 12.083334 |
[“cerebrospinal”, “fluid”] | 0.011494 | 0.899999 | 29.987234 |
[“cough”, “fever”] | 0.015964 | 0.625 | 15.535714 |
[“fluoroquinolone”, “mdr-tb”] | 0.012132 | 0.542857 | 7.143817 |
[“genes”, “mutations”] | 0.019157 | 0.535714 | 6.554129 |
[“hepatic”, “liver”] | 0.009578 | 0.576923 | 12.906593 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“antiretroviral”, “hiv”] | 0.013036 | 0.705882 | 12.866627 |
[“biopsy”, “patient”] | 0.024986 | 0.511112 | 4.376537 |
[“cell”, “wall”] | 0.025529 | 0.72307 | 12.558345 |
[“fluid”, “meningitis”] | 0.008691 | 0.551724 | 18.809706 |
[“hiv”, “virus”] | 0.012493 | 0.511112 | 14.167360 |
[“imaging”, “resonance”] | 0.009234 | 0.679999 | 27.81955 |
[“katg”, “mutations”] | 0.017381 | 0.680851 | 11.292313 |
[“lymph”, “nodes”] | 0.016295 | 0.967741 | 32.992831 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“abscessus”, “mycobacterium”] | 0.011985 | 0.696969 | 17.369933 |
[“anti-tb”, “drugs”] | 0.011464 | 0.758621 | 10.109674 |
[“aureus”, “methicillin-sensitive”] | 0.008337 | 0.761904 | 6.356935 |
[“bedaquiline”, “drug-resistant”] | 0.013548 | 0.553191 | 11.538852 |
[“ca-mrsa”, “ha-mrsa”] | 0.007816 | 0.882352 | 48.378151 |
[“chromosome”, “isolates”] | 0.011985 | 0.766666 | 4.008811 |
[“drugs”, “second-line”] | 0.013548 | 0.684211 | 9.118055 |
[“faecium”, “isolates”] | 0.019801 | 0.5 | 2.614441 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“acetate”, “extract”] | 0.008701 | 0.548387 | 17.566367 |
[“chromosome”, “genome”] | 0.007676 | 0.576923 | 7.320179 |
[“ethyl”, “extract”] | 0.008188 | 0.533333 | 17.084153 |
[“ribosomally”, “synthesized”] | 0.007676 | 0.882352 | 42.051649 |
[“crystal”, “structure”] | 0.013306 | 0.722223 | 11.959511 |
[“coli”, “e”] | 0.012794 | 0.641025 | 20.202646 |
[“biosynthesis”, “inactivation”] | 0.006653 | 0.565217 | 5.549923 |
[“aureus”, “methicillin-resistant”] | 0.009211 | 0.692307 | 32.994371 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“activity”, “bactericidal”] | 0.009053 | 0.645161 | 11.681649 |
[“antiretroviral”, “hiv”] | 0.008601 | 0.542857 | 12.491369 |
[“bacteria”, “Gram-negative”] | 0.009053 | 0.540541 | 9.950451 |
[“central”, “nervous”] | 0.006791 | 0.833333 | 40.018115 |
[“chest”, “x-ray”] | 0.009053 | 0.606061 | 21.947342 |
[“co-infection”, “hiv”] | 0.007243 | 0.64 | 14.726667 |
[“codon”, “isolates”] | 0.006791 | 0.75 | 7.363333 |
[“injury”, “liver”] | 0.01177 | 0.604651 | 19.935439 |
Item Sets | Support | Confidence | Lift |
---|---|---|---|
[“activity”, “bactericidal”] | 0.010021 | 0.5 | 8.697247 |
[“care”, “hiv”] | 0.018459 | 0.5 | 6.236842 |
[“chest”, “x-ray”] | 0.008438 | 0.516129 | 17.168081 |
[“drug-induced”, “injury”] | 0.013713 | 0.541667 | 18.017543 |
[“fluoroquinolone”, “resistance”] | 0.009493 | 0.545454 | 3.262403 |
[“hepatic”, “liver”] | 0.010548 | 0.606061 | 13.207941 |
[“interferon-gamma”, “release”] | 0.009493 | 0.9 | 33.458823 |
[“nontuberculous”, “ntm”] | 0.008438 | 0.666666 | 48.615384 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Sampath, P.; Sridhar, N.S.; Shanmuganathan, V.; Lee, Y. TREASURE: Text Mining Algorithm Based on Affinity Analysis and Set Intersection to Find the Action of Tuberculosis Drugs against Other Pathogens. Appl. Sci. 2021, 11, 6834. https://doi.org/10.3390/app11156834