Modeling Topics in DFA-Based Lemmatized Gujarati Text
Abstract
1. Introduction
1.1. Motivation
1.2. Contribution of the Paper
- We propose a DFA-based lemmatization approach for Gujarati text.
- We show that lemmatization reduces the inflected words of Gujarati text to their base forms (lemmas), notably curtailing the vocabulary size.
- Topics are inferred faster on the lemmatized Gujarati corpus, and the interpretability of the discovered topics improves at the same time.
- We measure semantic coherence with three methods to analyze the discovered topics precisely.
- Additionally, we use two measures to quantify the distance between topics. We show that meaningful, precise topics lie far from overly general topics, and that their distance from the token distribution of the entire corpus is also larger than that of overly general topics.
1.3. Organization of the Paper
1.4. Scope of the Paper
2. Related Work
3. Deterministic Finite Automata (DFA) Based Gujarati Lemmatizer
4. Latent Dirichlet Allocation (LDA)
Algorithm 1: Generative algorithm for LDA.
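As a reading aid, here is a minimal sketch of the standard LDA generative process that Algorithm 1 refers to (per Blei et al. in the reference list); the topic count K, hyperparameters alpha and beta, vocabulary size, and document lengths are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 4, 1000            # number of topics, vocabulary size (illustrative)
alpha, beta = 0.1, 0.01   # symmetric Dirichlet priors (illustrative)
doc_lengths = [50, 80]    # tokens per document (illustrative)

# One word distribution phi_k over the vocabulary per topic.
phi = rng.dirichlet(np.full(V, beta), size=K)

corpus = []
for n_d in doc_lengths:
    theta = rng.dirichlet(np.full(K, alpha))  # topic proportions theta_d
    doc = []
    for _ in range(n_d):
        z = rng.choice(K, p=theta)    # topic assignment z_dn ~ Mult(theta_d)
        w = rng.choice(V, p=phi[z])   # word id w_dn ~ Mult(phi_z)
        doc.append(w)
    corpus.append(doc)
```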
5. Experimental Setup
5.1. Preprocessing and Vocabulary Size: An Analysis
5.2. Evaluation of the Proposed Lemmatizer Approach
5.3. Overly General Topics
Distance from a Global Corpus-Level Topic
5.4. Semantic Coherence Measurement Methods
5.4.1. Pointwise Mutual Information (PMI)
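PMI for a topic's word pair (w_i, w_j) follows the standard definition, with the probabilities estimated from document (or sliding-window) co-occurrence counts; a small smoothing constant is commonly added to the joint probability to avoid taking the log of zero:

$$\mathrm{PMI}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}$$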
5.4.2. Normalized Pointwise Mutual Information (NPMI)
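NPMI rescales PMI by the factor $-\log P(w_i, w_j)$, bounding the score in $[-1, 1]$: $-1$ when the words never co-occur, 0 under independence, and $+1$ when they always co-occur:

$$\mathrm{NPMI}(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log P(w_i, w_j)}$$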
5.4.3. Log Conditional Probability (LCP)
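LCP scores a pair by the log of the conditional probability of one word given the other:

$$\mathrm{LCP}(w_i, w_j) = \log \frac{P(w_i, w_j)}{P(w_j)} = \log P(w_i \mid w_j)$$

A small sketch that computes all three coherence scores for one word pair from document frequencies; the function name and the smoothing constant are illustrative assumptions, not from the paper:

```python
import math

def coherence_scores(df_i, df_j, df_ij, n_docs, eps=1e-12):
    """PMI, NPMI, and LCP for one word pair from document frequencies.

    df_i, df_j -- number of documents containing w_i (resp. w_j)
    df_ij      -- number of documents containing both words
    n_docs     -- total number of documents in the reference corpus
    """
    p_i, p_j = df_i / n_docs, df_j / n_docs
    p_ij = df_ij / n_docs + eps        # smoothing avoids log(0)
    pmi = math.log(p_ij / (p_i * p_j))
    npmi = pmi / -math.log(p_ij)
    lcp = math.log(p_ij / p_j)         # log P(w_i | w_j)
    return pmi, npmi, lcp
```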
5.5. Distance Measurement Methods
5.5.1. Hellinger Distance
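For two topic-word distributions $P = (p_1, \dots, p_V)$ and $Q = (q_1, \dots, q_V)$, the Hellinger distance is bounded in $[0, 1]$, with 0 for identical distributions:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{v=1}^{V} \left(\sqrt{p_v} - \sqrt{q_v}\right)^{2}}$$

(If the model is built with gensim, as cited in the references, matutils.hellinger implements this form.)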
5.5.2. Jaccard Distance
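Jaccard distance is set-based; applied to topics, it is typically computed between the sets of top-n words of each topic, so a value near 1 (as in the results below) means the compared topics share almost no top words:

$$d_J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$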
6. Results
6.1. Distance Measurement from Global Topic
6.2. The Semantic Coherence Score
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
2. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391.
3. Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, 30 July–1 August 1999; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1999; pp. 289–296.
4. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
5. Yau, C.K.; Porter, A.; Newman, N.; Suominen, A. Clustering scientific documents with topic modeling. Scientometrics 2014, 100, 767–786.
6. Rosen-Zvi, M.; Griffiths, T.; Steyvers, M.; Smyth, P. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, AB, Canada, 7–11 July 2004; AUAI Press: Arlington, VA, USA, 2004; pp. 487–494.
7. Steyvers, M.; Smyth, P.; Rosen-Zvi, M.; Griffiths, T. Probabilistic author-topic models for information discovery. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM: New York, NY, USA, 2004; pp. 306–315.
8. Lu, H.M.; Wei, C.P.; Hsiao, F.Y. Modeling healthcare data using multiple-channel latent Dirichlet allocation. J. Biomed. Inform. 2016, 60, 210–223.
9. Paul, M.J.; Dredze, M. Discovering health topics in social media using topic models. PLoS ONE 2014, 9, e103408.
10. Kayi, E.S.; Yadav, K.; Chamberlain, J.M.; Choi, H.A. Topic Modeling for Classification of Clinical Reports. arXiv 2017, arXiv:1706.06177.
11. Yao, L.; Zhang, Y.; Wei, B.; Wang, W.; Zhang, Y.; Ren, X.; Bian, Y. Discovering treatment pattern in Traditional Chinese Medicine clinical cases by exploiting supervised topic model and domain knowledge. J. Biomed. Inform. 2015, 58, 260–267.
12. Asuncion, H.U.; Asuncion, A.U.; Taylor, R.N. Software traceability with topic modeling. In Proceedings of the 2010 ACM/IEEE 32nd International Conference on Software Engineering, Cape Town, South Africa, 2–8 May 2010; Volume 1, pp. 95–104.
13. Chen, T.H.; Shang, W.; Nagappan, M.; Hassan, A.E.; Thomas, S.W. Topic-based software defect explanation. J. Syst. Softw. 2017, 129, 79–106.
14. Corley, C.S.; Damevski, K.; Kraft, N.A. Changeset-based topic modeling of software repositories. IEEE Trans. Softw. Eng. 2018, 46, 1068–1080.
15. Lukins, S.K.; Kraft, N.A.; Etzkorn, L.H. Bug localization using latent dirichlet allocation. Inf. Softw. Technol. 2010, 52, 972–990.
16. Řehůřek, R.; Sojka, P. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks; ELRA: Valletta, Malta, 2010; pp. 45–50.
17. Sun, X.; Li, B.; Leung, H.; Li, B.; Li, Y. Msr4sm: Using topic models to effectively mining software repositories for software maintenance tasks. Inf. Softw. Technol. 2015, 66, 1–12.
18. Thomas, S.W.; Adams, B.; Hassan, A.E.; Blostein, D. Studying software evolution using topic models. Sci. Comput. Program. 2014, 80, 457–479.
19. Tian, K.; Revelle, M.; Poshyvanyk, D. Using latent dirichlet allocation for automatic categorization of software. In Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, Vancouver, BC, Canada, 16–17 May 2009; pp. 163–166.
20. Vretos, N.; Nikolaidis, N.; Pitas, I. Video fingerprinting using Latent Dirichlet Allocation and facial images. Pattern Recognit. 2012, 45, 2489–2498.
21. Fernandez-Beltran, R.; Pla, F. Incremental probabilistic Latent Semantic Analysis for video retrieval. Image Vis. Comput. 2015, 38, 1–12.
22. Yuan, B.; Gao, X.; Niu, Z.; Tian, Q. Discovering Latent Topics by Gaussian Latent Dirichlet Allocation and Spectral Clustering. ACM Trans. Multimed. Comput. Commun. Appl. 2019, 15, 25.
23. Hu, P.; Liu, W.; Jiang, W.; Yang, Z. Latent topic model for audio retrieval. Pattern Recognit. 2014, 47, 1138–1143.
24. Gao, N.; Gao, L.; He, Y.; Wang, H.; Sun, Q. Topic detection based on group average hierarchical clustering. In Proceedings of the 2013 International Conference on Advanced Cloud and Big Data, Nanjing, China, 13–15 December 2013; pp. 88–92.
25. Kim, D.; Oh, A. Hierarchical Dirichlet scaling process. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 973–981.
26. Li, W.; Yin, J.; Chen, H. Supervised Topic Modeling Using Hierarchical Dirichlet Process-Based Inverse Regression: Experiments on E-Commerce Applications. IEEE Trans. Knowl. Data Eng. 2017, 30, 1192–1205.
27. Teh, Y.W.; Jordan, M.I.; Beal, M.J.; Blei, D.M. Sharing clusters among related groups: Hierarchical Dirichlet processes. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005; pp. 1385–1392.
28. Yang, S.; Yuan, C.; Hu, W.; Ding, X. A hierarchical model based on latent dirichlet allocation for action recognition. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 2613–2618.
29. Zhu, W.; Zhang, L.; Bian, Q. A hierarchical latent topic model based on sparse coding. Neurocomputing 2012, 76, 28–35.
30. Fang, A.; Macdonald, C.; Ounis, I.; Habel, P. Topics in tweets: A user study of topic coherence metrics for Twitter data. In Proceedings of the European Conference on Information Retrieval, Padua, Italy, 20 March 2016; Springer: Cham, Switzerland, 2016; pp. 492–504.
31. Weng, J.; Lim, E.P.; Jiang, J.; He, Q. Twitterrank: Finding topic-sensitive influential twitterers. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, New York, NY, USA, 3–6 February 2010; ACM: New York, NY, USA, 2010; pp. 261–270.
32. Bhattacharya, P.; Zafar, M.B.; Ganguly, N.; Ghosh, S.; Gummadi, K.P. Inferring user interests in the twitter social network. In Proceedings of the 8th ACM Conference on Recommender Systems, Foster City, CA, USA, 6–10 October 2014; ACM: New York, NY, USA, 2014; pp. 357–360.
33. Cordeiro, M. Twitter event detection: Combining wavelet analysis and topic inference summarization. In Proceedings of the Doctoral Symposium on Informatics Engineering; Faculdade de Engenharia da Universidade do Porto: Porto, Portugal, 2012; pp. 11–16.
34. Kim, Y.; Shim, K. TWILITE: A recommendation system for Twitter using a probabilistic model based on latent Dirichlet allocation. Inf. Syst. 2014, 42, 59–77.
35. Lansley, G.; Longley, P.A. The geography of Twitter topics in London. Comput. Environ. Urban Syst. 2016, 58, 85–96.
36. Ren, Y.; Wang, R.; Ji, D. A topic-enhanced word embedding for twitter sentiment classification. Inf. Sci. 2016, 369, 188–198.
37. Ma, B.; Zhang, D.; Yan, Z.; Kim, T. An LDA and synonym lexicon based approach to product feature extraction from online consumer product reviews. J. Electron. Commer. Res. 2013, 14, 304.
38. Hashimoto, K.; Kontonatsios, G.; Miwa, M.; Ananiadou, S. Topic detection using paragraph vectors to support active learning in systematic reviews. J. Biomed. Inform. 2016, 62, 59–65.
39. Kim, S.; Zhang, J.; Chen, Z.; Oh, A.; Liu, S. A hierarchical aspect-sentiment model for online reviews. Proc. AAAI Conf. Artif. Intell. 2013, 27, 526–533.
40. Schofield, A.; Magnusson, M.; Mimno, D. Pulling out the stops: Rethinking stopword removal for topic models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers; Association for Computational Linguistics: Valencia, Spain, 2017; Volume 2, pp. 432–436.
41. Brahmi, A.; Ech-Cherif, A.; Benyettou, A. Arabic texts analysis for topic modeling evaluation. Inf. Retr. 2012, 15, 33–53.
42. Lu, K.; Cai, X.; Ajiferuke, I.; Wolfram, D. Vocabulary size and its effect on topic representation. Inf. Process. Manag. 2017, 53, 653–665.
43. Paul, S.; Tandon, M.; Joshi, N.; Mathur, I. Design of a rule based Hindi lemmatizer. In Proceedings of the Third International Workshop on Artificial Intelligence, Soft Computing and Applications, Chennai, India, 27 July 2013; AIRCC Publishing Corporation: Tamil Nadu, India, 2013; pp. 67–74.
44. Chakrabarty, A.; Garain, U. Benlem (A Bengali lemmatizer) and its role in WSD. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2016, 15, 1–18.
45. Kumar, A.M.; Soman, K. AMRITA_CEN@FIRE-2014: Morpheme Extraction and Lemmatization for Tamil using Machine Learning. In Proceedings of the Forum for Information Retrieval Evaluation, Bangalore, India, 5–7 December 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 112–120.
46. Al-Shammari, E.; Lin, J. A novel Arabic lemmatization algorithm. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 24 July 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 113–118.
47. Al-Shammari, E.T.; Lin, J. Towards an error-free Arabic stemming. In Proceedings of the 2nd ACM Workshop on Improving Non English Web Searching, Napa Valley, CA, USA, 30 October 2008; pp. 9–16.
48. Roth, R.; Rambow, O.; Habash, N.; Diab, M.; Rudin, C. Arabic morphological tagging, diacritization, and lemmatization using lexeme models and feature ranking. In Proceedings of the ACL-08: HLT, Short Papers; Association for Computational Linguistics: Valencia, Spain, 2008; pp. 117–120.
49. Seddah, D.; Chrupała, G.; Çetinoğlu, Ö.; Van Genabith, J.; Candito, M. Lemmatization and lexicalized statistical parsing of morphologically rich languages: The case of French. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages; Association for Computational Linguistics: Valencia, Spain, 2010; pp. 85–93.
50. Piskorski, J.; Sydow, M.; Kupść, A. Lemmatization of Polish person names. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, Prague, Czech Republic, 29 June 2007; pp. 27–34.
51. Korenius, T.; Laurikkala, J.; Järvelin, K.; Juhola, M. Stemming and lemmatization in the clustering of Finnish text documents. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, Washington, DC, USA, 8–13 November 2004; pp. 625–633.
52. Kučera, K.; Stluka, M. Data processing and lemmatization in digitized 19th-century Czech texts. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, Madrid, Spain, 19–20 May 2014; pp. 193–196.
53. Eger, S.; Gleim, R.; Mehler, A. Lemmatization and morphological tagging in German and Latin: A comparison and a survey of the state-of-the-art. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 1507–1513.
54. Lazarinis, F. Lemmatization and stopword elimination in Greek Web searching. In Proceedings of the 2007 Euro American Conference on Telematics and Information Systems, Faro, Portugal, 14–17 May 2007; pp. 1–4.
55. Rakhimova, D.; Turganbayeva, A. Lemmatization of big data in the Kazakh language. In Proceedings of the 5th International Conference on Engineering and MIS, Astana, Kazakhstan, 6–8 June 2019; pp. 1–4.
56. Ozturkmenoglu, O.; Alpkocak, A. Comparison of different lemmatization approaches for information retrieval on Turkish text collection. In Proceedings of the 2012 International Symposium on Innovations in Intelligent Systems and Applications, Trabzon, Turkey, 2–4 July 2012; pp. 1–5.
57. Toporkov, O.; Agerri, R. On the Role of Morphological Information for Contextual Lemmatization. arXiv 2023, arXiv:2302.00407.
58. Hafeez, R.; Anwar, M.W.; Jamal, M.H.; Fatima, T.; Espinosa, J.C.M.; López, L.A.D.; Thompson, E.B.; Ashraf, I. Contextual Urdu Lemmatization Using Recurrent Neural Network Models. Mathematics 2023, 11, 435.
59. Gogoi, A.; Baruah, N. A Lemmatizer for Low-resource Languages: WSD and Its Role in the Assamese Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–22.
60. Freihat, A.A.; Abbas, M.; Bella, G.; Giunchiglia, F. Towards an optimal solution to lemmatization in Arabic. Procedia Comput. Sci. 2018, 142, 132–140.
61. Porter, M. The Porter Stemming Algorithm (1980). Available online: http://tartarus.org/martin/PorterStemmer (accessed on 9 September 2022).
62. Wikipedia Contributors. Gujarati Language—Wikipedia, the Free Encyclopedia. 2021. Available online: https://en.wikipedia.org/wiki/Gujarati_language (accessed on 4 December 2021).
63. Suba, K.; Jiandani, D.; Bhattacharyya, P. Hybrid inflectional stemmer and rule-based derivational stemmer for Gujarati. In Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP), Chiang Mai, Thailand, 8–13 November 2011; pp. 1–8.
64. Ameta, J.; Joshi, N.; Mathur, I. A lightweight stemmer for Gujarati. arXiv 2012, arXiv:1210.5486.
65. Aswani, N.; Gaizauskas, R.J. Developing Morphological Analysers for South Asian Languages: Experimenting with the Hindi and Gujarati Languages. In Proceedings of LREC, Valletta, Malta, 17–23 May 2010.
66. Popat, P.P.K.; Bhattacharyya, P. Hybrid stemmer for Gujarati. In Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010; p. 51.
67. Wallach, H.M.; Murray, I.; Salakhutdinov, R.; Mimno, D. Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning, Montreal, QC, Canada, 14–18 June 2009; pp. 1105–1112.
68. Lau, J.H.; Newman, D.; Baldwin, T. Machine Reading Tea Leaves: Automatically Evaluating Topic Coherence and Topic Model Quality. In Proceedings of EACL 2014; pp. 530–539.
69. Aletras, N.; Stevenson, M. Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013)–Long Papers; Association for Computational Linguistics: Potsdam, Germany, 2013; pp. 13–22.
Word (Topic 1) | Prob. | Word (Topic 2) | Prob. | Word (Topic 3) | Prob. | Word (Topic 4) | Prob.
---|---|---|---|---|---|---|---
Ballot | 0.073 | NYSE | 0.104 | Gym | 0.064 | Fire | 0.081 |
Voting | 0.071 | Predict | 0.082 | Guideline | 0.062 | Fundamental | 0.079 |
Poll | 0.069 | Profitability | 0.082 | Diet | 0.060 | Force | 0.077 |
Booth | 0.064 | NASDAQ | 0.073 | Fitness | 0.060 | Galaxy | 0.077 |
Campaign | 0.062 | Negotiable | 0.073 | Grains | 0.059 | Earth | 0.077 |
Election | 0.060 | Profit | 0.073 | Growth | 0.059 | Experimental | 0.075 |
Democracy | 0.057 | Peak | 0.068 | Doctor | 0.057 | Energy | 0.069 |
Leadership | 0.053 | Portfolio | 0.062 | Yoga | 0.055 | Explosion | 0.063 |
Elector | 0.050 | Price | 0.061 | Health | 0.055 | Star | 0.063 |
Ref. | Language | Application | Pub. Year | Approach | Accuracy (%) | No. of Tokens
---|---|---|---|---|---|---
[59] | Assamese | Word Sense Disambiguation | 2022 | Rule-based | 82 | 50,000 |
[60] | Arabic | Annotation | 2018 | Dictionary-based | 98.6 | 46,018 |
[44] | Bengali | Word Sense Disambiguation | 2016 | Rule-based | 96.99 | 6341 |
[43] | Hindi | Time Complexity | 2013 | Rule-based | 89.02 | 2500 |
[49] | French | POS Tagging | 2010 | Rule-based | 99.28 | 350,931
[55] | Kazakh | Information Retrieval | 2019 | Rule-based | N/A | N/A |
[48] | Arabic | Lexeme Models | 2008 | Feature Ranking | N/A | N/A
Word | Stemming | Lemmatization |
---|---|---|
Information | Inform | Information |
Informative | Inform | Informative |
Computers | Comput | Computer |
Feet | Feet | Foot |
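The table's contrast can be reproduced for English with off-the-shelf NLTK tools; this is only an illustration of the stemming/lemmatization distinction, not the paper's Gujarati pipeline, and the WordNet lemmatizer needs nltk.download('wordnet') to have been run first:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["information", "informative", "computers", "feet"]:
    # e.g., "computers" -> stem "comput" vs. lemma "computer"
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))
```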
Sr. No | Rule (Condition) | Characters to Check | Letters (from Word End) | Delete from Word | Add after Deletion | Example
---|---|---|---|---|---|---
1 | Last letter is ’ો’ | last 3 characters | ’ો’, ’ન’, ’ો’ | last 3 characters | NA | મહાપુરુષોનો = મહાપુરુષ
2 | Last letter is ’ો’ | last 4 characters | ’ો’, ’ન’, ’ઓ’, ’ા’ | last 4 characters | ’ો’ | છોકરાઓનો = છોકરો
3 | Last letter is ’ો’ | last 4 characters | ’ો’, ’ન’, ’ઓ’, not ’ા’ | last 3 characters | NA | છોકરીઓનો = છોકરી
4 | Last letter is ’ો’ | last 2 characters | ’ો’, ’ન’ | ’ો’; if the remaining word matches the n-ending words file, keep it; else also remove ’ન’ | NA | વાહનો = વાહન
5 | Last letter is ’ો’ | last 2 characters | ’ો’, ’ન’ | last 2 characters | NA | સીતાનો = સીતા
6 | Last letter is ’ી’ | last 3 characters | ’ી’, ’ન’, ’ો’ | last 3 characters | NA | મહાપુરુષોની = મહાપુરુષ
7 | Last letter is ’ી’ | last 4 characters | ’ી’, ’ન’, ’ઓ’, ’ા’ | last 4 characters | ’ો’ | છોકરાઓની = છોકરો
8 | Last letter is ’ી’ | last 4 characters | ’ી’, ’ન’, ’ઓ’, not ’ા’ | last 3 characters | NA | છોકરીઓની = છોકરી
9 | Last letter is ’ી’ | last 2 characters | ’ી’, ’ન’ | last 2 characters | NA | સીતાની = સીતા
10 | Last letter is ’ી’ | last 3 characters | ’ી’, ’થ’, ’ો’ | last 3 characters | NA | મહાપુરુષોથી = મહાપુરુષ
Sr. No | Rule (Condition) | Characters to Check | Letters (from Word End) | Delete from Word | Add after Deletion | Example
---|---|---|---|---|---|---
1 | Last letter is ’ુ’ | last 3 characters | ’ુ’, ’ય’, ’્’ | last 3 characters | ’વ’, ’ુ’ | બન્યુ = બનવુ
2 | Last letter is ’ુ’ | last 5 characters | ’ુ’, ’લ’, ’ે’, ’ય’, ’ા’ | last 5 characters | ’વ’, ’ુ’ | સંતાડાયેલુ = સંતાડવુ
3 | Last letter is ’ુ’ | last 2 characters | ’ુ’, ’વ’ | last 2 characters | NA | રમવુ = રમ
4 | Last letter is ’ુ’ | last 2 characters | ’ુ’, ’ત’ | last 2 characters | NA | રમતુ = રમ
5 | Last letter is ’ુ’ | last 2 characters | ’ુ’, ’મ’ | last 2 characters | NA | પાંચમુ = પાંચ
6 | Last letter is ’ુ’ | last 2 characters | ’ુ’, ’શ’ | last 3 characters | NA | આવીશુ = આવ
7 | Last letter is ’ુ’ | last 3 characters | ’ુ’, ’ન’, ’ો’ | last 3 characters | NA | મહાપુરુષોનુ = મહાપુરુષ
8 | Last letter is ’ુ’ | last 4 characters | ’ુ’, ’ન’, ’ઓ’, ’ા’ | last 4 characters | ’ો’ | છોકરાઓનુ = છોકરો
9 | Last letter is ’ુ’ | last 4 characters | ’ુ’, ’ન’, ’ઓ’, not ’ા’ | last 3 characters | NA | છોકરીઓનુ = છોકરી
10 | Last letter is ’ુ’ | last 2 characters | ’ુ’, ’ન’ | last 2 characters | NA | સીતાનુ = સીતા
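To make the rule tables concrete, here is a minimal sketch of the suffix-driven matching they describe. It encodes only two rules from the first table (rules 1 and 5 for the ’ો’-ending case) as an ordered, longest-suffix-first list, and it omits the n-ending-words file lookup of rule 4; it illustrates the rule format, not the paper's full DFA:

```python
# (suffix, replacement) pairs, longest suffix first, mirroring the
# right-to-left character checks in the tables above (subset only).
RULES = [
    ("ોનો", ""),  # rule 1: મહાપુરુષોનો -> મહાપુરુષ
    ("નો", ""),   # rule 5: સીતાનો -> સીતા
]

def lemmatize(word: str) -> str:
    """Return the word with the first matching suffix rule applied."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

assert lemmatize("મહાપુરુષોનો") == "મહાપુરુષ"
assert lemmatize("સીતાનો") == "સીતા"
```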
Preprocessing Steps | No. of Tokens | Vocabulary Size | % of Tokens in Vocabulary | TTR
---|---|---|---|---
After tokenization | 1,167,630 | 89,696 | 7.68 | 0.077
After stopword removal | 870,521 | 89,003 | 10.22 | 0.102
After punctuation removal | 746,292 | 88,987 | 11.92 | 0.119
Alphanumeric to alphabetic words | 746,292 | 86,271 | 11.56 | 0.116
After single-letter word removal | 620,133 | 86,098 | 13.88 | 0.139
After lemmatization | 620,133 | 50,043 | 8.07 | 0.081
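The last two columns follow from the first two: the type-token ratio (TTR) is the vocabulary size divided by the token count, and the percentage column is the same ratio times 100. A quick check against the first row:

```python
tokens, vocab = 1_167_630, 89_696  # "After tokenization" row
ttr = vocab / tokens               # type-token ratio
print(round(ttr, 3), round(100 * ttr, 2))  # 0.077  7.68
```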
Word | Probability | Word | Probability |
---|---|---|---|
’કે’ (Kē / Whether) | 0.003833333 | ’જો’ (Jō / If) | 0.000750000 |
’છે’ (Chhē / Is) | 0.025166667 | ’જ’ (Ja / Only) | 0.002500000 |
’જે’ (Jē / Whom) | 0.001083333 | ’ન’ (Na / No) | 0.001000000 |
’તે’ (Tē / That) | 0.001666667 | ’બે’ (Bē / Two) | 0.001166667 |
’એ’ (Ē / That) | 0.000833333 | ’તો’ (Tō / Then) | 0.001833333 |
’આ’ (Ā / This) | 0.004250000 | ’હું’ (Huṁ / I) | 0.000416667 |
’છો’ (Chho / Are) | 0.000833333 | ’શ્રી’ (Shree / Mr.) | 0.001583333
Word | Frequency | Word | Frequency |
---|---|---|---|
ટેક્સ (Ṭēksa/Tax) | 231 | જાહેર (Jāhēra/Public) | 67 |
વરસાદ (Varasāda/Rain) | 191 | પ્રોજેક્ટ (Prōjēkṭa/Project) | 57 |
ગુજરાત (Gujarāta/Gujarat) | 189 | રકમ (Rakama/Amount) | 46 |
જાહેર (Jāhēra/Public) | 182 | જમીન (Jamin/Soil) | 45 |
સરકાર (Sarakāra/Government) | 170 | યોગ (Yoga/Yoga) | 39 |
યોગ (Yōga/Yoga) | 147 | ગુજરાતમાં (Gujarātamām/In Gujarat) | 35 |
શરૂ (Śarū/Start) | 138 | પ્લાન (Plan/Plan) | 31 |
ભારતીય (Bhāratīya/Indian) | 136 | વર્ષે (Varṣē/Year) | 30 |
ભારત (Bhārata/India) | 126 | શક્તિ (Śakti/Power) | 27 |
બુલેટ (Bulēṭa/Bullet) | 121 | એફઆઇઆઈ (Ēpha’ā’ī’ā’ī/FII) | 25 |
પાણી (Pāṇī/Water) | 118 | સમયસર (Samaysara/On time) | 25 |
અમદાવાદ (Amadāvāda/Ahmedabad) | 114 | મહત્ત્વ (Mahattva/Importance) | 24
પ્રવેશ (Pravēśa/Entry) | 112 | વિધુર (Vidhura/Widower) | 19 |
તલાક (Talāka/Divorce) | 112 | મુંબઈ (Mumbaī/Mumbai) | 18
સ્માર્ટફોન (Smārṭaphōna/Smartphone) | 108 | ⋯ | ⋯ |
નિર્ણય (Nirṇaya/Decision) | 107 | ⋯ | ⋯ |
બાહુબલી (Bāhubalī/Bahubali) | 106 | ⋯ | ⋯ |
Word | Frequency | Word | Frequency |
---|---|---|---|
ધર્મ (Dharma/Religion) | 86 | સાક્ષાત (Sākṣāta/Confirmed) | 18 |
આનંદ (Ānanda/Happiness) | 41 | પૂજાપાઠ (Pūjāpāṭha/Worship) | 18 |
ઈશ્વર (Īśvara/God) | 37 | જાગૃતિ (Jāgrti/Awareness) | 18 |
યહુદી (Yahudī/Jew) | 32 | સાંપ્રદાયિક (Sāmpradāyika/Sectarian) | 18 |
કર્મકાંડ (Karmakāṇḍa/Ritual) | 22 | પ્રેમ (Prēma/Love) | 18 |
નૈતિક (Naitika/Moral) | 21 | ખ્રિસ્તી (Khristī/Christian) | 16 |
શ્રધ્ધા (Śrad’dhā/Devotion) | 21 | ઇસ્લામ (Islāma/Islam) | 16 |
આધ્યાત્મિક (Ādhyātmika/Spiritual) | 20 | જીવન (Jīvana/Life) | 9 |
ધાર્મિક (Dhārmika/Religious) | 20 | પ્રત્યે (Pratyē/Towards) | 8 |
મુલ્યો (Mulyō/Values) | 19 | માણસ (Māṇasa/Human) | 8 |
No. of Tokens | Vocabulary Size | Inference Time (s) | Hellinger (Unlemmatized) | Jaccard (Unlemmatized) | Hellinger (Lemmatized) | Jaccard (Lemmatized)
---|---|---|---|---|---|---
604,389 | 85,463 | 33.14 | 0.476 | 0.970 | 0.491 | 0.993 |
561,648 | 42,722 | 29.63 | 0.495 | 0.968 | 0.546 | 0.998 |
531,870 | 27,533 | 26.77 | 0.481 | 0.982 | 0.520 | 0.996 |
512,085 | 21,238 | 22.15 | 0.489 | 0.982 | 0.517 | 0.999 |
496,373 | 17,310 | 18.92 | 0.495 | 0.982 | 0.528 | 1.000 |
483,108 | 14,657 | 16.55 | 0.492 | 0.983 | 0.526 | 0.999 |