Leveraging Generative AI in Short Document Indexing
Abstract
1. Introduction
2. Background and Literature Review
2.1. The Indexing Task
2.2. Content-Based Indexing and Document Representation in IR
3. Materials and Methods
3.1. Motivation
3.2. Research Proposal
- Data selection: in this step, a list of tokens and concepts is selected for each document using both traditional preprocessing and a generative AI. Depending on the LLM used, the suggested terms may differ from the terms that appear in the document, so we combine both techniques to minimize the loss of relevant terms.
- Data augmentation: the list of tokens produced by preprocessing is augmented with the new terms suggested by the generative AI. Duplicate terms are removed, and each compound word is treated as a single term.
- Data storage: the resulting list of index terms is stored in an inverted index structure.
- Selection of a generative AI: We benchmarked several LLMs on their term-suggestion capabilities and chose the one best suited to our purpose.
- Selection of IR strategies: To assess the effectiveness of the index terms during document retrieval experiments, we chose two types of term vectors to perform the retrieval task: traditional vectors and enriched term vectors. These vectors were used to match query terms with document terms during experimentation.
- Data selection: This step consisted of selecting short documents from domain-specific fields to which the GenAI-based indexing approach was applied.
- Experimentation for indexing and retrieval tasks: To validate the core concept of the proposal (i.e., enhancing traditional indexing with LLM-generated terms), we compared the performance of the GenAI-based indexing approach against other indexing methods.
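The three indexing steps listed above (data selection, data augmentation, data storage) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the tokenization rule, and the function names are all hypothetical, and `canned_suggester` is a stand-in that returns fixed terms where a real system would call a generative AI.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Tiny illustrative stop-word list (assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "up", "with"}


def preprocess(text: str) -> List[str]:
    """Traditional preprocessing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def build_index_terms(text: str, suggest: Callable[[str], Iterable[str]]) -> List[str]:
    """Merge document tokens with suggested terms and remove duplicates.

    A compound suggestion such as "information retrieval" is kept as one term.
    """
    tokens = preprocess(text)
    # Normalize case and inner whitespace of each suggested term.
    suggested = [" ".join(term.lower().split()) for term in suggest(text)]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(tokens + [s for s in suggested if s]))


def build_inverted_index(
    docs: Dict[str, str], suggest: Callable[[str], Iterable[str]]
) -> Dict[str, Set[str]]:
    """Map each index term to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in build_index_terms(text, suggest):
            index[term].add(doc_id)
    return dict(index)


def canned_suggester(text: str) -> List[str]:
    """Hypothetical stand-in for the generative AI call; returns fixed terms."""
    return ["information retrieval", "Inverted Index", "retrieval"]


docs = {"d1": "Inverted index structures speed up document retrieval."}
index = build_inverted_index(docs, canned_suggester)
print(sorted(index))
```

Passing the suggester as a parameter keeps the sketch independent of any particular LLM, so the same merging and storage logic could be reused with whichever tool a benchmark selects.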
3.3. Generative AI Selection
3.3.1. AI Tools’ Characteristics
3.3.2. Benchmark for Key Term Suggestion
3.3.3. Conclusion of the Benchmark
3.4. IR Experimentation Strategies
3.5. Data Selection
4. Experimental Results
4.1. Indexing Results
4.2. Retrieval Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
Field | Number of Words | Link |
---|---|---|
Electronics | 132 | https://www.mdpi.com/2079-9292/12/6/1507 (accessed on 10 June 2024) |
Electronics | 184 | https://www.mdpi.com/2079-9292/12/6/1505 (accessed on 10 June 2024) |
Electronics | 223 | https://www.mdpi.com/2079-9292/12/6/1502 (accessed on 10 June 2024) |
Electronics | 200 | https://www.mdpi.com/2079-9292/12/5/1262 (accessed on 10 June 2024) |
Electronics | 219 | https://www.mdpi.com/2079-9292/12/5/1247 (accessed on 10 June 2024) |
Biomedicines | 140 | https://www.mdpi.com/2227-9059/11/1/211 (accessed on 10 June 2024) |
Biomedicines | 140 | https://www.mdpi.com/2227-9059/11/1/208 (accessed on 10 June 2024) |
Biomedicines | 211 | https://www.mdpi.com/2227-9059/11/1/193 (accessed on 10 June 2024) |
Biomedicines | 224 | https://www.mdpi.com/2227-9059/11/1/189 (accessed on 10 June 2024) |
Biomedicines | 203 | https://www.mdpi.com/2227-9059/11/1/184 (accessed on 10 June 2024) |
Mathematics | 214 | https://www.mdpi.com/2227-7390/11/1/254 (accessed on 10 June 2024) |
Mathematics | 189 | https://www.mdpi.com/2227-7390/11/1/245 (accessed on 10 June 2024) |
Mathematics | 208 | https://www.mdpi.com/2227-7390/11/1/235 (accessed on 10 June 2024) |
Mathematics | 227 | https://www.mdpi.com/2227-7390/11/3/783 (accessed on 10 June 2024) |
Mathematics | 226 | https://www.mdpi.com/2227-7390/11/3/768 (accessed on 10 June 2024) |
References
- Guo, J.; Cai, Y.; Fan, Y.; Sun, F.; Zhang, R.; Cheng, X. Semantic Models for the First-Stage Retrieval: A Comprehensive Review. ACM Trans. Inf. Syst. 2021, 40, 1–42.
- Carrillo, M.; Villatoro-Tello, E.; Lopez-Lopez, A.; Eliasmith, C.; Montes-y-Gomez, M.; Villaseñor-Pineda, L. Representing Context Information for Document Retrieval. In Proceedings of the International Conference on Flexible Query Answering Systems, Roskilde, Denmark, 26–28 October 2009; pp. 239–250.
- Reddy, Y.V.B.; Reddy, S.N.; Reddy, S.S.S.N. Efficient Web-Information Retrieval Systems and Web Search Engines: A Survey. Int. J. Mech. Eng. Technol. 2017, 25, 123–125.
- Tang, Y.; Zhang, R.; Guo, J.; Chen, J.; Zhu, Z.; Wang, S.; Yin, D.; Cheng, X. Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies. In Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 4904–4913.
- Asim, M.N.; Wasim, M.; Khan, M.U.G.; Mahmood, N.; Mahmood, W. The Use of Ontology in Retrieval: A Study on Textual, Multilingual, and Multimedia Retrieval. IEEE Access 2019, 7, 21662–21686.
- NIST TREC Data. Available online: https://trec.nist.gov/data.html (accessed on 20 July 2024).
- Efron, M.; Organisciak, P.; Fenlon, K. Improving Retrieval of Short Texts through Document Expansion. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Portland, OR, USA, 12–16 August 2012; pp. 911–920.
- Kozlowski, M.; Rybinski, H. Clustering of Semantically Enriched Short Texts. J. Intell. Inf. Syst. 2019, 53, 69–92.
- Bouzid, S. A Bottom-up Semantic Mapping Approach for Exploring Manufacturing Information Resources in Industry. Comput. Syst. Sci. Eng. 2017, 32, 243–256.
- Jiang, Y. Semantically-Enhanced Information Retrieval Using Multiple Knowledge Sources. Clust. Comput. 2020, 23, 2925–2944.
- Tang, M.; Chen, J.; Chen, H.; Xu, Z.; Wang, Y.; Xie, M.; Lin, J. An Ontology-Improved Vector Space Model for Semantic Retrieval. Electron. Libr. 2020, 38, 919–942.
- Ormeño, P.; Mendoza, M.; Valle, C. Topic Models Ensembles for Ad-Hoc Information Retrieval. Information 2021, 12, 360.
- Yu, B. Research on Information Retrieval Model Based on Ontology. EURASIP J. Wirel. Commun. Netw. 2019, 1, 30.
- Jain, S.; Seeja, K.R.; Jindal, R. A Fuzzy Ontology Framework in Information Retrieval Using Semantic Query Expansion. Int. J. Inf. Manag. Data Insights 2021, 1, 100009.
- Boukhari, K.; Omri, M.N. DL-VSM Based Document Indexing Approach for Information Retrieval. J. Ambient. Intell. Humaniz. Comput. 2023, 14, 5383–5394.
- Sharma, A.; Kumar, S. Machine Learning and Ontology-Based Novel Semantic Document Indexing for Information Retrieval. Comput. Ind. Eng. 2023, 176, 108940.
- Aliwy, A.; Abbas, A.; Alkhayyat, A. NERWS: Towards Improving Information Retrieval of Digital Library Management System Using Named Entity Recognition and Word Sense. Big Data Cogn. Comput. 2021, 5, 59.
- Shakeri, M.; Sadeghi-Niaraki, A.; Choi, S.M.; AbuHmed, T. AR Search Engine: Semantic Information Retrieval for Augmented Reality Domain. Sustainability 2022, 14, 15681.
- Sunny, S.K.; Angadi, M. Evaluating the Effectiveness of Thesauri in Digital Information Retrieval Systems. Electron. Libr. 2018, 36, 55–70.
- Bedmar, I.S.; Martínez, P.; Martín, A.C. Search and Graph Database Technologies for Biomedical Semantic Indexing: Experimental Analysis. JMIR Med. Inform. 2017, 5, e7059.
- Hussain, M.J.; Bai, H.; Wasti, S.H.; Huang, G.; Jiang, Y. Evaluating Semantic Similarity and Relatedness between Concepts by Combining Taxonomic and Non-Taxonomic Semantic Features of WordNet and Wikipedia. Inf. Sci. 2023, 625, 673–699.
- Azad, H.K.; Deepak, A. A New Approach for Query Expansion Using Wikipedia and WordNet. Inf. Sci. 2019, 492, 147–163.
- Asudani, D.S.; Nagwani, N.K.; Singh, P. Impact of Word Embedding Models on Text Analytics in Deep Learning Environment: A Review. Artif. Intell. Rev. 2023, 56, 10345–10425.
- Ahmed, S.F.; Alam, M.S.B.i.n.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Shawkat Ali, A.B.M.; Gandomi, A.H. Deep Learning Modelling Techniques: Current Progress, Applications, Advantages, and Challenges; Springer: Dordrecht, The Netherlands, 2023; Volume 56, ISBN 0123456789.
- Mhawi, D.N.; Oleiwi, H.W.; Saeed, N.H.; Al-Taie, H.L. An Efficient Information Retrieval System Using Evolutionary Algorithms. Network 2022, 2, 583–605.
- Wang, J.; Yang, Z.; Cheng, Z. Deep Pre-Training Transformers for Scientific Paper Representation. Electronics 2024, 13, 2123.
- Surden, H. ChatGPT, AI Large Language Models, and Law. Fordham Law Rev. 2023, 92, 1941–1972.
- Anthropic Claude 3.5 Sonnet. Available online: https://www.anthropic.com/news/claude-3-5-sonnet (accessed on 14 August 2024).
- Pichai, S.; Hassabis, D. Our Next-Generation Model: Gemini 1.5. Available online: https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#gemini-15 (accessed on 25 July 2024).
- Golub, K. Automated Subject Indexing: An Overview. Cat. Classif. Q. 2021, 59, 702–719.
- Jurafsky, D.; Martin, J.H. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st ed.; Prentice Hall PTR: Hoboken, NJ, USA, 2000; ISBN 0130950696.
- Singh, J.; Gupta, V. A Systematic Review of Text Stemming Techniques. Artif. Intell. Rev. 2017, 48, 157–217.
- Balakrishnan, V.; Humaidi, N.; Lloyd-Yemoh, E. Improving Document Relevancy Using Integrated Language Modeling Techniques. Malays. J. Comput. Sci. 2016, 29, 45–55.
- Salton, G.; Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Inf. Process. Manag. 1988, 24, 513–523.
- Robertson, S.E.; Walker, S.; Beaulieu, M.M.; Gatford, M.; Payne, A. Okapi at TREC-4. In Proceedings of the 4th Text Retrieval Conference, Gaithersburg, MD, USA, 1–3 November 1995; pp. 73–97.
- Desai, D.; Ghadge, A.; Wazare, R.; Bagade, J. A Comparative Study of Information Retrieval Models for Short Document Summaries. Lect. Notes Data Eng. Commun. Technol. 2022, 75, 547–562.
- Boukhari, K.; Omri, M.N. Approximate Matching-Based Unsupervised Document Indexing Approach: Application to Biomedical Domain. Scientometrics 2020, 124, 903–924.
- Gabsi, I.; Kammoun, H.; Souidi, D.; Amous, I. MeSH-Based Semantic Weighting Scheme to Enhance Document Indexing: Application on Biomedical Document Classification. J. Inf. Knowl. Manag. 2024, 2450035.
- Mouriño García, M.A.; Pérez Rodríguez, R.; Anido Rifón, L. Wikipedia-Based Cross-Language Text Classification. Inf. Sci. 2017, 406–407, 12–28.
- Antonio Mouriño García, M.; Pérez Rodríguez, R.; Anido Rifón, L. Leveraging Wikipedia Knowledge to Classify Multilingual Biomedical Documents. Artif. Intell. Med. 2018, 88, 37–57.
- Chandwani, G.; Ahlawat, A.; Dubey, G. An Approach for Document Retrieval Using Cluster-Based Inverted Indexing. J. Inf. Sci. 2023, 49, 726–739.
- Inje, B.; Nagwanshi, K.K.; Rambola, R.K. An Efficient Document Information Retrieval Using Hybrid Global Search Optimization Algorithm with Density Based Clustering Technique. Cluster Comput. 2023, 27, 689–705.
- Costa, W.; Pedrosa, G.V. A Textual Representation Based on Bag-of-Concepts and Thesaurus for Legal Information Retrieval. In Proceedings of the Symposium on Knowledge Discovery, Mining and Learning, Brasilia, Brazil, 28 November–1 December 2022; pp. 114–121.
- Ouadif, L.; El Ayachi, R.; Biniz, M. A New Approach of Documents Indexing Using Subject Modelling and Summarization. J. Phys. Conf. Ser. Int. Conf. Math. Data Sci. (ICMDS) 2020, 1743, 012032.
- Khalloufi, R.; El Ayachi, R.; Biniz, M.; Fakir, M.; Sarfraz, M. An Approach of Documents Indexing Using Summarization. In Critical Approaches to Information Retrieval Research; Sarfraz, M., Ed.; IGI Global: Hershey, PA, USA, 2020; pp. 78–86.
- Bostan, S.; Bidoki, A.M.Z.; Pajoohan, M.-R. Improving Ranking Using Hybrid Custom Embedding Models on Persian Web. J. Web Eng. 2023, 2, 797–820.
- Gang, L.; Huanbin, Z.; Tongzhou, Z. Document Vector Representation with Enhanced Features Based on Doc2VecC. Mob. Netw. Appl. 2023, 1–10.
- Mikolov, T.; Corrado, G.; Chen, K.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013, arXiv:1301.3781.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 1 January 2014; pp. 1532–1543.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240.
- Lee, H.; Lee, S.; Lee, I.; Nam, H. AMP-BERT: Prediction of Antimicrobial Peptide Function Based on a BERT Model. Protein Sci. 2023, 32, 1–13.
- Müller, M.; Salathé, M.; Kummervold, P.E. COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. Front. Artif. Intell. 2023, 6, 1023281.
- Dai, Z.; Callan, J. Context-Aware Term Weighting for First Stage Passage Retrieval. In Proceedings of the SIGIR ’20: 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Xi’an, China, 25–30 July 2020; pp. 1533–1536.
- Suominen, O. Annif: DIY Automated Subject Indexing Using Multiple Algorithms. Lib. Q. J. Assoc. Eur. Res. Libr. 2019, 29, 1–25.
- Suominen, O.; Inkinen, J.; Lehtinen, M. Annif and Finto AI: Developing and Implementing Automated Subject Indexing. JLIS.it 2022, 13, 265–282.
- Liu, E.; Cui, C.; Zheng, K.; Neubig, G. Testing the Ability of Language Models to Interpret Figurative Language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 4437–4452.
- Yenduri, G.; Ramalingam, M.; Selvi, G.C.; Supriya, Y.; Srivastava, G.; Maddikunta, P.K.R.; Raj, G.D.; Jhaveri, R.H.; Prabadevi, B.; Wang, W.; et al. GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions. IEEE Access 2024, 12, 54608–54649.
- Collins, E.; Ghahramani, Z. LaMDA: Our Breakthrough Conversation Technology. Available online: https://blog.google/technology/ai/lamda/ (accessed on 26 July 2024).
- Meta Introducing Llama 3.1: Our Most Capable Models to Date. Available online: https://ai.meta.com/blog/meta-llama-3-1/ (accessed on 29 July 2024).
- Wang, L.; Chen, R. Knowledge-Guided Prompt Learning for Few-Shot Text Classification. Electronics 2023, 12, 1486.
- Saleem, M.; Kim, J. Intent Aware Data Augmentation by Leveraging Generative AI for Stress Detection in Social Media Texts. PeerJ Comput. Sci. 2024, 10, 1–22.
- Alderazi, F.; Algosaibi, A.; Alabdullatif, M. Generative Artificial Intelligence in Topic-Sentiment Classification for Arabic Text: A Comparative Study with Possible Future Directions. PeerJ Comput. Sci. 2024, 10, 1–27.
- Lu, R.S.; Lin, C.C.; Tsao, H.Y. Empowering Large Language Models to Leverage Domain-Specific Knowledge in E-Learning. Appl. Sci. 2024, 14, 5264.
- Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Web Application for Retrieval-Augmented Generation: Implementation and Testing. Electronics 2024, 13, 1361.
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, New York, NY, USA, 6–12 December 2020; pp. 9459–9474.
- Li, R.; Liu, M.; Xu, D.; Gao, J.; Wu, F.; Zhu, L. A Review of Machine Learning Algorithms for Text Classification. Cyber Security; Lu, W., Zhang, Y., Wen, W., Yan, H., Li, C., Eds.; Springer Nature: Singapore, 2022; pp. 226–234.
- Munir, K.; Sheraz Anjum, M. The Use of Ontologies for Effective Knowledge Modelling and Information Retrieval. Appl. Comput. Inform. 2018, 14, 116–126.
- OpenAI GPT 3.5 Turbo. Available online: https://platform.openai.com/docs/models/gpt-3-5-turbo (accessed on 28 June 2024).
- OpenAI GPT-4o. Available online: https://platform.openai.com/docs/models/gpt-4o (accessed on 22 July 2024).
- Sharma, A. 11 Best Generative AI Tools and Platforms. Available online: https://www.turing.com/resources/generative-ai-tools (accessed on 6 July 2024).
- Kothari, S. Top Generative AI Tools: Boost Your Creativity. Available online: https://www.simplilearn.com/tutorials/artificial-intelligence-tutorial/top-generative-ai-tools (accessed on 6 July 2024).
- Techvify Team GPT-3.5 vs. GPT-4: Exploring Unique AI Capabilities. Available online: https://techvify-software.com/gpt-3-5-vs-gpt-4/ (accessed on 14 July 2024).
- Prakoso, D.W.; Abdi, A.; Amrit, C. Short Text Similarity Measurement Methods: A Review. Soft Comput. 2021, 25, 4699–4723.
- Kaggle Kaggle Datasets. Available online: https://www.kaggle.com/datasets (accessed on 25 July 2024).
- GENSIM. Available online: https://radimrehurek.com/gensim/models/word2vec.html (accessed on 25 July 2024).
- Dal Pont, T.R.; Sabo, I.C.; Hübner, J.F.; Rover, A.J. Impact of Text Specificity and Size on Word Embeddings Performance: An Empirical Evaluation in Brazilian Legal Domain. In Proceedings of the Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, 20–23 October 2020; Proceedings, Part I. Springer Verlag: Berlin/Heidelberg, Germany, 2020; pp. 521–535.
Features | ChatGPT-3.5 | ChatGPT-4o | Gemini 1.5 Pro | Claude 3.5 Sonnet |
---|---|---|---|---|
LLM | GPT-3.5 | GPT-4o | Google Gemini | Claude 3.5 |
Number of parameters | 175 Billion [73] | >1 Trillion [73] | Unknown | Unknown |
Context window size ¹ | 4096 tokens [69] (≈3154 words) | 128,000 tokens [70] | 128,000 tokens ² [29] | 200,000 tokens (≈150,000 words) [28] |
Inputs | Text | Text, Images, Documents (Word documents, Plain text files, PDFs, several source code file formats) | Text, Images | Text, Images, Documents (Word documents, Plain text files, PDFs, Spreadsheets, several source code file formats) |
Knowledge cutoff | September 2021 | October 2023 | Early 2023 | April 2024 |
API Access for free tier | API key generation (after sign up), limited number of requests and tokens per minute | API key generation (after sign up), limited number of requests and tokens per minute | API key generation (after sign up), pay-as-you-go pricing for usage | Need approval from the commercial team to obtain an API key, pay-as-you-go pricing for usage |
Tests | Metric | ChatGPT-3.5 | ChatGPT-4o | Gemini-1.5 | Claude-3.5 |
---|---|---|---|---|---|
Test 1: Request for n relevant terms | Number of suggested terms | 15 to 20 | 20 to 25 | 11 to 18 | 16 to 25 |
Test 1 | Number of shared terms (all tools) | 6 to 15 | | | |
Test 1 | Relevancy ¹ | 99% | 99% | 99% | 99% |
Test 2: Request for n relevant terms with n ≥ 30 | Number of suggested terms | 30 to 35 | 30 to 45 | 27 to 30 | 30 to 35 |
Test 2 | Number of shared terms (all tools) | 15 to 21 | | | |
Test 2 | Relevancy | 99% | 99% | 97% | 99% |
Test 3: Request for n relevant terms with n ≥ 60 | Number of suggested terms | 60 to 65 | 60 to 70 | 18 to 26 | 60 |
Test 3 | Number of shared terms (all tools) | 9 to 23 | | | |
Test 3 | Relevancy | 99% | 99% | 98% | 98% |
Test 4: Request for n relevant terms with n < 45 | Number of suggested terms | 20 to 35 | 20 to 43 | 14 to 29 | 20 to 32 |
Test 4 | Number of shared terms (all tools) | 8 to 14 | | | |
Test 4 | Relevancy | 99% | 99% | 99% | 99% |
Results | |
---|---|
Total number of index terms ¹ | 33,217 |
Number of unique suggested terms (with GPT-4o) | 32,546 |
Average number of suggested terms (with GPT-4o) per document ² | 22.41 |
Results | |
---|---|
Total number of index terms ¹ | 12,052 |
Average number of index terms per document ² | 10.24 |
Results | |
---|---|
Total number of index terms ¹ | 30,953 |
Number of unique suggested terms by FintoAI | 29,814 |
Average number of index terms per document ² | 20 |
Indexing Approach | Av Precision | Av Recall | Av F1 |
---|---|---|---|
Traditional | 0.556 | 0.334 | 0.418 |
FintoAI-based | 0.593 | 0.582 | 0.588 |
GenAI-based (GPT-4o) | 0.884 | 0.928 | 0.905 |
Indexing Approach | Av Precision | Av Recall | Av F1 |
---|---|---|---|
Traditional | 0.616 | 0.445 | 0.517 |
FintoAI-based | 0.723 | 0.566 | 0.635 |
GenAI-based (GPT-4o) | 0.895 | 0.915 | 0.905 |
Queries | Av Precision | Av Recall | Av F1 |
---|---|---|---|
Finance-related queries | 0.770 | 0.818 | 0.792 |
COVID-19-related queries | 0.837 | 0.853 | 0.844 |
Sports-related queries | 0.791 | 0.786 | 0.788 |
Finance-related queries using word embeddings | 0.855 | 0.805 | 0.828 |
COVID-19-related queries using word embeddings | 0.861 | 0.879 | 0.870 |
Sports-related queries using word embeddings | 0.813 | 0.811 | 0.811 |
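The precision, recall, and F1 columns in the tables above follow the standard set-based definitions, which can be sketched as follows. This is an illustrative sketch only: the function names and the sample document ids are hypothetical, and the source does not state whether the reported average F1 is the mean of per-query F1 scores or the F1 of the averaged precision and recall, so the per-query averaging shown here is an assumption.

```python
from typing import Iterable, List, Tuple


def precision_recall_f1(retrieved: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for a single query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # F1 is the harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(per_query: List[Tuple[float, float, float]]) -> Tuple[float, float, float]:
    """Average each metric over queries (assumed averaging scheme)."""
    if not per_query:
        return 0.0, 0.0, 0.0
    n = len(per_query)
    return (
        sum(p for p, _, _ in per_query) / n,
        sum(r for _, r, _ in per_query) / n,
        sum(f for _, _, f in per_query) / n,
    )


# Hypothetical query results: retrieved document ids vs. relevant document ids.
scores = [
    precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]),
    precision_recall_f1(["d7", "d8"], ["d7", "d8", "d9", "d10"]),
]
print(macro_average(scores))
```

Averaging per-query F1 scores generally gives a slightly different number than computing F1 from the averaged precision and recall, which is worth keeping in mind when comparing the table rows.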
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Bouzid, S.; Piron, L. Leveraging Generative AI in Short Document Indexing. Electronics 2024, 13, 3563. https://doi.org/10.3390/electronics13173563