Towards Media Monitoring: Detecting Known and Emerging Topics through Multilingual and Crosslingual Text Classification
Abstract
1. Introduction
2. Related Work
- Comparative analysis across techniques. Our study rigorously compares various vectorization and topic classification methods, including trainable/fine-tunable, memory-based, and generative approaches.
- Comparative analysis across multilingual and crosslingual scenarios.
- Addressing closed-set vs. open-set topic classification problems.
- Incorporating additional clustering mechanisms. Besides identifying the emergence of a new topic, we introduce an additional clustering mechanism to differentiate between multiple new topics and accurately label them.
- Integration into our media monitoring system. We integrate our trained models and clustering module into a real media monitoring system, enabling practical use and further testing with real clients.
- Open access dataset. To foster continued research and advancement in the field, we make our datasets publicly available for further investigation.
- How do various vectorization and topic classification methods, including trainable/fine-tunable, memory-based, and generative approaches, compare in terms of effectiveness and accuracy in multilingual and crosslingual scenarios?
- Which solutions are effective in addressing open-set topic classification problems within the context of media monitoring, and to what extent are they effective in terms of accuracy?
3. Formal Definition of the Problem
4. Dataset
5. Approaches
- BERT+CNN. Utilizing the BERT (Bidirectional Encoder Representations from Transformers) model [35] for vectorization, in conjunction with a CNN (Convolutional Neural Network) classifier [36], offers multiple benefits. BERT effectively encodes the semantics of individual words within their contextual environments. These encoded word representations are subsequently input into the CNN, whose role is to extract features from the n-gram patterns of these embeddings, enabling the identification of keywords or phrases within the text. This capability is particularly advantageous for topic detection. Our CNN’s structure is similar to that shown in Figure 3 of [37]. For BERT, the bert-base-multilingual-cased model was chosen (https://huggingface.co/bert-base-multilingual-cased (accessed on 2 May 2022)), which supports 104 languages and is case-sensitive. Training BERT+CNN involved tuning approximately 735 thousand parameters; this method could either solve our current problem effectively or establish a strong baseline.
- XLM-R_fine_tuning. For our classification problem, we employed the XLM-R model as described in [38], making several targeted adjustments. All layers of the model were unfrozen, rendering approximately 278 million parameters trainable. Specifically, we utilized the xlm-roberta-base model (https://huggingface.co/xlm-roberta-base (accessed on 3 March 2023)), a multilingual variant that supports 100 languages.
- LaBSE+FFNN. The LaBSE model [34], a vectorizer, was used in combination with an FFNN classifier. Unlike BERT embeddings, which provide word-level representations, LaBSE specializes in sentence-level embeddings, vectorizing entire texts into single aggregated vectors. This approach effectively handles varying sentence structures with identical semantic meanings, making it ideal for languages featuring flexible sentence structures. The LaBSE model (https://huggingface.co/sentence-transformers/LaBSE (accessed on 3 March 2023)) supports 109 languages and is both multilingual and crosslingual. The sentence vectors generated by LaBSE are input into the FFNN. The FFNN’s architecture and hyperparameters were optimized using Hyperas (https://github.com/maxpumperla/hyperas (accessed on 3 March 2023)) and Hyperopt, which are Python libraries for hyperparameter optimization. We employed the Tree-structured Parzen Estimator (TPE) optimization algorithm, conducting 200 optimization trials to determine optimal settings, including discrete parameters like neuron counts and activation functions, continuous parameters like dropout rates, and conditional parameters such as additional FFNN layers. The training process required adjustments to approximately 1.3 thousand parameters.
- LaBSE_fine_tuning. This process entailed two primary steps: (1) unfreezing all layers of LaBSE and optimizing all parameters, and (2) introducing an additional layer specifically tailored to our classification problem. We utilized the LaBSE2 model (https://tfhub.dev/google/LaBSE/2 (accessed on 3 March 2023)) for fine-tuning purposes. In contrast to the LaBSE+FFNN method, where all LaBSE layers remained frozen, the fine-tuning strategy here involved the adjustment of approximately 490 million parameters.
- LaBSE_LangChain_k1. This method employs the LangChain framework (https://python.langchain.com/ (accessed on 3 March 2023)) designed to build context-aware applications utilizing large language models. It operates based on memory and similarity: training instances are vectorized (omitting validation instances), and a semantic search is performed to find the most similar training instance to the instance under test using cosine similarity. The classification of the test instance is determined by the class of the closest training instance identified in this search. Vectorization is conducted using the LaBSE model.
- LaBSE_LangChain_k10_mv. This approach is an extension of the LaBSE_LangChain_k1 method. It is configured to retrieve the 10 most similar instances rather than just one. A majority voting mechanism is then employed, where the most frequently occurring class among these top 10 instances determines the class label for the testing instance.
- ADA_LangChain_k1. This method is analogous to LaBSE_LangChain_k1, but utilizes OpenAI’s text-embedding-ada-002 model [39] for vectorization instead of the LaBSE model.
- ADA_LangChain_k10_mv. This approach uses the same vectorization model as ADA_LangChain_k1 and follows the methodology of LaBSE_LangChain_k10_mv.
- Davinci_fine_tuning. This generative-based approach employs the Davinci model as discussed in [40]. We configured the model to generate only the first token corresponding to the class label. In our experiments, the Davinci-002 version was fine-tuned using both the training and validation datasets, with all hyperparameters maintained at their default settings.
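The memory-based (LangChain) variants above boil down to nearest-neighbour search over stored training vectors, optionally followed by majority voting. The sketch below illustrates that logic in pure Python; the toy 2-D vectors stand in for LaBSE or ADA embeddings, and all names are our own rather than LangChain’s API:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_classify(query_vec, memory, k=1):
    """memory: list of (vector, class_label) pairs built from the training set.
    k=1 mirrors the *_k1 methods; a larger k with majority voting mirrors *_k10_mv."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]  # majority vote

# Toy 2-D stand-ins for LaBSE/ADA embeddings of training texts:
memory = [([1.0, 0.1], "NATO"), ([0.9, 0.2], "NATO"), ([0.1, 1.0], "ClimateChange")]
label_k1 = knn_classify([0.95, 0.15], memory, k=1)  # "NATO"
label_k3 = knn_classify([0.95, 0.15], memory, k=3)  # "NATO" by majority vote
```

In the actual system, the memory would hold the full training set (validation instances omitted) and k would be 1 or 10.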
6. Experiments and Results
- LaBSE vectors.
- Softmax values for each of the 18 known classes were obtained from the multi-class classification model (specifically, LaBSE+FFNN, which achieved the best results with the MM18x58 dataset) and were used as the feature vector.
- Penultimate layer’s values are taken from the same model that provides the softmax values, but are extracted before the application of the softmax function.
- Cosine similarities to the cluster centers of all 18 known classes were used as feature vectors. These cluster centers were calculated by averaging all LaBSE instance vectors belonging to each of these classes.
- Concatenation (denoted as “+”) of various features.
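As an illustration of the cosine-similarity features and of concatenation, the following sketch computes similarities to known-class centroids from toy stand-in LaBSE vectors; the helper names and numeric values are ours, not taken from the system:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def class_centroids(labeled_vectors):
    """Mean vector per known class (the 'cluster centers' of the cosine feature)."""
    sums, counts = {}, {}
    for vec, label in labeled_vectors:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def similarity_features(vec, centroids):
    """Cosine similarity to each known-class centroid, in a fixed
    (alphabetical) class order."""
    return [cosine(vec, c) for _, c in sorted(centroids.items())]

# Toy 2-D stand-ins for LaBSE vectors of two known classes:
train = [([1.0, 0.0], "NATO"), ([0.8, 0.2], "NATO"), ([0.0, 1.0], "Drugs")]
centroids = class_centroids(train)
feats = similarity_features([0.9, 0.1], centroids)  # [sim_to_Drugs, sim_to_NATO]
softmax_feats = [0.7, 0.3]          # stand-in for the softmax feature vector
combined = softmax_feats + feats    # the "+" concatenation (softmax + cosine)
```

In the real setting, the softmax values would come from the LaBSE+FFNN model and the centroids would be averaged over all 18 known classes.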
7. System’s Architecture
- Vectorization. The test text, denoted as x, is vectorized using the LaBSE vectorization model V, resulting in the vector output v. Specifically, v = V(x).
- Binary classification. The vector v is then passed into the binary classification model B, which returns a sigmoid function value from the interval [0, 1]. If the output B(v) < 0.5, the text is labeled with the known (i.e., stable) class (one of the classes listed in Table 1); otherwise, it is labeled as the new class.
- Multi-class classification (conditional). If B(v) < 0.5, then v is passed to the multi-class classification model M. This model returns pairs of class labels and their softmax probabilities (y_1, p_1), (y_2, p_2), …, where y_1 is the class with the highest probability and is returned irrespective of the threshold. The pairs are ordered such that p_j ≥ p_{j+1} for all j, and every class with p_j ≥ 0.3 is returned. In our system, the threshold value is arbitrarily set at 0.3; since the softmax probabilities sum to 1, this allows a maximum of three classes to be determined.
- Clustering (conditional). If B(v) ≥ 0.5, the input text x and its vector v are passed into the clustering mechanism:
- (a) Data storage. The incoming instance is added to the text store T: T = T ∪ {x}, and its vector to the vector store T_v: T_v = T_v ∪ {v}. The text is machine translated into English, and the translation e is stored in T_e: T_e = T_e ∪ {e}. The machine translation is performed using the Googletrans Python library.
- (b) Check against existing clusters. The algorithm checks whether v can be assigned to one of the n clusters C_1, …, C_n, whose names are represented as keyword collections N_1, …, N_n. Each cluster C_i has a centroid c_i, calculated as the mean vector of all instances belonging to it. The instance is assigned to the cluster C_i for which the Euclidean distance d(v, c_i) is minimal, yet below the threshold r_max. Here, r_max serves as the maximum radius, corresponding to the largest Euclidean distance between any instance in the training dataset and the centroid of its known class, as listed in Table 1. This threshold ensures that the dimensions of newly formed clusters approximate those of the known classes. If v is successfully assigned to cluster C_i, then the cluster’s name, N_i, is returned.
- (c) Reclustering. If v cannot be assigned to any of the current n clusters, this number is incremented by 1, resulting in n + 1. This new number is used as the number-of-clusters argument of the K-means clustering algorithm [43]. The repeats parameter is set to 25 to minimize the impact of initial centroid placements; all other parameters are set to their defaults. The clustering is performed using the nltk.cluster Python library. During clustering, all instances from T, together with their vectors from T_v and their English translations from T_e, are organized into new, non-overlapping clusters C_1, …, C_{n+1}.
- (d) Naming of clusters. For each cluster C_i, all English texts belonging to that cluster are concatenated into a single text document e_i. These documents are then passed into the KeyBERT model, which returns either one keyword with a cosine similarity score above 0.5 or the three keywords with the highest similarity scores. In our implementation, KeyBERT utilizes BERT embeddings with a keyphrase n-gram range of one or two words; all other parameters are set to their defaults. After reclustering and renaming, all stored instances, including x, are relabeled with the new cluster names N_1, …, N_{n+1}. The name N_i of the cluster to which x is assigned is then returned.
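The conditional logic of the steps above can be sketched in pure Python. The sketch is illustrative only: the binary and multi-class models, LaBSE, and KeyBERT are replaced by stand-in values, and all function names and numbers are our own rather than taken from the deployed system:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def multiclass_labels(softmax_probs, threshold=0.3):
    """Multi-class step: return every class with probability >= threshold, ordered
    by probability; the top class is always returned, irrespective of the threshold."""
    ranked = sorted(softmax_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [(c, p) for c, p in ranked if p >= threshold] or [ranked[0]]

def assign_cluster(vec, centroids, r_max):
    """Step (b): assign to the nearest cluster centroid if within r_max;
    None means the instance triggers reclustering with n + 1 clusters."""
    name, dist = min(((n, euclidean(vec, c)) for n, c in centroids.items()),
                     key=lambda nd: nd[1])
    return name if dist < r_max else None

def cluster_name(scored_keywords):
    """Step (d) naming rule: one keyword if its similarity score exceeds 0.5,
    otherwise the three highest-scoring keywords."""
    ranked = sorted(scored_keywords, key=lambda kw: kw[1], reverse=True)
    return [ranked[0][0]] if ranked[0][1] > 0.5 else [kw for kw, _ in ranked[:3]]

# Illustrative values (not taken from the system):
labels = multiclass_labels({"NATO": 0.55, "Europol": 0.35, "Drugs": 0.10})
# labels == [("NATO", 0.55), ("Europol", 0.35)]
cluster = assign_cluster([0.1, 0.1], {"c1": [0.0, 0.0], "c2": [5.0, 5.0]}, r_max=1.0)
# cluster == "c1"
name = cluster_name([("energy", 0.62), ("market", 0.40)])
# name == ["energy"]
```

In the deployed pipeline, assign_cluster returning None would be followed by K-means reclustering over all stored instances, as described in step (c).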
8. Discussion
- How effective is the best solution to this text classification problem?
- Which scenario—multilingual or crosslingual—is more suitable for solving our problem? This will determine whether machine translation into English is necessary or whether methods can be applied directly to the original texts in various languages.
- Which categories are predicted least accurately, and what explains these shortcomings?
- Which type (i.e., trainable/fine-tunable, memory-based, or generative) and which of the tested approaches are most recommended for solving our problem, and which are not?
- How accurately can the binary classifier distinguish between known and new topics?
- What is the optimal feature type for the binary classifier, and why are other types less effective?
- How does the performance of the open-class classifier vary across specific topics?
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Harro-Loit, H.; Eberwein, T. News Media Monitoring Capabilities in 14 European Countries: Problems and Best Practices. Media Commun. 2024, 12.
- Grizāne, A.; Isupova, M.; Vorteil, V. Social Media Monitoring Tools: An In-Depth Look; NATO Strategic Communications Centre of Excellence: Riga, Latvia, 2022.
- Steinberger, R.; Ehrmann, M.; Pajzs, J.; Ebrahim, M.; Steinberger, J.; Turchi, M. Multilingual Media Monitoring and Text Analysis—Challenges for Highly Inflected Languages. In Proceedings of the Text, Speech, and Dialogue, Pilsen, Czech Republic, 1–5 September 2013; Habernal, I., Matoušek, V., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 22–33.
- Steinberger, R. Multilingual and Cross-Lingual News Analysis in the Europe Media Monitor (EMM); Springer: Berlin/Heidelberg, Germany, 2013; pp. 1–4.
- Steinberger, R.; Ombuya, S.; Kabadjov, M.; Pouliquen, B.; Della Rocca, L.; Belyaeva, E.; De Paola, M.; Ignat, C.; Van Der Goot, E. Expanding a multilingual media monitoring and information extraction tool to a new language: Swahili. Lang. Resour. Eval. 2011, 45, 311–330.
- Pajzs, J.; Steinberger, R.; Ehrmann, M.; Ebrahim, M.; Della Rocca, L.; Bucci, S.; Simon, E.; Váradi, T. Media monitoring and information extraction for the highly inflected agglutinative language Hungarian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 2049–2056.
- Thurman, N.; Hensmann, T. Social Media Monitoring Apps in News Work: A Mixed-Methods Study of Professional Practices and Journalists’ and Citizens’ Opinions. Available online: https://ssrn.com/abstract=4393018 (accessed on 5 February 2024).
- Perakakis, E.; Mastorakis, G.; Kopanakis, I. Social Media Monitoring: An Innovative Intelligent Approach. Designs 2019, 3, 24.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.G.; Salakhutdinov, R.; Le, Q.V. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692.
- Alcoforado, A.; Ferraz, T.P.; Gerber, R.; Bustos, E.; Oliveira, A.S.; Veloso, B.M.; Siqueira, F.L.; Costa, A.H.R. ZeroBERTo: Leveraging Zero-Shot Text Classification by Topic Modeling. In Proceedings of the Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–23 March 2022; Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C., Pinto, H., Eds.; Springer: Cham, Switzerland, 2022; pp. 125–136.
- Liu, C.; Zhang, W.; Chen, G.; Wu, X.; Luu, A.T.; Chang, C.H.; Bing, L. Zero-Shot Text Classification via Self-Supervised Tuning. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 1743–1761.
- Ebrahimi, A.; Mager, M.; Oncevay, A.; Chaudhary, V.; Chiruzzo, L.; Fan, A.; Ortega, J.; Ramos, R.; Rios, A.; Meza Ruiz, I.V.; et al. AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 6279–6299.
- Song, Y.; Upadhyay, S.; Peng, H.; Mayhew, S.; Roth, D. Toward any-language zero-shot topic classification of textual documents. Artif. Intell. 2019, 274, 133–150.
- Mutuvi, S.; Boros, E.; Doucet, A.; Jatowt, A.; Lejeune, G.; Odeo, M. Multilingual Epidemiological Text Classification: A Comparative Study. In Proceedings of the 28th International Conference on Computational Linguistics, Virtual, 8–13 December 2020; pp. 6172–6183.
- Wang, C.; Banko, M. Practical Transformer-based Multilingual Text Classification. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Virtual, 6–11 June 2021.
- Dhananjaya, V.; Demotte, P.; Ranathunga, S.; Jayasena, S. BERTifying Sinhala—A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 7377–7385.
- Manias, G.; Mavrogiorgou, A.; Kiourtis, A.; Symvoulidis, C.; Kyriazis, D. Text categorization and sentiment analysis: A comparative analysis of the utilization of multilingual approaches for classifying twitter data. Neural Comput. Appl. 2023, 35, 21415–21431.
- Barbieri, F.; Espinosa Anke, L.; Camacho-Collados, J. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 258–266.
- Kapočiūtė-Dzikienė, J.; Salimbajevs, A.; Skadiņš, R. Monolingual and Cross-Lingual Intent Detection without Training Data in Target Languages. Electronics 2021, 10, 1412.
- Shi, L.; Mihalcea, R.; Tian, M. Cross Language Text Classification by Model Translation and Semi-Supervised Learning. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, USA, 9–11 October 2010; pp. 1057–1067.
- Karamanolakis, G.; Hsu, D.; Gravano, L. Cross-Lingual Text Classification with Minimal Resources by Transferring a Sparse Teacher. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 3604–3622.
- Xu, R.; Yang, Y. Cross-lingual Distillation for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1: Long Papers, pp. 1415–1425.
- Dong, X.; de Melo, G. A Robust Self-Learning Framework for Cross-Lingual Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 6306–6310.
- Chen, X.; Awadallah, A.H.; Hassan, H.; Wang, W.; Cardie, C. Multi-Source Cross-Lingual Model Transfer: Learning What to Share. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 3098–3112.
- Xu, W.; Haider, B.; Mansour, S. End-to-End Slot Alignment and Recognition for Cross-Lingual NLU. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 5052–5063.
- Wang, Z.; Liu, X.; Yang, P.; Liu, S.; Wang, Z. Cross-lingual Text Classification with Heterogeneous Graph Neural Network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Online, 1–6 August 2021; Volume 2: Short Papers, pp. 612–620.
- Barnes, J. Sentiment and Emotion Classification in Low-resource Settings. In Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Toronto, ON, Canada, 14 July 2023; pp. 290–304.
- Nishikawa, S.; Yamada, I.; Tsuruoka, Y.; Echizen, I. A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 1–12.
- Yang, Z.; Cui, Y.; Chen, Z.; Wang, S. Cross-Lingual Text Classification with Multilingual Distillation and Zero-Shot-Aware Training. arXiv 2022, arXiv:2202.13654.
- Prakhya, S.; Venkataram, V.; Kalita, J. Open Set Text Classification Using CNNs. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), Kolkata, India, 18–21 December 2017; pp. 466–475.
- Bendale, A.; Boult, T.E. Towards Open Set Deep Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1563–1572.
- Yang, Z.; Emmert-Streib, F. Optimal performance of Binary Relevance CNN in targeted multi-label text classification. Knowl.-Based Syst. 2024, 284, 111286.
- Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 878–891.
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882.
- Kapočiūtė-Dzikienė, J.; Balodis, K.; Skadiņš, R. Intent Detection Problem Solving via Automatic DNN Hyperparameter Optimization. Appl. Sci. 2020, 10, 7426.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451.
- Greene, R.; Sanders, T.; Weng, L.; Neelakantan, A. New and Improved Embedding Model. Available online: https://openai.com/blog/new-and-improved-embedding-model (accessed on 15 December 2022).
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: New York, NY, USA, 2020; Volume 33, pp. 1877–1901.
- Gosset, W.S. The Probable Error of a Mean. Biometrika 1908, 6, 1–25.
- Ross, A.; Willson, V.L. One-Sample T-Test. In Basic and Advanced Statistical Tests: Writing Results Sections and Creating Tables and Figures; SensePublishers: Rotterdam, The Netherlands, 2017; pp. 9–12.
- Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035.
Topic | Training | Validation | Testing |
---|---|---|---|
1. AlternativeEnergy | 10 | 19 | 14 |
2. ClimateAction | 10 | 14 | 23 |
3. ClimateChange | 475 | 484 | 358 |
4. CoronavirusInfection | 71 | 108 | 95 |
5. CybersecurityAntifraud | 62 | 118 | 122 |
6. Drugs | 18 | 39 | 39 |
7. EUEconomy | 28 | 28 | 23 |
8. EUInternet | 13 | 25 | 16 |
9. EnergyMarketsandStrategies | 6 | 10 | 1 |
10. EuropeanCouncil | 107 | 174 | 111 |
11. Europol | 169 | 432 | 233 |
12. FightagainstFraud | 21 | 24 | 38 |
13. FinancialEconomicCrime | 6 | 9 | 15 |
14. ForgeryMoney | 2 | 3 | 4 |
15. InformationSecurity | 26 | 31 | 56 |
16. Migration | 49 | 65 | 67 |
17. NATO | 3516 | 3865 | 3415 |
18. TerroristAttack | 92 | 117 | 125 |
In total: | 4681 | 5565 | 4755 |
Category | Training | Validation | Testing |
---|---|---|---|
Known (negative) | 4681 | 5565 | 4755 |
New (positive) | 11,350 | 13,665 | 13,647 |
In total: | 16,031 | 19,230 | 18,402 |
Features | Avg. Loss ± Confidence Int. |
---|---|
LaBSE vectors | 0.0017 ± 0.0002 |
1. Softmax values | 0.0346 ± 0.0014 |
2. Penultimate layer’s values | 0.1101 ± 0.0596 |
3. Cosine similarities | 0.0049 ± 0.0077 |
1 + 2 | 0.0661 ± 0.0850 |
1 + 3 | 0.0035 ± 0.0031 |
2 + 3 | 0.0137 ± 0.0334 |
1 + 2 + 3 | 0.0057 ± 0.0036 |
Class X | Numb. of Tested Instances | Accuracy |
---|---|---|
1. AlternativeEnergy | 43 | 1.000 |
2. ClimateAction | 47 | 1.000 |
3. ClimateChange | 1317 | 0.992 |
4. CoronavirusInfection | 274 | 0.971 |
5. CybersecurityAntifraud | 302 | 0.990 |
6. Drugs | 96 | 0.906 |
7. EUEconomy | 79 | 0.941 |
8. EUInternet | 54 | 1.000 |
9. EnergyMarketsandStrategies | 17 | 1.000 |
10. EuropeanCouncil | 392 | 0.980 |
11. Europol | 834 | 0.978 |
12. FightagainstFraud | 83 | 0.940 |
13. FinancialEconomicCrime | 30 | 1.000 |
14. ForgeryMoney | 9 | 1.000 |
15. InformationSecurity | 113 | 1.000 |
16. Migration | 181 | 0.972 |
17. NATO | 10,796 | 0.796 |
18. TerroristAttack | 334 | 0.997 |
Topic | Topic | Cosine Similarity |
---|---|---|
Top 10 topic pairs with the highest cosine similarity | ||
ClimateAction | ClimateChange | 0.836 |
EUEconomy | EuropeanCouncil | 0.802 |
AlternativeEnergy | EnergyMarketsandStrategies | 0.791 |
CybersecurityAntifraud | InformationSecurity | 0.713 |
EUEconomy | EnergyMarketsandStrategies | 0.711 |
FightagainstFraud | FinancialEconomicCrime | 0.702 |
EUEconomy | EUInternet | 0.694 |
EUInternet | EuropeanCouncil | 0.690 |
EuropeanCouncil | Europol | 0.677 |
CybersecurityAntifraud | EUInternet | 0.635 |
Last 10 topic pairs with the lowest cosine similarity | ||
ClimateChange | Drugs | 0.234 |
EUEconomy | FightagainstFraud | 0.231 |
ClimateAction | Drugs | 0.229 |
AlternativeEnergy | CoronavirusInfection | 0.225 |
AlternativeEnergy | FightagainstFraud | 0.220 |
AlternativeEnergy | ForgeryMoney | 0.216 |
EuropeanCouncil | FightagainstFraud | 0.206 |
EnergyMarketsandStrategies | FightagainstFraud | 0.203 |
ClimateAction | ForgeryMoney | 0.180 |
FightagainstFraud | NATO | 0.099 |
Share and Cite
Kapočiūtė-Dzikienė, J.; Ungulaitis, A. Towards Media Monitoring: Detecting Known and Emerging Topics through Multilingual and Crosslingual Text Classification. Appl. Sci. 2024, 14, 4320. https://doi.org/10.3390/app14104320