Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis
Abstract
1. Introduction
2. Analysis of Existing Datasets
3. Problem Undertaken
3.1. Problem Definition
1. Identification Problem: Define $T$ as a non-empty set, $T = \{t_1, t_2, \ldots, t_n\}$, where each $t_i$ represents an individual tweet. Define $P$ as the subset of $T$, $P \subseteq T$, whose tweets exhibit peculiar communication patterns as defined by a set of pragmatic criteria $C$. The problem is to determine a function $f: T \to \{0, 1\}$ such that for any tweet $t \in T$, $f(t) = 1$ if $t \in P$, and $f(t) = 0$ otherwise.
2. Curation Problem: Upon identification, curate $P$ into a structured dataset $D$, where $D$ is a finite ordered list of elements from $P$, $D = (d_1, d_2, \ldots, d_k)$ with $d_j \in P$ for every $j$. This curation must be performed using pragmatic behaviour analysis, which includes the identification of indirect speech acts, figurative language, and other relevant pragmatic phenomena, without the employment of advanced analytical techniques such as data mining, machine learning, or deep learning.
3. Categorial Problem: Define a category $m$ that quantifies $P$ based on the principles of deep pragmatic analysis of the tweets within $P$. The category $m$ should be able to measure the prevalence of the peculiar communication patterns within $T$ in a transparent and reproducible manner.
4. Application Problem: Identify potential applications $A$ of the curated dataset $D$, where $A$ may include, but is not limited to, sentiment analysis $s$. The sentiment analysis $s$ is a function $s: P \to S$, where $S$ is the set of possible sentiment categories, and it aims to classify the sentiment of each tweet in $P$ as it pertains to the topic of world war and conflicts.
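The objects in the problem definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `make_identifier`, `curate`, and `prevalence` are hypothetical, the criteria are modelled as simple predicates over the tweet text, a tweet is assumed to belong to P when any criterion holds, and the prevalence measure is assumed to be the share of T that falls in P. In the study itself the criteria are applied through pragmatic analysis rather than automated rules.

```python
from typing import Callable, Iterable, List

# Hypothetical representation of one pragmatic criterion: a predicate on tweet text.
Criterion = Callable[[str], bool]


def make_identifier(criteria: Iterable[Criterion]) -> Callable[[str], int]:
    """Build the indicator function: 1 if the tweet is in P, 0 otherwise.

    Assumption: a tweet belongs to P when at least one criterion holds.
    """
    criteria = list(criteria)

    def f(tweet: str) -> int:
        return 1 if any(c(tweet) for c in criteria) else 0

    return f


def curate(tweets: Iterable[str], f: Callable[[str], int]) -> List[str]:
    """Return D: the identified tweets in their original order, without repeats."""
    seen = set()
    dataset = []
    for t in tweets:
        if f(t) == 1 and t not in seen:
            seen.add(t)
            dataset.append(t)
    return dataset


def prevalence(tweets: Iterable[str], f: Callable[[str], int]) -> float:
    """Assumed prevalence measure: fraction of tweets in T that fall in P."""
    tweets = list(tweets)
    return sum(f(t) for t in tweets) / len(tweets)


# Toy data and a toy criterion (rhetorical question), both invented for illustration.
T = ["Kya yeh shanti hai?", "War news today", "Kya yeh shanti hai?", "Sab theek hai"]
f = make_identifier([lambda t: t.endswith("?")])
D = curate(T, f)
```

Keeping the identifier, the curated list, and the prevalence measure as separate functions mirrors the separation of the identification, curation, and categorial problems.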
3.2. Formal Problem Statement
4. Methodology
4.1. Seed Words as Search Filters
4.2. Expanding the Search Filters
Cluster | Keywords |
---|---|
Fear | Terrorism, Invasion, Attack, Nuclear weapons, Aatankwad, Aakraman, Hamla, Paramanu Hathiyaron, Bhayavah, Bhayankar, Khatarnak, Darawana, Aggression, Assault, Offensive, Incursion, Ambush, Raid, Onslaught, Nuclear Threat, Atomic Menace, Radiation Danger, Nuke Hazard, Violence, Hostility, Belligerence, Conflict, Encroachment, Trespassing, Intrusion, Extremism, Fanaticism, Radicalism, Militancy, Insurgency, Rebellion, Uprising, Sabotage, Subversion, Undermining, Treachery, Menace, Peril, Threat, Hazard, Provocation, Incitement, Stimulus, Aggravation, Irritation, Offensiveness, Displeasure, Dislike, Aversion, Menacing, Intimidating, Threatening, Scary, Attack on sovereignty, Offensive on territory, Breach of borders, Infringement on independence, Danger, Risk, Perilousness, Hazardousness, Destabilisation, Turmoil, Upheaval, Chaos, Agony, Distress, Torment, Affliction, Coercion, Duress, Pressure, Panic, Alarm, Hysteria, Fright, Brutality, Ferocity, Savagery, Barbarity, Military attack, Offensive campaign, Atomic Armament, Nuclear Missile, Radiological Weapon, Warhead, Insurrection, Revolt, Mutiny, Coup, Atrocity, Barbarism, Inhumanity, Cruelty, Confrontation, Hostilities, Clashes, Breach of defence, Infringement on protection, Threat to safety, Outrage, Fury, Indignation, Resentment, Intrusive, Obtrusive, Meddlesome, Nosy, Antagonism, Rivalry, Enmity, Assault with arms, Offensive with weapons, Armed attack, Strike with artillery, Radiation hazard, Radioactive peril, Rebellion against government, Uprising against authority, Insurgency against state, Revolt against administration, Human rights abuse, Oppression, Tyranny, Despotism, Battle, War, Combat, Skirmish. |
Hope | Peace, Diplomacy, Resolution, Shanti, Shaanti, Aaram, Kootniti, Kootneeti, Rajneeti, Baatchee, Baatchit, Varta, Samadhan, Samaadhaan, Hal, Umed, Umeed, Asha, Shanti: Shantipurn, Niraashaant, Anand, Aaram: Vishraam, Thaharana, Rehat, Sakoon, Kootneeti/Rajneeti: Satta, Sarkaar, Prabandhan, Vyavastha, Sambhashan, Sandesh, Sanvaad, Prastutikaran, Samadhan/Samaadhaan: Hal, Halvayi, Nivaran, Nirdhaar, Aashirvaad, Aas, Aasha |
Anger | Violence, Aggression, Conflict, Occupation, Intrusion, Struggle, Business, Anger, Opposition, War, Tension, Clash, Change, Retaliation, Fight, Resistance, Competition, Quarrel, Dilemma, Blow, Collision, Right, Expression, Freedom, Independence, Obstruction, Barrier, Battle, Rescue, Oppression, Reaction, Krodh, Hinsa, Pratidwand, Jang, Tanaav, Takraav, Badlaav, Pratikar, Dangal, Pratirodh, Mukabala, Jhagda, Uljhan, Thokar, Takkar, Pratipaksha, Adhikar, Abhivyakti, Mukti, Swatantrata, Roktham, Avrodh, Pratibandh, Samara, Paritran, Utpidan, Yuddh, Pratikriya, Anushasan, Aakramakata, Sangharsh, Vyavasaay. |
Joy | Victory, Liberation, Freedom, Independence, Jeet, Azaadi, Swatantrata, Vijay, Jit, Fateh, Safalta, Baajti, Mukti, Chhoot, Nijaat, Nirmaan, Aatma, Aatma kush, Nirbharata, Swavalamban, Swabhimaan, Adhikar, Pratishodh, Kamyabi, Uddhar, Pragati, Utkarsh, Jayate, Siddhi, Triumph, Conquest, Subjugation, Overcoming, Accomplishment, Elation, Delight, Jubilation, Ecstasy, Exhilaration, Gratitude, Joyousness, Festivity, Rejoicing, Exultation, Thrill, Euphoria, Rapture, Bliss, Gleefulness, Merriment, Jollity, Glee, Cheerfulness, Happiness, Cheer, Enthusiasm, Excitement, Thriving, Flourishing, Advancement, Progression, Growth, Development, Prosperity, Success, Achievements, Triumphalism, Celebrations, Revelry, Exultancy, Fulsomeness, Contentment, Satiety, Gratification, Pleasure, Self-satisfaction, Delightfulness, Gladness, Exuberance, Zeal, Passion, Blissfulness. |
Sad | Loss, Death, Trauma, Suffering, Sadness, Loss, Death, Trauma, Suffering, Haani, Maut, Trauma, Peeda, insecurity, vulnerability, despair, hopelessness, helplessness, sadness, sorrow, anguish, agony, misery, grief, mourning, Asuraksha, Bhay, Nirasha, Niraasha, Niraashrit, Dukh, Dard, Takleef, Bechaini, Udasi, Shok, Dukh, Vyatha, Dard, Duhkh, Matam, Rona, Bebasi, Tanhayi, Virah, Alvida, Tanhaai, Andhera, Ashantata, Vipada, Sangharsh, Kasht, Dardnak, Dardbhara, Duvidha, Asahayta, Samvedana, Sankat, Vilap, Nafrat, Anath, Nirjivta, Ulat Palat, Akshamta, Bebas, Bevakoof, Khafa, Betahasha, Nirasha, Vilamb, Vair, Bhagna Hriday, Dardnaak, Maafi, Mafinama, Nirnay, Nisantaan, Durghatna, Sthayitva, Asthayi, Virahita, Virodh, Ashru, Rulaana, Daridrata, Baychani, Tootna, Bhagna, Durbhagya, Durbalata, Nirlajja, Nirlajjata, Hani, Bayanak, Sunsaan, Vyakulta, Vipatti, Aansoon, Afsos, Afsosnaak, Asafalta, Asafal, Asafalata, Udhas, Udasi, Man-hi-man, Hichaki, Haath-diya, Dar, Udaasi, Lachar |
Surprise | Breakthrough, Diplomatic relations, Unprecedented events, Aashcharyajanak Safalta, Rajneetik Rishte, Anoothi Ghatnaen, Astonishment, miracle, Revelation, Startling, Unexpected, Eye-opener, Stunner, Phenomenon, Thunderbolt, Wonderment, Mystery, Unexpectedness, Shocking, Stupefaction, Serendipity, Epiphany, Mind-blowing, Wondrous, Paradigm shift, Mind-boggling, Unpredictable, Unforeseen, rare, Out of the blue, Remarkable, Unanticipated, Puzzlement, Enigma, Surprise attack, Mind-bending, Revolutionary, Bewilderment, Unusual course of events, Newsworthy, Jolt, Awe-inspiring, Impressive, Catching off guard, Unexpected turn, Mysteriousness, Unexpected twist, Staggering |
Disgust | Genocide, Atrocities, Human rights violations, War crimes, Nafrat, Narsanhaar, Apraadh, Manavaadhikaaron ka Ullekh, Yudh Apraadh, Brutality, Cruelty, Oppression, Injustice, Discrimination, Prejudice, Racism, Homophobia, Xenophobia, Bigotry, Intolerance, Hatred, Loathing, Abomination, Revulsion, Contempt, Disdain, Dislike, Disapproval, Disgust, Abhorrence, Repugnance, Aversion, Antipathy, Odium, Detestation, Despise, Scorn, Malice, Animosity, Hostility, Enmity, Agony, Torment, Aggravation, Malignity, Spitefulness, Vengefulness, Resentment, Bitterness, Displeasure, Discomfort, Discontent, Disquietude, Unease, Annoyance, Irritation, Frustration, Anguish, Misery, Wretchedness, Affliction, Tribulation, Hardship, Suffering, Opprobrium, Shame, Disgrace, Embarrassment, Humiliation, Degradation, Ignominy, Infamy, Scandal, Reproach, Contumely, Insult, Defamation, Libel, Slander, Calumny, Falsehood, Deceit, Betrayal, Infidelity, Treason, Perfidy, Duplicity, Fraud, Corruption, Iniquity, Sin, Vice, Immorality, Decadence, Depravity, Aberration, Deviation, Perversion, Lewdness, Obscenity, Profligacy, Impurity, Indecency, Blasphemy, Sacrilege, Profanity, Heresy, Apostasy, Ignorance, Stupidity, Foolishness, Ineptitude, Incompetence, Ineffectiveness, Negligence, Sloth, Procrastination, Apathy, Indifference, Insensitivity, Callousness |
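To illustrate how the expanded keyword clusters in the table above could act as search filters, the following sketch matches a tweet against a handful of keywords copied from three of the clusters. It is an assumption-laden example, not the procedure used in the study: the function name `matching_clusters`, the whole-word matching rule, and the sample tweet are all hypothetical, and only a small excerpt of each keyword list is included.

```python
import re
from typing import Dict, List, Set

# A small excerpt of the cluster table; the full keyword lists are much longer.
CLUSTERS: Dict[str, Set[str]] = {
    "Fear": {"terrorism", "invasion", "aatankwad", "hamla", "nuclear threat"},
    "Hope": {"peace", "diplomacy", "shanti", "umeed", "asha"},
    "Sad": {"loss", "dukh", "udasi", "maut"},
}


def matching_clusters(tweet: str, clusters: Dict[str, Set[str]] = CLUSTERS) -> List[str]:
    """Return the names of all clusters with at least one keyword in the tweet.

    Matching is case-insensitive and on whole words or phrases (an assumption).
    """
    text = tweet.lower()
    hits = []
    for name, keywords in clusters.items():
        for kw in keywords:
            if re.search(r"\b" + re.escape(kw) + r"\b", text):
                hits.append(name)
                break  # one matching keyword is enough for this cluster
    return hits


# Invented Hinglish example: matches one Fear keyword and one Hope keyword.
example = matching_clusters("Border par hamla hua, par umeed abhi baaki hai")
```

A tweet can match several clusters at once, which is why the function returns a list rather than a single label.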
5. Topic Inferring and Content Analysis
5.1. Deep Demonstration through Case Study
1. Contextual Understanding Principle (CUP): This principle emphasises the importance of understanding each word and phrase within its linguistic and cultural context. It is crucial for analysing code-mixed languages like Hinglish, where cultural idioms and linguistic structures are deeply intertwined and complex.
2. Emotional and Sentiment Mapping Principle (ESMP): This principle involves identifying and mapping the emotional or sentiment value of words and phrases to understand the overall emotional tone of the text. It is key in recognising and categorising emotional words and the broader emotional states they imply.
3. Lexical and Semantic Analysis Principle (LSAP): This principle focuses on analysing the meaning of words and phrases, both individually and collectively, to understand their semantic roles in the text. It is essential for suggesting particular themes or topics and understanding the implications of word combinations.
4. Cultural and Sociolinguistic Relevance Principle (CSRP): This principle recognises the importance of cultural insights and sociolinguistic factors in interpreting language, particularly in contexts deeply embedded with cultural nuances like Hinglish. It emphasises understanding words beyond their direct translations to include cultural undertones.
5. Coherence and Cohesion in Text Principle (CCTP): This principle looks at how words and phrases contribute to creating coherent and cohesive messages, themes, or narratives within the text. It is important for noticing patterns or common themes across different texts so that the annotation accurately reflects the predominant theme or sentiment of the cluster.
5.2. Inferences and Findings
6. Result and Discussion
1. The dataset contains tweets that are pertinent to the topic of world wars and conflicts, allowing researchers to concentrate their analysis on the tweets most relevant to the issue at hand. The world currently faces multiple conflicts, including Ukraine-Russia, Indo-Pakistan, Indo-China, China-Taiwan, North Korea, and the US Cold War, among others.
2. By compiling the odd tweets into a structured dataset, the research improves the data’s transparency and makes it simpler for other researchers to reproduce the analysis and draw their own conclusions about the emotions, opinions, and sentiments surrounding world war and regional conflicts.
3. The clustering enables researchers to categorise odd tweets related to world war and conflicts, which can provide insight into how frequently this topic is discussed on Twitter/‘X’ and how the discussion evolves over time.
4. The dataset was curated using open social media tools, with the help of data mining methods [49]. This research provides foundational insights into the communication patterns of Twitter/‘X’ users and helps identify indirect speech acts or figurative language employed in messages regarding war and conflicts.

Filling Data Availability Gaps
This research addresses a significant gap in publicly available datasets focusing on emotions in conflict and war scenarios. By providing this resource, the work enables other researchers, policymakers, and organizations to understand and respond to the emotional dimensions of conflicts more effectively.

Additional Auxiliary Work
The contribution extends to auxiliary work that enhances the main research. This includes developing tools for easier dataset access, conducting preliminary studies to validate the dataset’s utility, and engaging with communities and experts for feedback. Such efforts support the dataset’s continuous improvement and wider adoption.

Integration of Deep Pragmatic Analysis with Topic Clustering
A distinctive contribution of this research is the enhancement of topic analysis and clustering approaches, typically reliant on algorithms like Latent Dirichlet Allocation (LDA), with the incorporation of Deep Pragmatic Analysis. This integration addresses the limitations of traditional topic modeling techniques, which often miss the subtleties of human communication such as sarcasm, metaphor, and context-dependent meanings. Deep Pragmatic Analysis examines the linguistic and contextual nuances of the text, providing a more accurate understanding of text data, especially in the emotionally charged and complex narratives found in war and conflict contexts. This approach improves the interpretability and usefulness of the topic models.

Overall, the research work stands out for its comprehensive approach to creating a valuable emotional dataset, its validation process, and its emphasis on practical application in understanding and addressing the human impacts of war and conflicts.
Evaluation and Validation of Work
1. The judges considered the given annotations for each keyword cluster to be appropriate emotional expressions/labels. The average intensity ratings across all raters were high (ranging from 4.14 to 4.29 on a scale of 1 to 5), indicating strong emotional expression in the labels. The high agreement on primary emotions among raters, as confirmed by the pairwise rater agreement matrix (positive values ranging from 0.18 to 1), supports a good level of consistent primary emotion identification.
2. The judges expressed high confidence in the annotation process and a strong belief in the accuracy of the emotion annotations produced by the deep pragmatic analysis based on the five principles. The average confidence ratings range from 4.14 to 4.29 (on a scale of 1 to 5), indicating high confidence. The pairwise agreement matrix shows moderate to high consistency among raters (positive values, e.g., 0.18 to 1), as does the histogram of Kappa scores, which indicates a moderate level of agreement between rater pairs. Overall, there is a high level of agreement among raters regarding emotion intensity and primary emotion identification. This conclusion is supported by the consistency observed in the average ratings and the pairwise agreement matrix values.
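The pairwise agreement values discussed above can be reproduced in principle with a short script. The sketch below assumes that the Kappa scores are Cohen's kappa computed for every pair of raters on their primary-emotion labels, which the text does not state explicitly. The function names and the three raters' labels are hypothetical and serve only to show the calculation.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def pairwise_kappa(ratings: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Kappa for every unordered pair of raters (the upper triangle of the matrix)."""
    return {
        (r1, r2): cohen_kappa(ratings[r1], ratings[r2])
        for r1, r2 in combinations(ratings, 2)
    }


# Invented primary-emotion labels for four snippets from three raters.
ratings = {
    "R1": ["Fear", "Fear", "Hope", "Sad"],
    "R2": ["Fear", "Fear", "Hope", "Sad"],
    "R3": ["Fear", "Hope", "Hope", "Sad"],
}
matrix = pairwise_kappa(ratings)
```

A histogram of the values in `matrix` would correspond to the histogram of Kappa scores mentioned in the text.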
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
NLTK | Natural Language Toolkit |
DL | Deep Learning |
LDA | Latent Dirichlet Allocation |
Appendix A
1. Emotion Intensity: On a scale of 1 (not at all) to 5 (extremely strong), how strongly are emotions expressed in this snippet?
2. Primary Emotion: What is the primary emotion expressed in this snippet? (Choose one from a provided list of emotions covered by the algorithm)
3. Secondary Emotions (Optional): Are there any additional emotions present, though less prominent? (Select from the list or indicate “None”)
4. Confidence in Algorithm Annotation: How confident are you that the algorithm’s emotion annotation for this snippet is accurate? (1-Not at all confident, 5-Extremely confident)
5. Clarity of Explanation (Optional, if the algorithm provides explanations): If the algorithm provides an explanation for its annotation, is it clear and understandable? (1-Not clear at all, 5-Very clear)
6. Additional Comments (Optional): Do you have any additional comments or observations about the annotation methodology used?
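One possible way to store the answers to this questionnaire is a small validated record per rater and snippet. The sketch below is hypothetical: the class name `RaterResponse`, its field names, and the emotion list (assumed here to be the seven keyword cluster names) are illustrative choices, not part of the published instrument.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed emotion list: the seven keyword cluster names used in the paper.
EMOTIONS = ("Fear", "Hope", "Anger", "Joy", "Sad", "Surprise", "Disgust")


@dataclass
class RaterResponse:
    intensity: int                      # question 1, scale 1 to 5
    primary_emotion: str                # question 2, one of EMOTIONS
    confidence: int                     # question 4, scale 1 to 5
    secondary_emotions: List[str] = field(default_factory=list)  # question 3, optional
    clarity: Optional[int] = None       # question 5, optional, scale 1 to 5
    comments: str = ""                  # question 6, optional

    def __post_init__(self) -> None:
        # Validate the two mandatory 1-to-5 ratings.
        for name in ("intensity", "confidence"):
            if not 1 <= getattr(self, name) <= 5:
                raise ValueError(f"{name} must be between 1 and 5")
        # Validate the optional clarity rating only when it is given.
        if self.clarity is not None and not 1 <= self.clarity <= 5:
            raise ValueError("clarity must be between 1 and 5")
        # Every named emotion must come from the provided list.
        for emotion in [self.primary_emotion, *self.secondary_emotions]:
            if emotion not in EMOTIONS:
                raise ValueError(f"unknown emotion: {emotion}")


# Invented example response.
response = RaterResponse(intensity=4, primary_emotion="Fear", confidence=5,
                         secondary_emotions=["Sad"])
```

Rejecting out-of-range ratings at construction time keeps later averages, such as the mean intensity and confidence ratings, free of invalid entries.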
References
- Zimbra, D.; Abbasi, A.; Zeng, D.; Chen, H. The state-of-the-art in twitter sentiment analysis: A review and benchmark evaluation. ACM Trans. Manag. Inf. Syst. 2018, 9, 3185045.
- Tao, W.; Peng, Y. Differentiation and unity: A Cross-platform Comparison Analysis of Online Posts’ Semantics of the Russian–Ukrainian War Based on Weibo and Twitter. Commun. Public 2023, 8, 105–124.
- Zadeh, M.H.; Cicekli, I. Protest Event Analysis: A New Method Based on Twitter’s User Behaviors. Inf. Technol. Control 2023, 52, 457–470.
- Karayiğit, H.; Akdagli, A. BERT-based Transfer Learning Model for COVID-19 Sentiment Analysis on Turkish Instagram Comments. Inf. Technol. Control 2022, 51, 409–428.
- Aldjanabi, W.; Dahou, A.; Al-Qaness, M.A.A.; Elaziz, M.A.; Helmi, A.M.; Damaševičius, R. Arabic offensive and hate speech detection using a cross-corpora multi-task learning model. Informatics 2021, 8, 69.
- Gunasekar, M.; Thilagamani, S. Improved Feature Representation Using Collaborative Network for Cross-Domain Sentiment Analysis. Inf. Technol. Control 2023, 52, 100–110.
- Liang, S.; Jin, J.; Du, W.; Qu, S. A Multi-Channel Text Sentiment Analysis Model Integrating Pre-training Mechanism. Inf. Technol. Control 2023, 52, 263–275.
- Tesfagergish, S.G.; Damaševičius, R.; Kapočiūtė-Dzikienė, J. Deep Fake Recognition in Tweets Using Text Augmentation, Word Embeddings and Deep Learning. In Computational Science and Its Applications – ICCSA 2021: Proceedings of the 21st International Conference, Cagliari, Italy, 13–16 September 2021; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Cham, Switzerland, 2021; Volume 12954, pp. 523–538.
- Yinka-Banjo, C.; Ugot, O.A.; Misra, S.; Adewumi, A.; Damasevicius, R.; Maskeliunas, R. Conflict resolution via emerging technologies? J. Phys. Conf. Ser. 2019, 1235, 12022.
- Kaur, G.; Pratibha; Kaur, A.; Khurana, M. A Review of Opinion Mining Techniques. ECS Trans. 2022, 107, 10125.
- Tesfagergish, S.G.; Damaševičius, R.; Kapočiūtė-Dzikienė, J. Deep Learning-Based Sentiment Classification of Social Network Texts in Amharic Language. Commun. Comput. Inf. Sci. 2022, 1740, 63–75.
- Maity, K.; Saha, S.; Bhattacharyya, P. Emoji, Sentiment and Emotion Aided Cyberbullying Detection in Hinglish. IEEE Trans. Comput. Soc. Syst. 2022, 10, 2411–2420.
- Srivastava, A.; Hasan, M.; Yagnik, B.; Walambe, R.; Kotecha, K. Role of artificial intelligence in detection of hateful speech for Hinglish data on social media. In Applications of Artificial Intelligence and Machine Learning: Select Proceedings of ICAAAIML 2020; Springer: Singapore, 2021; pp. 83–95.
- Kukkar, A.; Mohana, R.; Sharma, A.; Nayyar, A.; Shah, M.A. Improving Sentiment Analysis in Social Media by Handling Lengthened Words. IEEE Access 2023, 11, 9775–9788.
- Sasidhar, T.T.; Premjith, B.; Soman, K. Emotion detection in hinglish (hindi + english) code-mixed social media text. Procedia Comput. Sci. 2020, 171, 1346–1352.
- Gupta, R.; Srivastava, V.; Singh, M. MUTANT: A Multi-sentential Code-mixed Hinglish Dataset. arXiv 2023, arXiv:2302.11766.
- Tesfagergish, S.G.; Damaševičius, R.; Kapočiūtė-Dzikienė, J. Deep Learning-based Sentiment Classification in Amharic using Multi-lingual Datasets. Comput. Sci. Inf. Syst. 2023, 20, 1459–1481.
- Cui, J.; Wang, Z.; Ho, S.B.; Cambria, E. Survey on sentiment analysis: Evolution of research methods and topics. Artif. Intell. Rev. 2023, 56, 8469–8510.
- Tan, K.L.; Lee, C.P.; Lim, K.M. A Survey of Sentiment Analysis: Approaches, Datasets, and Future Research. Appl. Sci. 2023, 13, 4550.
- Chan, J.Y.L.; Bea, K.T.; Leow, S.M.H.; Phoong, S.W.; Cheng, W.K. State of the art: A review of sentiment analysis based on sequential transfer learning. Artif. Intell. Rev. 2023, 56, 749–780.
- Das, S.; Singh, T. Sentiment Recognition of Hinglish Code Mixed Data using Deep Learning Models based Approach. In Proceedings of the 13th International Conference on Cloud Computing, Data Science & Engineering (Confluence), Noida, India, 19–20 January 2023; pp. 265–269.
- Ledalla, S.; Rao, G.A.; Sesetti, A. Sentiment Analysis of Hinglish Reviews Using Hybrid Approaches. Int. J. Health Sci. 2022, 6, 5432–5445.
- Doğruöz, A.S.; Sitaram, S.; Bullock, B.E.; Toribio, A.J. A survey of code-switching: Linguistic and social perspectives for language technologies. arXiv 2023, arXiv:2301.01967.
- Ogunleye, B.; Maswera, T.; Hirsch, L.; Gaudoin, J.; Brunsdon, T. Comparison of Topic Modelling Approaches in the Banking Context. Appl. Sci. 2023, 13, 797.
- Jain, L.; Sharma, M.; Abdulsada, Z.R. Offensive Tweets Detection in Hinglish Using HingBERT. Int. Conf. Data Anal. Manag. 2023, 10, 93–103.
- Shevtsov, A.; Tzagkarakis, C.; Antonakaki, D.; Pratikakis, P.; Ioannidis, S. Twitter Dataset on the Russo-Ukrainian War. arXiv 2022, arXiv:2204.08530.
- Siapera, E.; Hunt, G.; Lynn, T. #GazaUnderAttack: Twitter, Palestine and diffused war. Inf. Commun. Soc. 2022, 22, 1297–1319.
- Chen, E.; Ferrara, E. Tweets in time of conflict: A public dataset tracking the twitter discourse on the war between Ukraine and Russia. arXiv 2022, arXiv:2203.07488.
- Smart, B.; Watt, J.; Benedetti, S.; Mitchell, L.; Roughan, M. #IStandWithPutin versus #IStandWithUkraine: The interaction of bots and humans in discussion of the Russia/Ukraine war. Soc. Inform. 2022, 13618, 34–53.
- Askasnr, S. End of US-Afghan War Tweet Data. 2012. Available online: https://www.kaggle.com/datasets/aska88/end-of-usafghan-war-tweet-data (accessed on 11 August 2021).
- Ashish, K.; Abhishek, M.; Ayush, A.; Rachna, J.; Monika, A. Sentiment Analysis on Multilingual Data: Hinglish. In International Conference on Data Analytics & Management; Springer: Berlin/Heidelberg, Germany, 2023; pp. 607–620.
- Agarwal, N.S.; Punn, N.S.; Sonbhadra, S.K. Exploring Public Opinion Dynamics on the Verge of World War III Using Russia-Ukraine War-Tweets Dataset; Knowledge Discovery and Data Mining-Undergraduate Consortium: Washington, DC, USA, 2022.
- Naz, H.; Ahuja, S.; Kumar, D.R. DT-FNN Based Effective Hybrid Classification Scheme for Twitter Sentiment Analysis. Multimed. Tools Appl. 2021, 80, 11443–11458.
- Staal, N. War of the Tweets: An Analysis of American and Russian Information Operations on Twitter following the August, 2013 Sarin Gas Massacre in Syria. Royal Military College of Canada, 2016. Available online: https://espace.rmc.ca/jspui/handle/11264/1041 (accessed on 1 February 2024).
- Chakravarthi, B.R. Hope speech detection in YouTube comments. Soc. Netw. Anal. Min. 2022, 12, 75.
- Bhatia, K.V. Hindu nationalism online: Twitter as discourse and interface. Religions 2022, 13, 739.
- Rastogi, S.; Bansal, D. Visualization of Twitter sentiments on Kashmir territorial conflict. Cybern. Syst. 2021, 52, 642–669.
- Srivastava, V.; Singh, M. Hinge: A dataset for generation and evaluation of code-mixed hinglish text. arXiv 2021, arXiv:2107.03760.
- Srivastava, V.; Singh, M. PHINC: A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation. arXiv 2020, arXiv:2004.09447.
- Kaur, G.; Kaur, A.; Khurana, M. A stem to stern sentiment analysis emotion detection. In Proceedings of the 2022 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 13–14 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–5.
- Alslaity, A.; Orji, R. Machine Learning Techniques for Emotion Detection and Sentiment Analysis: Current State, Challenges, and Future Directions. Behav. Inf. Technol. 2024, 43, 139–164.
- Ruytenbeek, N.; Decock, S.; Depraetere, I. Experiments into the influence of linguistic (in) directness on perceived face-threat in Twitter complaints. J. Politeness Res. 2023, 19, 59–86.
- Sharif, W.; Mumtaz, S.; Shafiq, Z.; Riaz, O.; Ali, T.; Husnain, M.; Choi, G.S. An empirical approach for extreme behavior identification through tweets using machine learning. Appl. Sci. 2019, 9, 3723.
- Ramesh, T.; Lilhore, U.K.; Poongodi, M.; Simaiya, S.; Kaur, A.; Hamdi, M. Predictive analysis of heart diseases with machine learning approaches. Malays. J. Comput. Sci. 2022, 132–148.
- ElKafrawy, P.; Mahgoub, A.; Atef, H.; Nasser, A.; Yasser, M.; Medhat, W.M.; Darweesh, M.S. Sentiment Analysis: Amazon Electronics Reviews Using BERT and Textblob. In Proceedings of the 20th International Conference on Language Engineering, Cairo, Egypt, 12–13 October 2022.
- Chuang, J.; Manning, C.D.; Heer, J. Termite: Visualization Techniques for Assessing Textual Topic Models. In Proceedings of the International Working Conference on Advanced Visual Interfaces, Capri Island, Italy, 21–25 May 2012; pp. 74–77.
- Sievert, C.; Shirley, K. LDAvis: A Method for Visualizing and Interpreting Topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, Baltimore, MD, USA, 29 June 2014.
- Pratibha; Kaur, A.; Khurana, M. Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis. 2023. Available online: https://data.mendeley.com/datasets/y63frd6pmf/3 (accessed on 29 December 2023).
- Verma, K.; Bhardwaj, S.; Arya, R.; Islam, U.; Bhushan, M.; Kumar, A.; Samant, P. Latest tools for data mining and machine learning. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 18–23.
Dataset Name | Source | Time Period | Number of Tweets | Language |
---|---|---|---|---|
Russo Ukrainian War2 [26] | Twitter/‘X’ | 22 February 2022 | 57,384,192 | English, Russian |
Gaza Conflict [27] | Twitter/‘X’ | 5 July to 26 August 2014 | 49,205,389 | English, Spanish, Indonesian, French |
War Between Ukraine and Russia [28] | Twitter/‘X’ | 22 February 2022 to 8 January 2023 | 454,488,445 | English, French, Italian, German |
Tweets discussing the Russia/Ukraine War [29] | Twitter/‘X’ | 23 February to 8 March 2022 | 5,203,764 | English |
End of US-Afghan War Tweet Data [30] | Twitter/‘X’ | 11 August 2021 to 27 August 2021 | 359,904 | English |
Ukraine-Russia [31] | Twitter/‘X’ | 22 February 2022 to 8 March 2022 | 63 million | English, French, Italian, German, Ukrainian |
Russia-Ukraine war-Tweets Dataset [32] | Twitter/‘X’ | 31 December 2021 to 3 March 2022 | 1,316,605 | English, Spanish, French, German |
Kunduz Madrassa attack by Afghan [33] | Twitter/‘X’ | 2 April 2018 to 8 April 2018 | 7500 | English |
Sarin Gas Massacre in Syria [34] | Twitter/‘X’ | August 2013 | 4 million | English |
Hope speech detection in YouTube comments [35] | YouTube | November 2019 to June 2020 | 59,354 | English, Tamil, and Malayalam |
Hindu Nationalism Online: Twitter/‘X’ as Discourse and Interface [36] | Twitter/‘X’ | December 2019 to May 2020 | 20,370,555 | English |
Kashmir Conflict [37] | Twitter/‘X’ | 2 December 2019 to 12 January 2020 | 60k | English |
HinGE A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [38] | Twitter/‘X’ | – | 10,731 | Hinglish |
PHINC A Parallel Hinglish Social Media Code-Mixed Corpus for Machine Translation [39] | Twitter/‘X’ Facebook | – | 13,738 | Hinglish |
Category | Seed Keywords That Can Be Attached to Event of Wars and Conflicts |
---|---|
Sentiments | Love, Hate, Fear, Hope, Anger |
Raw Emotions | Sadness, Joy, Disgust, Surprise |
Opinions | Support, Opposition, Indifference, Ambivalence |
Tweet ID | Tweet Text | Sentiment Category |
---|---|---|
1 | “I’m so scared of what’s going to happen in the world” | Fear |
2 | “The thought of war terrifies me” | Fear |
3 | “I can’t sleep at night thinking about the possibility of a world war” | Fear |
4 | “This is really freaking me out” | Fear |
5 | “I’m filled with dread about the state of the world” | Fear |
Cluster 1 Keywords | Data, code, doshi, Khud, alok, maanana. Task, different, domain, across. Mixing, Sambanndh, Asantosh, sentiment, text |
Possible Annotations | Data Integration and Sentiment Analysis (L1), Self-Reflection and Opinion (L2) |
Pragmatic Analysis | Label L1 captures the core concept of combining data from different domains, potentially for sentiment analysis or opinion mining. The keywords “data”, “code”, “mixing”, “across”, “domain”, “sentiment”, and “text” strongly suggest this theme. The cluster could represent discussions about techniques for integrating data from various sources to understand the sentiment or emotions expressed in text, including tools, challenges, or practical applications in this field. Label L2 reflects the presence of words like “khud” (self), “alok” (light, perspective), “maanana” (acceptance), “asantosh” (dissatisfaction), and “sentiment”, which could indicate discussions about personal opinions, self-reflection, and satisfaction or dissatisfaction with outcomes. Under L2, the cluster might contain tweets where people express opinions, reflect on their experiences, or share how satisfied they are with various topics. |
Cluster 2 Keywords | Shak, dil, virahit, say umeed, ko, suhkriya, doshi, dishearted, guilt data, kal, discontent, shivering, infuriating, linguistics |
Possible Annotations | Emotional Linguistics (L3), Discontent (L4) |
Pragmatic Analysis | The presence of words like “shak” (doubt), “dil” (heart), “virahit” (deprived), “umeed” (hope), “dishearted”, “guilt”, “discontent”, and “infuriating” suggests a strong emotional context. These words collectively represent various sentiments and emotional states, indicating that cluster 2 may involve texts dealing with feelings, emotional expressions, or discussions around emotional topics. The term “emotional linguistics” (L3) suggests an analytical or structured discussion around language, possibly in the context of these emotional expressions. Several words, such as “dishearted”, “guilt”, “discontent”, and “infuriating”, specifically point to negative emotions or states of dissatisfaction. This indicates that cluster 2 may represent discussions or texts that revolve around themes of regret, anger, frustration, or general dissatisfaction. Hence, the label Discontent (L4). |
Cluster 3 Keywords | ninda, ashanka, ashirwad, daya, ruffeled, achha, vibram, feathers, mad, heartendness, deep, tantrum mad, bitter, process. |
Possible Annotations | Emotional Turmoil (L5), Reflections (L6) |
Pragmatic Analysis | Words like “ninda” (criticism or condemnation), “ashanka” (doubt or suspicion), “mad” (angry or intense), “ruffled”, “tantrum”, and “bitter” indicate strong negative emotions or states of disturbance. These terms suggest discussions or expressions of conflict, upset, or emotional unrest. Hence, Emotional Turmoil (L5) is best suited. On the other hand, words like “ashirwad” (blessings), “daya” (compassion or mercy), “achha” (good), and “heartendness” (perhaps a misspelling or variation of ‘heartedness’ or ‘heartening’) can imply a more reflective or positive aspect. This duality of negative and positive emotional language indicates a complex emotional landscape, perhaps reflecting on both the turmoil and the more compassionate or hopeful aspects of the human experience. Hence, the annotation Reflections (L6). |
Cluster 4 Keywords | pashtaap, prayaschit, koti, sunehra, fumming, funereal, ghabrahat, naman, chill, agony, creeping, anguish, humbeled, words sullen, |
Possible annotations | Contemplative Remorse (L7), Solemnity (L8) |
Pragmatic Analysis | Keywords such as “pashtaap” (regret or remorse), “prayaschit” (atonement or penance), and “fumming” (a variant of “fuming”, indicating anger or frustration) suggest a deep sense of reflection on past actions or emotions. This reflection is typically associated with feelings of guilt, regret, or a desire to make amends, indicative of a contemplative and remorseful state. Hence, the label Contemplative Remorse (L7). Words such as “funereal” (relating to a funeral or death), “ghabrahat” (anxiety or unease), “chill”, “agony”, “anguish”, and “sullen” (bad-tempered or gloomy) all contribute to a solemn or deeply serious tone. “Naman” (a gesture of respect or salutation) and “humbled” also suggest reverence or a subdued demeanor, often found in solemn or serious circumstances. Hence, the label Solemnity (L8). |
Cluster 5 Keywords | dukh, vismay, bhoot, krodh, mujbuti, sorrow, downcast, khed, deep, apology, good, creepy, gussa, prokop |
Possible annotations | Sorrow (L9), Indignation (L10) |
Pragmatic Analysis | Sorrow (L9): Words like “dukh” (sorrow), “sorrow” itself, “downcast”, and “khed” (regret or sorrow) clearly point towards a theme of sadness and regret. These words suggest discussions or expressions that revolve around personal grief, disappointment, or general sadness. Indignation (L10): Terms such as “krodh” (anger), “gussa” (anger), and “prokop” (fury or rage) indicate strong feelings of anger or annoyance. “Vismay” (wonder or surprise) can sometimes be associated with shock or disbelief that could lead to indignation, depending on the context. |
Cluster 6 Keywords | udaas, ahyankar, utsah, bharosha, downherted, bhavishya, pareshani, shakti, hona, umang, doshi, adhbut, honor, nayi, irate |
Possible annotations | Sadness (L11), Resilience (L12) |
Pragmatic Analysis | Sadness (L11): The cluster includes words that cover a wide range of emotions. “Udaas” (sad), “downhearted”, and “irate” suggest feelings of sadness and anger. In contrast, “utsah” (enthusiasm), “bharosha” (trust), “umang” (joy or enthusiasm), and “shakti” (strength) indicate positive emotions and qualities. Resilience and Hope (L12): Words like “bharosha”, “umang”, and “shakti” not only represent positive emotions but also suggest a sense of resilience and hope. “Bhavishya” (future) and “nayi” (new) reinforce this theme, indicating a forward-looking or hopeful perspective. |
Cluster 7 Keywords | dukhi, maafi, thanks, bhavuk, upset, prerena, dhundla, vinamarta, bhayanak, chiddhana, garmi, uproar, dreary, angrily sambhavana |
Possible annotations | Melancholic Stirrings (L13), Humility and Remorse (L14) |
Pragmatic Analysis | Melancholic Stirrings (L13): Words like “dukhi” (sad), “upset”, “bhavuk” (emotional), “dhundla” (blurry or unclear, often used metaphorically to represent confusion or lack of clarity), and “dreary” suggest a theme of sadness, emotional depth, and a general melancholic or downcast mood. “Angrily” and “uproar” indicate a disturbance or intense emotional reaction, adding to the sense of emotional stirrings. Humility and Remorse (L14): “Maafi” (forgiveness or apology), “vinamarta” (humility), and “thanks” imply a sense of remorse, gratitude, or humbleness. These terms suggest an acknowledgment of mistakes, appreciation for others, or a general attitude of humility and respect. |
Cluster 8 Keywords | nirasha, khushi, chinta, anukool, visvash, rona, sankalap, ekta, pragati, ashirvad, wathful, stormy, heavy, reproach, vyaakul |
Possible annotations | Mixed Emotions (L15), Resilience and Unity in Adversity (L16) |
Pragmatic Analysis | Mixed Emotions (L15): The cluster encompasses a range of positive and negative emotions, suggesting a focus on mixed feelings and experiences. It touches on themes of disappointment (“nirasha”), happiness (“khushi”), worry (“chinta”), trust (“visvash”), crying (“rona”), and reproach. This suggests a discourse that grapples with both positive and negative aspects of life. Resilience and Unity in Adversity (L16): The keywords in the cluster emphasise unity and collective strength in the face of challenges. The presence of words like “ekta” (unity), “visvash” (trust), and “anukool” (supportive) suggests a discourse that focuses on finding strength in togetherness and overcoming obstacles together. |
Cluster 9 Keywords | doah, dosh, apmaan, vivklit, khushi, laaz, sukoon, bhavana, bachaini, vidhva, blame, dekhbaal, frenyize, aas, sukoon |
Possible annotations | Turmoil (L17), Serenity (L18) |
Pragmatic Analysis | The label “Turmoil and Serenity in Self and Relationships” captures the essence of the cluster’s topics, suggesting that the texts or tweets likely involve discussions or expressions navigating emotional and moral complexities within oneself and in relation to others. It reflects the dynamic interplay between challenging and comforting emotions and situations, as well as the pursuit of understanding, peace, and resolution in personal and social contexts. Turmoil (L17): Words like “doah” (doubt or blame), “dosh” (fault or blame), “apmaan” (insult), “vivklit” (perplexed), “bachaini” (restlessness), and “frenzyize” (likely a misspelling or variation of “frenzied”, indicating chaotic or wild behavior) suggest a state of emotional and moral turmoil. These terms indicate discussions or expressions of conflict, guilt, agitation, or blame. Serenity (L18): In contrast, words like “khushi” (happiness), “laaz” (shame, but can also imply honor in certain contexts), “sukoon” (peace or tranquility), and “aas” (hope) reflect a more positive or peaceful emotional state. This indicates a movement or desire towards tranquility, contentment, and positive emotional and moral states. |
Cluster 10 Keywords | utsaah, shama, ullas, santusti, jazbaat, ashvasan, samaghdari, gratitude, mukti, chidhaavat, human, influriation, inconsolable, Connection |
Possible annotations | Positive Emotional Dynamics (L19), Gratitude and Liberation (L20) |
Pragmatic Analysis | Positive Emotional Dynamics (L19): Words like “utsaah” (enthusiasm), “shama” (forgiveness or patience), “ullas” (joy), “santusti” (satisfaction), “jazbaat” (emotions), and “ashvasan” (assurance) denote positive emotional states and qualities. These terms suggest expressions of joy, emotional richness, and a sense of fulfilment or contentment. Gratitude and Liberation (L20): The inclusion of “gratitude” and “mukti” (liberation or freedom) adds layers of thankfulness and the concept of emotional or spiritual freedom. This implies discussions or expressions that revolve around being grateful, finding peace, or achieving a sense of liberation. |
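
The keyword-to-label reasoning in the rows above can be pictured as a simple lookup. The following is only a minimal sketch under stated assumptions: the indicator words are copied from the Cluster 5 and Cluster 9 analyses, but the mapping `LABEL_KEYWORDS`, the function `candidate_labels`, and the sample tweet are hypothetical illustrations, not the authors' annotation procedure, which relies on manual pragmatic analysis.

```python
import re

# Indicator keywords copied from the cluster analyses above. The mapping is an
# illustrative assumption, not the authors' annotation procedure.
LABEL_KEYWORDS = {
    "Sorrow (L9)": {"dukh", "sorrow", "downcast", "khed"},
    "Indignation (L10)": {"krodh", "gussa", "prokop"},
    "Turmoil (L17)": {"dosh", "apmaan", "bachaini"},
    "Serenity (L18)": {"khushi", "sukoon", "aas"},
}


def candidate_labels(tweet: str) -> list[str]:
    """Return every label whose indicator keywords occur in the tweet."""
    # Lowercase and split on non-letters so punctuation does not block a match.
    tokens = set(re.findall(r"[a-z]+", tweet.lower()))
    return [label for label, words in LABEL_KEYWORDS.items() if tokens & words]


# Hypothetical romanised Hinglish tweet, used only to exercise the lookup.
print(candidate_labels("Itna gussa aur dukh kyun?"))
```

A lookup of this kind only proposes candidate labels; deciding which label actually fits a tweet still depends on context, as the analyses above note for words such as “vismay” and “laaz”.
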
Serial Number | Variable | Description |
---|---|---|
1 | Number of Search Filters | 500 |
2 | Total Tweets | 10,040 |
3 | Fields in each Tweet | Tweet ID, Tweet, Retweet |
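
As a rough illustration of the three fields listed above, the sketch below reads records into a small data class. It assumes a CSV export with the column headers `Tweet ID`, `Tweet`, and `Retweet`; the file format, the `TweetRecord` class, the `load_tweets` helper, and the sample row are all hypothetical, since the source does not state how the dataset is serialised or whether `Retweet` is a count or a flag.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TweetRecord:
    tweet_id: str
    text: str
    retweet: str  # kept as raw text: the source does not say count or flag


def load_tweets(handle) -> list[TweetRecord]:
    """Read tweet records from an open CSV handle with the assumed headers."""
    reader = csv.DictReader(handle)
    return [
        TweetRecord(row["Tweet ID"], row["Tweet"], row["Retweet"])
        for row in reader
    ]


# Hypothetical one-row sample standing in for the real dataset file.
sample = io.StringIO("Tweet ID,Tweet,Retweet\n1,example text,0\n")
records = load_tweets(sample)
print(records[0])
```

Keeping every field as a string avoids guessing at types; a reader working with the actual files can tighten the types once the format is known.
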
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pratibha; Kaur, A.; Khurana, M.; Damaševičius, R. Multimodal Hinglish Tweet Dataset for Deep Pragmatic Analysis. Data 2024, 9, 38. https://doi.org/10.3390/data9020038