1. Introduction
Presidential elections represent the pinnacle of the political process, serving as the decisive mechanism through which the leaders responsible for shaping state policies in the forthcoming term are selected. Public enthusiasm during a presidential election is always in the spotlight and triggers intense debates on political ideology, social and economic issues, dissatisfaction with the previous government, identity and religious issues, social media, and disinformation. In 2024, Indonesia held simultaneous general elections, and more than two hundred million voters went to the polling stations on 14 February 2024 to elect a President. Politically interested social media users share and seek information about politics on social media [1]. Political campaigns have exploited the wide range of information available on various social media platforms to gain insight into user opinions and thereby design campaign strategies. The huge investments made by politicians in social media campaigns just before general elections, along with the arguments and debates between their supporters and opponents, only strengthen the claim that the views and opinions posted by users influence the outcome of elections [2], as evidenced by the rapid growth of studies on the impact of social media on presidential elections in Indonesia, where “Indonesian author’s studies on the presidential election in social media have experienced a rapid increase in recent years” [3].
Sentiment analysis is a branch of text mining that studies opinions, emotions, feelings, attitudes, and evaluations expressed in the form of text [4,5]. In this research, we used YouTube for sentiment analysis of the 2024 Indonesian presidential and vice-presidential debates. YouTube is a popular platform among Indonesians. According to Global Media Insight, Indonesia ranks fourth globally in terms of number of YouTube users. Therefore, the YouTube platform can be regarded as a valuable medium for collecting public opinion on the 2024 elections. YouTube comments can capture immediate reactions from respondents or viewers without the need for real-time prompting [6].
Figure 1 shows YouTube search trends for the term “Pemilu”, which means election. At the beginning of the period, the search trend showed a low score of 10. However, as time passed, the trend increased significantly. It peaked in January 2024 with a score of 98, which signaled the peak of search popularity in this category. User comments, responses, and interactions on the YouTube platform can provide insight into views and opinions regarding presidential candidates, political issues, and election results.
One type of data that can be extracted from YouTube is comments [7]. The next step after extracting the data is to perform text mining using sentiment analysis. Sentiment analysis is conducted to detect and quantify emotional expressions in comments, including anger, anticipation, disgust, joy, fear, sadness, surprise, and trust. Classifying comments into these eight emotions is important because emotion provides specific insight into how the public feels about and reacts to a particular issue or candidate. These classifications help political parties and candidates understand the public’s emotional landscape more comprehensively. By identifying various emotions, it is possible to uncover not just positive or negative sentiments but also the intensity and nature of public reactions. For example, anger or disgust can indicate dissatisfaction or rejection, while joy or trust can reflect support or confidence. Fear or sadness might highlight concerns or apprehension, while anticipation and surprise may signal engagement and expectations. In this study, multi-label classification is applied to a sentiment analysis of YouTube comments on the 2024 presidential candidate debates, offering a fresh perspective compared to traditional classification methods. Because a single comment can carry several emotions at once, multi-label emotion classification allows for a more nuanced and comprehensive understanding of public sentiment toward the candidates. With eight emotions, it also allows for more detailed and accurate sentiment analysis compared to binary or ternary classification [8,9,10]. Emotional sentiment analysis not only enables the exploration of public opinions but also influences the formation of public opinions towards the candidates [11]. Understanding the full spectrum of emotional reactions allows for the development of more targeted strategies, addressing public concerns and improving campaign effectiveness.
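As an illustration of what a multi-label target looks like, the following minimal sketch encodes the eight emotions named above as a 0/1 vector, so that one comment can carry several emotions at once. The label order and the helper names `to_multi_hot` and `from_multi_hot` are assumptions made for this example, not part of the study's pipeline.

```python
# The eight emotion labels named in the text, in a fixed (assumed) order.
EMOTIONS = ["anger", "anticipation", "disgust", "joy",
            "fear", "sadness", "surprise", "trust"]


def to_multi_hot(labels):
    """Encode a collection of emotion names as a 0/1 vector over EMOTIONS."""
    unknown = set(labels) - set(EMOTIONS)
    if unknown:
        raise ValueError(f"unknown emotion labels: {sorted(unknown)}")
    return [1 if emotion in labels else 0 for emotion in EMOTIONS]


def from_multi_hot(vector):
    """Decode a 0/1 vector back into the list of active emotion names."""
    return [emotion for emotion, flag in zip(EMOTIONS, vector) if flag]


# Hypothetical comment annotated with two emotions at once.
example = to_multi_hot({"joy", "trust"})
print(example)                  # [0, 0, 0, 1, 0, 0, 0, 1]
print(from_multi_hot(example))  # ['joy', 'trust']
```

A binary or ternary scheme would force this comment into a single class, whereas the vector keeps both emotions.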
Emotion classification can be performed using deep learning algorithms, such as Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and a combination of CNN and Bi-LSTM. CNN is suitable for this research because it can extract local features from complex data, which fits multi-label modeling. Modified CNN models for multi-label classification demonstrate varying degrees of effectiveness depending on the number of labels and dataset complexity. These CNNs, which incorporate word embeddings followed by convolutional and dense layers, are employed to tackle challenges in extreme multi-label text classification problems [12]. Long Short-Term Memory (LSTM) is also applicable to multi-label data because it can capture long-term dependencies and identify context and relationships within text [13]. LSTM models have been effectively applied to enhance aspect-based sentiment analysis, significantly boosting accuracy by capturing context-dependent sentiment [14]. Bi-LSTM, an advanced version of LSTM, is crucial in addressing challenges in multi-label text classification. By integrating two LSTMs into a single model, Bi-LSTM can capture contextual information from both forward and backward directions [15]. The CNN Bi-LSTM model, in turn, combines CNN’s ability to capture local patterns with Bi-LSTM’s ability to capture context [16]. The hybrid Bi-LSTM CNN model is considered to enhance text classification accuracy by integrating the LSTM and CNN models, along with attention mechanisms [17].
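The two ideas combined in the hybrid model can be illustrated with a deliberately simplified sketch in plain Python. It assumes one-dimensional token embeddings, a single convolution filter, and a plain recurrent cell standing in for an LSTM. Feeding the convolution output into the bidirectional pass is one common arrangement and is an assumption here, as are all function names and weights.

```python
import math


def conv1d(sequence, kernel):
    """Slide `kernel` over `sequence`; each output is one local feature."""
    width = len(kernel)
    return [
        sum(k * x for k, x in zip(kernel, sequence[i:i + width]))
        for i in range(len(sequence) - width + 1)
    ]


def recurrent_pass(sequence, w_in=0.5, w_rec=0.5):
    """Plain recurrent pass (a stand-in for an LSTM cell, not an LSTM)."""
    state = 0.0
    for x in sequence:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_in * x + w_rec * state)
    return state


def bidirectional_summary(sequence):
    """Read the sequence forwards and backwards, then concatenate."""
    return [recurrent_pass(sequence), recurrent_pass(sequence[::-1])]


# Hypothetical one-dimensional "embeddings" for a five-token comment.
tokens = [0.1, 0.9, -0.4, 0.3, 0.7]
local_features = conv1d(tokens, kernel=[1.0, -1.0])
summary = bidirectional_summary(local_features)
print(len(local_features), len(summary))  # 4 2
```

In a real model, the two-element summary would be passed to a dense layer with one sigmoid output per emotion.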
Therefore, the key research question explored in this study is how effectively different deep learning models classify multi-label emotions in YouTube comments related to the 2024 Indonesian presidential and vice-presidential debates. This research is limited in its choice of classification models, as only traditional models such as CNN, LSTM, and Bi-LSTM are used. However, the models in this study can serve as a baseline for future research on politics in Indonesia, especially in the context of general elections.
2. Systematic Literature Review
In conducting this research, a systematic literature review (SLR) was employed to ensure that all relevant studies were identified, screened, and evaluated comprehensively. This approach was used to gather, evaluate, and synthesize existing research on the research topic, ensuring that the conclusions drawn were based on the most robust evidence available [18]. Following the guidelines of the SLR methodology in Figure 2, this review was divided into several stages, including study identification, screening, eligibility assessment, and inclusion in the final synthesis.
The initial phase involved importing 604 studies from the Scopus database. These studies were collected as potential sources for this research, forming the basis of the screening process. During this stage, sixteen duplicate studies were identified and removed to avoid redundant analysis. Following the identification stage, five hundred studies remained for further screening. The screening process involved evaluating the relevance of these studies based on predefined inclusion and exclusion criteria. Studies that did not meet the inclusion criteria were removed. As a result, 297 studies were deemed irrelevant, while 203 studies were considered relevant for further assessment. After the screening phase, the 203 remaining full-text studies were assessed for eligibility. This phase ensured that each study was thoroughly reviewed based on its content, research findings, and relevance to the research topic. During this process, 192 studies were excluded for reasons such as being off topic, presenting unsuitable results, or lacking focus on the relevant research area.
Eleven studies were included in the final synthesis. These studies were considered to meet all the inclusion criteria and provided substantial contributions to understanding the research area. The final pool of studies forms the basis of the discussion and analysis presented in the subsequent sections of this paper. By employing systematic literature review methodology, this research ensures a structured and comprehensive analysis of the existing literature, thereby strengthening the conclusions and providing a well-founded basis for the exploration of the research area.
Based on the extraction shown in Table 1 of the eleven studies included through the systematic literature review (SLR) process, this research offers several key contributions that distinguish it from previous studies.
While earlier research explored various approaches to sentiment analysis and multi-label classification using machine learning and deep learning techniques, this study specifically focuses on multi-label sentiment analysis for emotion classification related to the Indonesian presidential election. Previous studies, such as Wisnubroto (2022) [19] and Mandhasiya (2022) [20], concentrated on developing models for opinion-based classification, while Jabreel (2019) [21] delved into deep learning techniques for different predictive tasks. Additionally, the work by He (2018) [23] examined binary techniques for complex data classification.
However, the novelty of this research lies in its specific application of multi-label emotion classification using deep learning models like CNN, Bi-LSTM, and a hybrid CNN-BiLSTM model, tailored for sentiment analysis in the context of Indonesian presidential candidates. This study leverages a unique dataset consisting of online reviews and comments directly related to the political contest in Indonesia. In contrast, previous studies, such as those by Tripto (2018) [24] and Samy (2018) [25], did not specifically focus on multi-label approaches or the political domain, let alone in the context of the Indonesian presidential election.
The prominent difference between this research and the papers listed in the systematic literature review (SLR) table is the use of a multi-label sentiment analysis approach in the context of the 2024 Indonesian presidential election, implemented through a combination of deep learning models (CNN, Bi-LSTM, CNN-BiLSTM). Most previous studies included in the table focus more on traditional algorithms such as Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM), and do not explicitly employ a multi-label approach, instead opting for binary or multi-class classification. Some studies also adopt deep learning models like LSTM and BERT; however, they do not implement the complexity of multi-label analysis with various emotions and candidates as applied in this research.
The primary novelty of this research lies in the use of CNN-BiLSTM, which combines the strengths of convolutional networks and long short-term memory networks to detect more complex patterns in YouTube comments related to the 2024 election. While transformer models like BERT leverage attention mechanisms to capture contextual relationships and dependencies across sequences, the CNN and Bi-LSTM models used in this research focus on extracting spatial features and capturing long-term dependencies, respectively. This hybrid approach allows for a more nuanced analysis of sentiment by incorporating both local and temporal contexts. Furthermore, this research achieves a very high AUC of 0.91, indicating superior model performance in classifying multi-label sentiments regarding presidential candidates. Other approaches in prior studies have yet to implement multi-label analysis with such high precision for similar scenarios. Thus, the main novelty of this research is the application of a multi-label sentiment analysis method using more complex deep learning models and superior performance evaluation results compared to previous studies.
While this research also utilizes multi-label approaches as well as traditional deep learning models such as CNN, LSTM, and Bi-LSTM that have been explored previously, the political context of Indonesia, specifically the Indonesian presidential election and the use of YouTube for data collection, presents unique challenges. This research not only provides an analysis of the emotional reactions of voters, but also provides information for formulating more effective campaign approaches within the political dynamics of Indonesia.
Therefore, the novelty of this research is highlighted not only by the methodology and the combination of deep learning models employed but also by the specific application of these methods to political sentiment analysis in Indonesia. This research aims to provide new insights and a more accurate model for analyzing political sentiment in the digital age, especially in the context of elections. This study focuses on the local political context of the 2024 Indonesian presidential election and employs a multi-label classification of emotions, ensuring that there is no duplication with previous research in this area.
4. Results and Discussion
The Results and Discussion section provides a comprehensive evaluation of the three models: CNN, Bi-LSTM, and CNN Bi-LSTM. An analysis based on evaluation metrics, such as accuracy, precision, recall, F1-score, AUC, and Hamming loss, is provided to evaluate each model’s performance in multi-label sentiment classification. This section discusses the strengths and weaknesses of each model, offering deeper insights into the effectiveness of deep learning approaches for sentiment analysis in the context of the Indonesian presidential election. This comparison aims to offer valuable guidance for selecting the most suitable model for similar applications in the future.
4.1. Confusion Matrix
The confusion matrix is an evaluation tool that displays the comparison between model predictions and actual labels in the form of a matrix. This matrix shows the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each class. The confusion matrix helps in understanding the distribution of prediction errors and how well the model classifies each class.
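In a multi-label setting, these four counts are usually kept separately for each label. The sketch below shows one way to compute them from 0/1 label matrices. The function name `per_label_confusion` and the toy data are assumptions for illustration only.

```python
def per_label_confusion(y_true, y_pred):
    """Count TP, TN, FP, and FN separately for each label column."""
    n_labels = len(y_true[0])
    counts = [{"TP": 0, "TN": 0, "FP": 0, "FN": 0} for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t == 1 and p == 1:
                counts[j]["TP"] += 1
            elif t == 0 and p == 0:
                counts[j]["TN"] += 1
            elif t == 0 and p == 1:
                counts[j]["FP"] += 1
            else:  # t == 1 and p == 0
                counts[j]["FN"] += 1
    return counts


# Hypothetical data: three comments, two labels (for example joy and trust).
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 1], [0, 0], [1, 1]]
for j, c in enumerate(per_label_confusion(y_true, y_pred)):
    print(j, c)
# 0 {'TP': 2, 'TN': 1, 'FP': 0, 'FN': 0}
# 1 {'TP': 1, 'TN': 0, 'FP': 1, 'FN': 1}
```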
4.3. Hamming Loss
Hamming loss is a metric utilized to assess the model prediction error rate in multi-label classification problems. It measures the proportion of incorrect labels to the total, ideal for scenarios requiring individual label evaluation. This characteristic allows for an understanding of not only the overall performance but also the identification of specific labels that the model may be misclassifying. By computing the proportion of incorrectly predicted labels to the total labels, Hamming loss facilitates a detailed analysis, which is especially valuable in multi-label contexts where each label holds distinct importance.
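The definition above can be written as a short function: count every label position where prediction and ground truth disagree, then divide by the number of comments times the number of labels. This is a minimal sketch with hypothetical data, not the evaluation code used in the study.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth differ."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    mismatches = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return mismatches / (n_samples * n_labels)


# Hypothetical example: two comments, four labels each.
y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
y_pred = [[1, 0, 1, 1], [0, 0, 0, 0]]
print(hamming_loss(y_true, y_pred))  # 0.25 (2 wrong labels out of 8)
```

A value of 0 means every label of every comment was predicted correctly, and lower values are better.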
In multi-label classification, Hamming loss is particularly pertinent because it evaluates errors on each predicted label, providing deeper insights into areas where model performance may be deficient. The Hamming Loss vs. Threshold graph shows the relationship between the threshold value used in the Bi-LSTM model and the model’s performance as measured by the Hamming loss metric. The Y-axis in this graph shows the Hamming loss value, indicating how often the model mispredicts labels. The X-axis represents the probability threshold above which a label is assigned to a comment.
The evaluation of Hamming loss in the CNN, Bi-LSTM, and CNN-BiLSTM models shows a consistent pattern in determining the optimal threshold. In the CNN model, as shown in Figure 10, the Hamming loss is at a high value of 0.2112 when the threshold is at a low point of 0.1. This shows that at low thresholds, the model tends to predict labels incorrectly. As the threshold increases, the Hamming loss decreases until it reaches a low of 0.1000 at a threshold of 0.5, signaling the best performance of the model. However, once the threshold exceeds 0.5, the Hamming loss increases again, indicating that correct predictions start to be missed. The Bi-LSTM model, as can be seen in Figure 10, shows similar results. At a threshold of 0.1, the Hamming loss is at a high value of 0.1397. As the threshold increases, the Hamming loss decreases significantly and reaches its lowest point of 0.0816 at a threshold of 0.5. This shows the best performance of the Bi-LSTM model, with the lowest prediction error at this point. As with the CNN model, once the threshold exceeds 0.5, the Hamming loss increases again, indicating that correct predictions start to be missed.
In the CNN-BiLSTM model, the Hamming loss evaluation shown in Figure 10 also follows a similar pattern. At a low threshold of 0.1, the Hamming loss is at 0.2283, which indicates many prediction errors. As the threshold increases, the Hamming loss decreases significantly and reaches the lowest value of 0.1107 at a threshold of 0.5. After this threshold is exceeded, the Hamming loss rises again, indicating that the correct predictions start to decrease. Based on the results of this evaluation, a threshold of around 0.5 consistently provides the lowest Hamming loss value for all three models. This indicates that it is the optimal point for the models to predict labels with the lowest error rate. If the threshold is set too low or too high, the Hamming loss increases, which means more prediction errors occur.
Overall, choosing the right threshold, which is around 0.5, is very important to minimize Hamming loss and improve accuracy in multi-label classification. All three models show their best performance in predicting labels at this threshold, although the Bi-LSTM model shows the lowest Hamming loss among the three.
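The threshold search described above can be sketched as a simple sweep: binarize the per-label probabilities at each candidate threshold, compute the Hamming loss, and keep the threshold with the smallest value. The probabilities, the threshold grid, and the function names below are hypothetical and do not reproduce the reported figures.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth differ."""
    total = len(y_true) * len(y_true[0])
    wrong = sum(t != p for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr))
    return wrong / total


def binarize(probabilities, threshold):
    """Assign a label whenever its probability reaches the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def sweep_thresholds(y_true, probabilities, thresholds):
    """Return a {threshold: Hamming loss} mapping for the given grid."""
    return {t: hamming_loss(y_true, binarize(probabilities, t)) for t in thresholds}


# Hypothetical sigmoid outputs for two comments and three labels.
y_true = [[1, 0, 1], [0, 1, 0]]
probs = [[0.8, 0.3, 0.6], [0.2, 0.7, 0.4]]
losses = sweep_thresholds(y_true, probs, [0.1, 0.3, 0.5, 0.7, 0.9])
best = min(losses, key=losses.get)
print(best, losses[best])  # 0.5 0.0
```

In this toy data, very low thresholds assign too many labels and very high thresholds assign too few, which gives the same U-shaped curve described in the text.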
Hamming loss is beneficial as it effectively captures the fraction of incorrect label predictions in multi-label scenarios, particularly highlighting issues with overlapping labels. However, it has limitations, such as ignoring True Positives and being sensitive to label imbalance, which can skew results. To address overlapping labels, strategies like adjusting the classification threshold or using additional metrics alongside Hamming loss can provide a more nuanced evaluation of model performance.
5. Deployment
In this section, the researchers describe the deployment stage of the Bi-LSTM model that has been evaluated and proven to show the best performance. The deployment process involves several important steps to ensure the model can be implemented effectively. The first step in deployment includes the preparation of the framework used such as Flask for website development and various machine learning libraries used by the Bi-LSTM model. This preparation is used to ensure that the entire deployment process can run well.
This deployment process is carried out using the Flask framework to build a web application that classifies new, unlabeled YouTube comments, as in Table 15, and provides output in the form of files containing the comments along with the appropriate labels. After the user uploads the XLSX file containing the new YouTube comments through the web interface, as shown in Figure 11, the file is stored in a dedicated directory to facilitate further processing. The application then reads the uploaded file and extracts the comments it contains. The comments are preprocessed, including tokenization and padding, to match the input format required by the Bi-LSTM model. The Bi-LSTM model then performs predictions for each comment and generates the corresponding labels.
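The tokenization and padding step can be illustrated with a minimal sketch in plain Python. The whitespace tokenizer, the reserved ids for padding and unknown words, the right-side padding, and the example comments are all assumptions. The deployed application may rely on a library tokenizer with different defaults.

```python
def build_vocabulary(comments):
    """Map each word to an integer id; 0 is padding and 1 is unknown."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for comment in comments:
        for word in comment.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def tokenize(comment, vocab):
    """Convert a comment into a list of word ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in comment.lower().split()]


def pad(sequence, max_len):
    """Right-pad with zeros, or truncate, so every input has max_len ids."""
    return (sequence + [0] * max_len)[:max_len]


# Hypothetical training comments used only to build the vocabulary.
vocab = build_vocabulary(["debat capres malam ini", "pemilu damai"])

# A new, unlabeled comment is mapped to a fixed-length id sequence.
ids = pad(tokenize("Debat pemilu seru", vocab), max_len=5)
print(ids)  # [2, 6, 1, 0, 0]
```

The word "seru" is not in the vocabulary, so it maps to the unknown id 1, and the two trailing zeros are padding.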
The resulting labels are added to the original file as a new column. The labeled file, as in Table 16, is then saved in the output directory. Users can download the labeled file through the Flask application web interface, as shown in Figure 12. Thus, the evaluated Bi-LSTM model can proceed to the deployment stage to process YouTube comments automatically and provide results in the form of labeled XLSX files. This web application makes it easy for users to upload unlabeled comments and obtain labeling results quickly and efficiently. After the labeling process is complete, the labeled dataset can be used to produce various visualizations for analyzing the sentiment of YouTube comments related to the presidential candidate debate.
Figure 13 shows the volume of conversation for each candidate, Anies, Prabowo, and Ganjar, based on YouTube comments from the Indonesian presidential candidate debate. The data visualization on this graph highlights the total number of comments that mention each candidate.
In Figure 13, Anies has the highest volume of conversation, with more than 10,000 mentions. This shows that Anies is the most talked about of the three candidates. This high volume of conversation can be caused by various factors such as popularity, controversy, or certain topics that attract public attention. Prabowo is in second place, with around 7500 mentions. Although lower than Anies, this number still shows significant interest in Prabowo. Meanwhile, Ganjar has the lowest volume of conversation, with fewer than 2000 mentions. This shows that Ganjar is the least talked about of the three candidates.
In Figure 14, trust is the most frequent emotion for candidate Anies, with more than 6000 mentions. This indicates that many commenters have a high level of trust in Anies. The emotion of joy is in second place, with more than 5000 mentions, indicating that Anies received many positive and joyful comments. The emotion of anticipation has around 2500 mentions, indicating the hope or expectation of the public towards Anies. Meanwhile, negative emotions such as anger, disgust, fear, sadness, and surprise have a lower number of mentions.
Prabowo also shows dominance in the emotion of trust with around 4000 mentions, indicating a high level of trust from the public. The emotion of joy is also significant with around 3000 mentions, indicating strong positive support. The emotion of anticipation appears at around 2000 mentions, indicating hope or expectation from the public towards Prabowo. Negative emotions such as anger, disgust, fear, sadness, and surprise are also present in lower numbers, indicating some criticism or concerns that need to be addressed by Prabowo’s campaign team.
Candidate Ganjar has a lower volume of conversation but also shows dominance in the emotion of trust, with over 1000 mentions, much lower than the other two candidates. The emotion of joy appears in around 700 mentions, signaling positive support, albeit on a smaller scale. The emotions of anticipation, anger, disgust, fear, sadness, and surprise have lower numbers of mentions but remain present, indicating that although Ganjar received fewer comments, these emotions still appear in public discussion.
Overall, Anies has the highest number of positive mentions, with over 6000 for trust and over 5000 for joy, indicating very strong support from the public. Prabowo also has significant levels of trust and joy, but lower than Anies. Ganjar, although with a lower volume of conversation, still shows dominance in the emotions of trust and joy, signaling support, although not as great as for the other two candidates. Negative emotions are present in lower numbers for all candidates and indicate areas of concern to address in order to improve public sentiment.
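Counts like those reported above can be obtained by tallying the predicted emotions per candidate in the labeled file. The following is a minimal sketch that assumes each labeled row has already been reduced to a candidate name and a list of predicted emotions. The rows shown are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def count_emotions(labeled_comments):
    """Tally how often each emotion is predicted for each candidate."""
    counts = defaultdict(Counter)
    for candidate, emotions in labeled_comments:
        counts[candidate].update(emotions)
    return counts


# Hypothetical labeled rows: (candidate mentioned, predicted emotions).
rows = [
    ("Anies", ["trust", "joy"]),
    ("Prabowo", ["trust"]),
    ("Anies", ["trust"]),
    ("Ganjar", ["joy", "anticipation"]),
]
counts = count_emotions(rows)
print(counts["Anies"].most_common(1))  # [('trust', 2)]
print(sum(counts["Ganjar"].values()))  # 2
```

Because the classification is multi-label, one comment can add to several emotion counts, so the per-emotion totals for a candidate may exceed that candidate's number of comments.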