Exploring Sentiment Analysis for the Indonesian Presidential Election Through Online Reviews Using Multi-Label Classification with a Deep Learning Algorithm

Ma’aly, Ahmad Nahid; Pramesti, Dita; Fathurahman, Ariadani Dwi; Fakhrurroja, Hanif

doi:10.3390/info15110705

by Ahmad Nahid Ma’aly ¹, Dita Pramesti ¹,*, Ariadani Dwi Fathurahman ¹ and Hanif Fakhrurroja ¹,²

¹ School of Industrial Engineering, Telkom University, Bandung 40257, West Java, Indonesia

² Research Center for Smart Mechatronics, National Research and Innovation Agency, Bandung 40135, West Java, Indonesia

* Author to whom correspondence should be addressed.

Information 2024, 15(11), 705; https://doi.org/10.3390/info15110705

Submission received: 29 September 2024 / Revised: 26 October 2024 / Accepted: 29 October 2024 / Published: 5 November 2024

(This article belongs to the Special Issue Machine Learning and Data Mining for User Classification)

Abstract

Presidential elections are important political events that often trigger intense debate. With more than 139 million users, YouTube serves as a significant platform for understanding public opinion through sentiment analysis. This study implements deep learning techniques for multi-label sentiment analysis of comments on YouTube videos related to the 2024 Indonesian presidential election. Offering a fresh perspective compared to previous research that primarily employed traditional classification methods, this study classifies comments into eight emotional labels: anger, anticipation, disgust, joy, fear, sadness, surprise, and trust. By focusing on the emotional spectrum, this study provides a more nuanced understanding of public sentiment towards presidential candidates. The CRISP-DM method is applied, encompassing the stages of business understanding, data understanding, data preparation, modeling, evaluation, and deployment, ensuring a systematic and comprehensive approach. This study employs a dataset comprising 32,000 comments, obtained via the YouTube Data API from the KPU and Najwa Shihab channels. The analysis is specifically centered on comments related to presidential candidate debates. Three deep learning models, namely a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and a hybrid model combining CNN and Bi-LSTM, are assessed using the confusion matrix, Area Under the Curve (AUC), and Hamming loss metrics. The evaluation results demonstrate that the Bi-LSTM model achieved the best performance, with an AUC value of 0.91 and a Hamming loss of 0.08, indicating a strong ability to classify sentiment with a low error rate. This approach to multi-label sentiment analysis in the context of the 2024 Indonesian presidential election expands the insights into public sentiment towards candidates, offering valuable implications for political campaign strategies. Additionally, this research contributes to the fields of natural language processing and data mining by addressing the challenges associated with multi-label sentiment analysis.

Keywords: deep learning; multi-label sentiment analysis; Indonesian 2024 presidential elections

1. Introduction

Presidential elections represent the pinnacle of the political process, serving as the decisive mechanism through which the leaders responsible for shaping state policies in the forthcoming term are selected. Public enthusiasm around a presidential election is always in the spotlight and triggers intense debates on political ideology, social and economic issues, dissatisfaction with the previous government, identity and religious issues, social media, and disinformation. In 2024, Indonesia held simultaneous general elections. More than two hundred million voters went to the polling stations on 14 February 2024 to elect a President. Politically interested social media users share and seek information about politics on social media [1]. Political campaigns have exploited the wide range of information available on various social media platforms to gain insight into user opinions and thereby design campaign strategies. The huge investments made by politicians in social media campaigns just before general elections, along with the arguments and debates between their supporters and opponents, strengthen the claim that the views and opinions posted by users influence the outcome of elections [2]. This is also evidenced by the rapid growth of studies on the impact of social media on presidential elections in Indonesia, where “Indonesian author’s studies on the presidential election in social media have experienced a rapid increase in recent years” [3].

Sentiment analysis is a branch of learning in the field of text mining that studies the analysis of opinions, emotions, feelings, attitudes, and evaluations expressed in the form of text [4,5]. In this research, we used YouTube for sentiment analysis of the 2024 Indonesian presidential and vice-presidential debates. YouTube is a popular platform among Indonesians. According to Global Media Insight, Indonesia ranks fourth globally in terms of number of YouTube users. Therefore, the YouTube platform can be regarded as a valuable medium for collecting public opinion on the 2024 elections. YouTube comments can capture immediate reactions from respondents or viewers without the need for real-time prompting [6].

Figure 1 shows YouTube search trends for the term “Pemilu”, which means “election”. At the beginning of the period, the search trend showed a low score of 10. However, as time passed, the trend increased significantly, peaking in January 2024 with a score of 98. User comments, responses, and interactions on the YouTube platform can provide insight into views and opinions regarding presidential candidates, political issues, and election results.

One type of data that can be extracted from YouTube is comments [7]. The next step after extracting the data is to perform text mining using sentiment analysis. Sentiment analysis is conducted to detect and quantify emotional expressions in comments, including emotions such as anger, anticipation, disgust, joy, fear, sadness, surprise, and trust. The classification of comments into these eight emotions is important because emotion provides specific insight into how the public feels and reacts to a particular issue or candidate. These classifications offer several benefits, including helping political parties and candidates understand the public’s emotional landscape more comprehensively. By identifying various emotions, it is possible to uncover not just positive or negative sentiments but also the intensity and nature of public reactions. For example, anger or disgust can indicate dissatisfaction or rejection, while joy or trust can reflect support or confidence. Fear or sadness might highlight concerns or apprehension, while anticipation and surprise may signal engagement and expectations. In this study, multi-label classification is applied to a sentiment analysis of YouTube comments on the 2024 presidential candidate debates, offering a fresh perspective compared to traditional classification methods. Multi-label emotion classification allows for a more nuanced and comprehensive understanding of public sentiment toward the candidates. This approach broadens the insights gained from sentiment analysis and reveals a wider spectrum of emotional reactions, offering more depth compared to traditional sentiment classifications. Multi-label classification, including the classification of eight emotions, allows for more detailed and accurate sentiment analysis compared to binary or ternary classification [8,9,10]. 
Emotional sentiment analysis not only enables the exploration of public opinions but also influences the formation of public opinions towards the candidates [11]. Understanding the full spectrum of emotional reactions allows for the development of more targeted strategies, addressing public concerns and improving campaign effectiveness.

Emotion classification can be performed using deep learning algorithms such as the Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and a combination of CNN and Bi-LSTM. CNN is suitable for this research because it can extract local features from complex data, which fits multi-label modeling. Modified CNN models for multi-label classification demonstrate varying degrees of effectiveness based on the number of labels and dataset complexity. These CNNs, which incorporate word embeddings followed by convolutional and dense layers, are employed to tackle challenges in extreme multi-label text classification problems [12]. LSTM is also applicable to multi-label data, with the capability to identify context and relationships within text. It can effectively capture long-term dependencies and understand the context in textual data [13]. LSTM models have been effectively applied to enhance aspect-based sentiment analysis, significantly boosting accuracy by capturing sentiment that is dependent on context [14]. Bi-LSTM, an advanced version of LSTM, is crucial in addressing challenges in multi-label text classification. By integrating two LSTMs into a single model, Bi-LSTM can capture contextual information from both forward and backward directions [15]. The CNN Bi-LSTM hybrid, in turn, combines CNN’s ability to capture local patterns with Bi-LSTM’s ability to capture context [16]. The hybrid Bi-LSTM CNN model is considered to enhance text classification accuracy by integrating the LSTM and CNN models, along with attention mechanisms [17].

Therefore, the key research question explored in this study is how effective different deep learning models are in classifying multi-label emotions in YouTube comments related to the 2024 Indonesian presidential and vice-presidential debates. This research is limited in its choice of classifiers, as only traditional models such as CNN, LSTM, and Bi-LSTM are used. However, these models can serve as a baseline for future research on politics in Indonesia, especially in the context of general elections.

2. Systematic Literature Review

In conducting this research, a systematic literature review (SLR) was employed to ensure that all relevant studies were identified, screened, and evaluated comprehensively. This approach was used to gather, evaluate, and synthesize existing research on research topics, ensuring that the conclusions drawn were based on the most robust evidence available [18]. Following the guidelines of the SLR methodology in Figure 2, this review was divided into several stages, including study identification, screening, eligibility assessment, and inclusion in the final synthesis.

The initial phase involved importing 604 studies from the Scopus database. These studies were collected as potential sources for this research, forming the basis of the screening process. During this stage, sixteen duplicate studies were identified and removed to avoid redundant analysis. Following the identification stage, five hundred studies remained for further screening. The screening process involved evaluating the relevance of these studies based on predefined inclusion and exclusion criteria. As a result, 297 studies were deemed irrelevant and removed, while 203 studies were considered relevant for further assessment. After the screening phase, these 203 studies were assessed in full text for eligibility. This phase ensured that each study was thoroughly reviewed based on its content, research findings, and relevance to the research topic. During this process, 192 studies were excluded for reasons such as being off topic, presenting unsuitable results, or lacking focus on the relevant research area, leaving eleven studies.

Eleven studies were included in the final synthesis. These studies met all the inclusion criteria and provided substantial contributions to understanding the research area. The final pool of studies forms the basis of the discussion and analysis presented in the subsequent sections of this article. By employing a systematic literature review methodology, this research ensures a structured and comprehensive analysis of the existing literature, thereby strengthening the conclusions and providing a well-founded basis for the exploration of the research area.

Based on the extraction of the eleven studies included through the systematic literature review (SLR) process, shown in Table 1, this research offers several key contributions that distinguish it from previous studies.

While earlier research explored various approaches to sentiment analysis and multi-label classification using machine learning and deep learning techniques, this study specifically focuses on multi-label sentiment analysis for emotion classification related to the Indonesian presidential election. Previous studies, such as Wisnubroto (2022) [19] and Mandhasiya (2022) [20], concentrated on developing models for opinion-based classification, while Jabreel (2019) [21] delved into deep learning techniques for different predictive tasks. Additionally, the work by He (2018) [23] examined binary techniques for complex data classification.

However, the novelty of this research lies in its specific application of multi-label emotion classification using deep learning models like CNN, Bi-LSTM, and a hybrid CNN-BiLSTM model, tailored for sentiment analysis in the context of Indonesian presidential candidates. This study leverages a unique dataset consisting of online reviews and comments directly related to the political contest in Indonesia. In contrast, previous studies, such as those by Tripto (2018) [24] and Samy (2018) [25], did not specifically focus on multi-label approaches or the political domain, let alone in the context of the Indonesian presidential election.

The prominent difference between this research and the papers listed in the systematic literature review (SLR) table is the use of a multi-label sentiment analysis approach in the context of the 2024 Indonesian presidential election, implemented through a combination of deep learning models (CNN, Bi-LSTM, CNN-BiLSTM). Most previous studies included in the table focus more on traditional algorithms such as Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM), and do not explicitly employ a multi-label approach, instead opting for binary or multi-class classification. Some studies also adopt deep learning models like LSTM and BERT; however, they do not implement the complexity of multi-label analysis with various emotions and candidates as applied in this research.

The primary novelty of this research lies in the use of CNN-BiLSTM, which combines the strengths of convolutional networks and long short-term memory networks to detect more complex patterns in YouTube comments related to the 2024 election. While transformer models like BERT leverage attention mechanisms to capture contextual relationships and dependencies across sequences, the CNN and Bi-LSTM models used in this research focus on extracting spatial features and capturing long-term dependencies, respectively. This hybrid approach allows for a more nuanced analysis of sentiment by incorporating both local and temporal contexts. Furthermore, this research achieves a very high AUC of 0.91, indicating superior model performance in classifying multi-label sentiments regarding presidential candidates. Other approaches in prior studies have yet to implement multi-label analysis with such high precision for similar scenarios. Thus, the main novelty of this research is the application of a multi-label sentiment analysis method using more complex deep learning models and superior performance evaluation results compared to previous studies.

Although the multi-label approach and traditional deep learning models such as CNN, LSTM, and Bi-LSTM have been explored previously, the political context of Indonesia, specifically the Indonesian presidential election and the use of YouTube for data collection, presents unique challenges. This research not only provides an analysis of voters’ emotional reactions but also provides information for formulating more effective campaign approaches within the political dynamics of Indonesia.

Therefore, the novelty of this research is highlighted not only by the methodology and the combination of deep learning models employed but also by the specific application of these methods to political sentiment analysis in Indonesia. This research aims to provide new insights and a more accurate model for analyzing political sentiment in the digital age, especially in the context of elections. This study focuses on the local political context of the 2024 Indonesian presidential election and employs a multi-label classification of emotions, ensuring that there is no duplication with previous research in this area.

3. Methodology and Implementation

The research method used in this study is the Cross-Industry Standard Process for Data Mining (CRISP-DM). This method provides a structured approach to planning and executing data mining tasks and is known for its adaptability across multiple sectors and data-intensive applications [29]. The Cross-Industry Standard Process for Data Mining (CRISP-DM) method is applied in the sentiment analysis research of YouTube comments for the 2024 Indonesian presidential election as shown in Figure 3. The CRISP-DM method is particularly suitable for this research because it emphasizes a comprehensive understanding of the problem domain, which is crucial for analyzing sentiments related to political candidates. Its iterative approach allows for continuous refinement of models based on feedback and evolving insights, making it ideal for the dynamic nature of public opinion analysis in the context of elections. Additionally, the structured phases of CRISP-DM facilitate effective collaboration among stakeholders and enhance the overall quality of the research outcomes.

The initial stage, business understanding, involves identifying the problem based on business objectives and integrating modeling objectives with business strategy. It focused on gathering insights into public opinion through YouTube comments from presidential candidate debates, such as those involving Anies Baswedan, Prabowo Subianto, and Ganjar Pranowo. Various machine learning models were tested to classify sentiments and select the best one for the automatic prediction of emotions related to each candidate.

Furthermore, data understanding entailed the systematic collection of data through YouTube APIs of key debates, ensuring that the data collected covered a wide range of perspectives and emotions arising from public comments. This process was critical to understanding the context and dynamics of evolving opinions during the campaign period. Finally, data preparation, modeling, evaluation, and deployment involved deep data processing, including normalization, the removal of unimportant characters, and sentiment labeling. The developed model was then evaluated for its effectiveness and, if valid, deployed for further predictions. These stages ensure that the sentiment analysis conducted is not only accurate but also relevant for strategic purposes in the presidential election.

3.1. Business Understanding

The CRISP-DM methodology begins with the “business understanding” phase. In this stage, information is gathered from business objectives, and a decision is made on how to align the goals of modeling with business targets [17]. Analyzing the emotional sentiment expressed in YouTube comments about the 2024 Indonesian presidential election plays a vital role in gaining insight into public opinion and shaping perspectives about the presidential candidates. This study compares different machine learning models and selects one for deployment. The selected model is used to make automatic predictions on new datasets, allowing for a comparison of emotional labels associated with each candidate pair.

3.2. Data Understanding

YouTube comments related to the presidential candidates Anies Baswedan, Prabowo Subianto, and Ganjar Pranowo serve as the data source for sentiment analysis. The data were systematically collected through the YouTube API from specific videos featuring these candidates during debate events. Comments were gathered from videos titled “Debat Capres dan Cawapres”, meaning “presidential and vice-presidential debate”, uploaded on both the Komisi Pemilihan Umum (KPU) and Najwa Shihab channels. The data collection spanned multiple dates between 12 December 2023 and 4 February 2024, including 22 December 2023, 7 January 2024, 21 January 2024, and 4 February 2024. The final dataset contains 32,000 entries, offering comprehensive coverage of a range of emotions and public opinions in the lead-up to the 2024 Indonesian presidential election [30].
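The collection step described above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' script: the function name `fetch_top_level_comments` and the `max_pages` limit are hypothetical, and `youtube` is assumed to be a YouTube Data API v3 client created elsewhere (for example with `googleapiclient.discovery.build("youtube", "v3", developerKey=...)`).

```python
def fetch_top_level_comments(youtube, video_id, max_pages=5):
    """Collect top-level comment texts for one video, following pagination.

    `youtube` is assumed to be a YouTube Data API v3 client object;
    `max_pages` is a hypothetical safety limit, not a value from the paper.
    """
    comments = []
    page_token = None
    for _ in range(max_pages):
        params = {
            "part": "snippet",
            "videoId": video_id,
            "maxResults": 100,        # largest page size the endpoint allows
            "textFormat": "plainText",
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.commentThreads().list(**params).execute()
        for item in response.get("items", []):
            snippet = item["snippet"]["topLevelComment"]["snippet"]
            comments.append(snippet["textDisplay"])
        page_token = response.get("nextPageToken")
        if not page_token:            # no further pages
            break
    return comments
```

Calling the function once per debate video and concatenating the results would give a raw comment list comparable in shape to the dataset described here.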

The large and diverse dataset provides a solid basis for training and validating the multi-label emotion classification model, with the variety of entries enhancing the model’s generalizability and improving its accuracy in predicting emotional responses [30].

In this study, 32,000 data points were taken from the nine most viewed videos by Indonesians on YouTube, all from the KPU RI and Najwa Shihab channels. KPU RI is the official channel of Indonesia’s General Elections Commission, while Najwa Shihab is an Indonesian journalist with a large following. The comment data taken from these nine videos are imbalanced and can cause representation bias. The total distribution of the comment data can be seen in Figure 4.

In Figure 4, the nine videos are labeled KPU or NS. KPU denotes a video from the KPU RI channel and NS denotes a video from the Najwa Shihab channel; the number after each label indicates the video number. The distribution shows that the Najwa Shihab channel has much more comment data than the KPU RI channel. Therefore, to avoid biased representation, class weights are used in the next step to balance the collected comment data.

3.3. Data Preparation

To ensure the quality and relevance of the data used in sentiment analysis, the raw data underwent a series of detailed preprocessing steps. These steps include:

Normalization and Labelling: With the help of GPT-3.5, sentiment labeling was performed to ensure the consistent classification of emotions for each data point. This ensures accuracy in the sentiment data labeling process. The prompt utilized for labeling the data with GPT-3.5 is illustrated in Figure 5. The timeline for data crawling spanned from December 2023 to February 2024, while the sentiment labeling process occurred in February 2024. GPT-3.5 provides numerous benefits, such as quicker response generation, reduced computational resource demands, and cost-effectiveness for various applications.
Unwanted Character Removal: Using regular expressions, unwanted elements such as URLs, special characters, and numbers were removed. These elements do not contribute meaningful information for sentiment analysis and can disrupt model interpretation.
Case Folding: All text was converted to lowercase to prevent duplicate entries caused by differences in capitalization. This step helps homogenize the data, making them easier for machine learning algorithms to process.
Stopwords Removal: Common words that do not carry significant weight for sentiment analysis, such as “and”, “or”, and “but”, were removed. This allows the analysis to focus on words that have higher emotional impact.
Text Abbreviation Correction: Abbreviated words were expanded to their full forms, ensuring that the analysis captures the complete meaning of each word and expression.
Text Augmentation: Text data augmentation involves techniques used in natural language processing to enrich the dataset by creating variations of the existing data. This may include applying transformations or modifications to expand the diversity of the data, which enhances the model’s ability to learn from a broader range of situations and generalize to new or unseen texts. By applying random swap and random deletion techniques, this step effectively increases the variation in the dataset. As a result, the dataset was expanded from 32,000 instances to 96,000 instances, which aids in better training of the machine learning model. The increase in data from these augmentation techniques allows the model to better recognize various patterns and variations in the text, thereby improving its performance in sentiment analysis tasks.

These preprocessing steps, which can be seen in Table 2, collectively refine the dataset, ensuring that it is well-prepared for accurate and insightful sentiment analysis. As a result, the effectiveness of the model is enhanced, leading to more precise outcomes.
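The cleaning and augmentation steps listed above can be sketched as below. This is a minimal illustration under stated assumptions: the stopword and abbreviation tables are tiny hypothetical examples (the paper does not list the ones it used), and producing one random-swap and one random-deletion variant per comment is only one plausible way to reach the reported threefold increase from 32,000 to 96,000 instances.

```python
import random
import re

# Hypothetical lookup tables for illustration only.
STOPWORDS = {"dan", "atau", "tetapi"}
ABBREVIATIONS = {"bgt": "banget", "gk": "tidak"}

def clean_comment(text):
    """Character removal, case folding, stopword removal, abbreviation expansion."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # drop URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)              # drop digits and symbols
    tokens = text.lower().split()                          # case folding
    tokens = [t for t in tokens if t not in STOPWORDS]     # stopword removal
    return [ABBREVIATIONS.get(t, t) for t in tokens]       # abbreviation correction

def random_swap(tokens, rng, n_swaps=1):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def augment(tokens, seed=0):
    """Return the original plus one swapped and one deletion variant (3x the data)."""
    rng = random.Random(seed)
    return [list(tokens), random_swap(tokens, rng), random_deletion(tokens, rng)]

tokens = clean_comment("Pak Anies bgt KEREN!!! https://t.co/x 123 dan mantap")
variants = augment(tokens)
```

Both augmentation functions leave the labels untouched, so each variant would inherit the emotion labels of the comment it was derived from.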

3.4. Pre-Trained Model

Before training the model, several pre-training procedures are necessary, including data tokenization and padding. Tokenization breaks down text into smaller units, such as words, numbers, or symbols, known as tokens. This step is vital for text analysis as it allows algorithms to process and understand each data element more efficiently. Padding involves adding specific values (typically zeros) to data sequences, ensuring uniform length across all entries. This step is essential for preparing data for machine learning models that require input with consistent dimensions.
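Tokenization and padding as described above can be illustrated with a small pure-Python sketch. The helper names are hypothetical; the sequence length of 50 and the leading zeros follow the worked example later in the text, where the input has a dimension of 50 and begins with zeros, and id 0 is assumed to be reserved for padding.

```python
def build_vocab(token_lists):
    """Assign each distinct word a positive integer id; 0 is kept for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode_and_pad(tokens, vocab, max_len=50):
    """Map words to ids, keep the last max_len ids, and left-pad with zeros."""
    ids = [vocab[t] for t in tokens if t in vocab]   # unknown words are skipped
    ids = ids[-max_len:]                              # truncate from the front
    return [0] * (max_len - len(ids)) + ids           # pre-padding

vocab = build_vocab([["debat", "capres"], ["capres", "mantap"]])
padded = encode_and_pad(["mantap", "debat"], vocab, max_len=5)
```

Here `padded` is `[0, 0, 0, 3, 1]`: every comment ends up with the same length, which is what the embedding layer requires.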

In this study, an imbalance in the number of training examples across different classes introduces a potential bias in the model. To address this issue, class weights are applied to the loss function. Some labels have very high frequencies, while others have very low frequencies. This is a clear indication of class imbalance that needs to be addressed for the model to learn effectively from all classes. Class weights are calculated based on the frequency of each class in the training data. Underrepresented classes are given greater weights, ensuring that the model pays more attention to these classes during training. These weights assign greater importance to misclassifications of underrepresented classes in the training data. The weights are determined based on the inverse frequency of each class, helping to enhance the model’s ability to classify all classes more equitably. By applying class weights, the model becomes more balanced and accurate in handling diverse classes. This approach is especially effective in improving classification performance on imbalanced datasets by distinguishing between majority and minority classes accurately [31,32].
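A minimal sketch of inverse-frequency weighting for the eight emotion labels is shown below. The exact formula is not given in the text; the one used here, `n_samples / (n_labels * count)`, is an assumption modeled on the common "balanced" heuristic, and the helper names are hypothetical.

```python
EMOTIONS = ["anger", "anticipation", "disgust", "joy",
            "fear", "sadness", "surprise", "trust"]

def to_multi_hot(labels):
    """Encode a set of emotion names as a 0/1 vector in EMOTIONS order."""
    return [1 if emotion in labels else 0 for emotion in EMOTIONS]

def inverse_frequency_weights(label_matrix):
    """Weight per label index: n_samples / (n_labels * count); rarer means larger.

    Labels that never occur get weight 0.0 (an arbitrary choice for this sketch).
    """
    n_samples = len(label_matrix)
    n_labels = len(label_matrix[0])
    counts = [sum(row[j] for row in label_matrix) for j in range(n_labels)]
    return {j: (n_samples / (n_labels * c) if c else 0.0)
            for j, c in enumerate(counts)}

y = [to_multi_hot(s) for s in [{"joy"}, {"joy", "trust"}, {"joy"}, {"anger"}]]
weights = inverse_frequency_weights(y)
```

In this toy example "joy" appears three times and "anger" once, so the weight for "anger" (0.5) is three times the weight for "joy" (about 0.17).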

3.5. Modelling

In this subsection, modeling is performed using three main architectures: CNN, Bi-LSTM, and a CNN Bi-LSTM combination. The purpose of this modeling is to evaluate and compare the performance of each model in the multi-label sentiment analysis task and to determine the most effective and best-performing model. The three models are trained and tested using the processed data, and their performance is measured using relevant evaluation metrics.
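One of the evaluation metrics named in the abstract, Hamming loss, can be computed as in the sketch below. The 0.5 decision threshold and the helper names are assumptions made for illustration; the text does not state how probabilities were converted to labels.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label sigmoid outputs into 0/1 predictions (threshold is assumed)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total

y_true = [[1, 0, 0, 1], [0, 1, 0, 0]]
y_prob = [[0.9, 0.2, 0.6, 0.7], [0.1, 0.4, 0.3, 0.2]]
loss = hamming_loss(y_true, binarize(y_prob))
```

In this toy case two of the eight label decisions are wrong, so `loss` is 0.25. By the same reading, a Hamming loss of 0.08 corresponds to roughly 8% of label decisions being incorrect.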

3.5.1. CNN Implementation

At this stage, the author builds and trains a Convolutional Neural Network (CNN) model. CNN is a type of artificial neural network specifically designed for processing data such as images and text. In the context of text processing, CNN is used to capture spatial features from text, such as phrases or word patterns that are important for sentiment analysis tasks. CNNs work by using convolutional layers that apply filters to the input data to produce a feature map, which is then further processed through pooling layers and fully connected layers to produce the final prediction. A primary metric for assessing the effectiveness of algorithm implementation is Performance Evaluation, with a particular focus on accuracy [33]. The architecture of CNN can be seen in Table 3.

The model is built using a Convolutional Neural Network (CNN) architecture for multi-label sentiment analysis. First, an embedding layer converts words in the text into low-dimensional vectors with a dimension of sixty-four, helping the model understand the context of words in the vector space. Next, two Conv1D layers are used to extract features from the text, each with 128 filters and a kernel size of five, utilizing ReLU activation to introduce non-linearity. After each convolutional layer, a MaxPooling1D layer with a pooling size of five is applied to reduce the output dimensions and take the maximum value from a specific window. A GlobalMaxPooling1D layer is then used to transform the output into a 1D vector, representing the most significant features from the entire text. Two dropout layers with a dropout rate of 10% are applied to reduce overfitting by randomly deactivating a portion of the neurons during training. A dense layer with 128 units and ReLU activation is used to learn patterns from the extracted features, followed by an output layer with sigmoid activation that generates probabilities for each label in the multi-label task. This approach enables the CNN model to effectively extract features from text and perform multi-label sentiment analysis tasks with high accuracy while reducing the risk of overfitting using dropout layers.

The CNN model used the binary_crossentropy loss function, Adam optimizer with exponential learning rate decay, and metrics like accuracy, AUC, precision, and recall. An early stopping callback was applied to halt training early if there was no improvement in val_loss after 20 epochs, reverting to the best weights found. The model was then trained using the training data (X_train and y_train) and validated with the test data (X_test and y_test). The training ran for up to 50 epochs with a batch size of 128, and class_weight addressed data imbalance. This architecture enables the CNN model to accurately perform multi-label sentiment analysis by effectively extracting text features.
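The architecture and training setup described above could look roughly like the following Keras sketch. It is a reconstruction from the prose, not the authors' code: the vocabulary size, the placement of the two dropout layers, and the learning-rate schedule values are assumptions, and the input length of 50 is taken from the worked example in this section.

```python
import tensorflow as tf

layers = tf.keras.layers

VOCAB_SIZE = 20000   # assumption: vocabulary size is not stated in the text
MAX_LEN = 50         # input length used in the worked example
NUM_LABELS = 8       # eight emotion labels

def build_cnn():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(MAX_LEN,)),
        layers.Embedding(VOCAB_SIZE, 64),              # 64-dimensional embeddings
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(5),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(5),
        layers.GlobalMaxPooling1D(),
        layers.Dropout(0.1),                           # dropout placement is assumed
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(NUM_LABELS, activation="sigmoid"),  # one probability per label
    ])
    # Schedule values are illustrative assumptions.
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=1000, decay_rate=0.9)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=schedule),
        loss="binary_crossentropy",
        metrics=["accuracy",
                 tf.keras.metrics.AUC(multi_label=True),
                 tf.keras.metrics.Precision(),
                 tf.keras.metrics.Recall()],
    )
    return model

# Stop when val_loss has not improved for 20 epochs and restore the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)

model = build_cnn()
```

Training would then pass `epochs=50`, `batch_size=128`, the `early_stop` callback, and the class weights to `model.fit`. The sigmoid output with binary cross-entropy treats each emotion as an independent yes/no decision, which is what makes the model multi-label rather than multi-class.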

The trained CNN model computes each feature-map element according to the general form in Equation (1):

$$FM_{a,b} = bias + \sum_{c=1}^{C} \sum_{d=1}^{D} Z_{c,d} \times X_{a+c-1,\, b+d-1} \qquad (1)$$

where:

$FM_{a,b}$ = feature-map element at position $(a, b)$;
$bias$ = bias on the feature map;
$Z_{c,d}$ = weight of the convolution kernel at position $(c, d)$;
$X$ = input;
$a$ = 1, 2, …, A, where A is the length of the feature map;
$b$ = 1, 2, …, B, where B is the width of the feature map;
$c$ = 1, 2, …, C, where C is the length of the convolutional kernel;
$d$ = 1, 2, …, D, where D is the width of the convolutional kernel.

To calculate the feature map $FM$ manually after training, the author takes the example of calculating the element $FM_{1,1}$ for the first filter. With input $X$ having a dimension of 50 and kernel $Z$ of size 640 × 128 for the second Conv1D layer, the author used the first part of input $X$ corresponding to the kernel size. In this context, kernel $Z$ has a filter length of five, 128 input channels (corresponding to the 128 output channels of the preceding Conv1D layer), and 128 output channels. Each filter in the Conv1D layer therefore has a kernel dimension of 5 × 128: five elements along the sequence direction (from top to bottom) and 128 elements covering all input channels (from left to right). Since there are 128 filters, the $Z$ kernel for the second Conv1D layer has a total size of 640 × 128 (640 rows and 128 columns). The $Z$ kernel for the first filter has the following values:

$$Z = \begin{bmatrix} 0.050685 & -0.067715 & -0.039530 & \dots & -0.071520 \\ 0.021489 & -0.063418 & 0.025046 & \dots & -0.008875 \\ 0.008716 & 0.036468 & -0.017816 & \dots & 0.029493 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0.0011773 & 0.015347 & 0.014514 & \dots & -0.011156 \end{bmatrix}_{640 \times 128}$$

The value of ${FM}_{1,1}$ is obtained by summing the products of each kernel element with the corresponding input element and then adding the bias. In this case, the bias value is 0.096450. Therefore, with $a = b = 1$, the equation becomes:

$$
{FM}_{1,1} = 0.096450 + \sum_{c = 1}^{5} \sum_{d = 1}^{128} \left( Z_{c, d} \times X_{a + c - 1, b + d - 1} \right)
$$

Substituting the kernel values and the input $X$:

$$
X = [0, 0, 0, 0, \dots, 2532]
$$

$$
{FM}_{1,1} = 0.096450 + (0.050685 \times 0) + (-0.067715 \times 0) + (0.003102 \times 0) + \dots + (-0.071520 \times 2532) = -181.05419
$$

Thus, the value of ${FM}_{1,1}$ for the first filter is −181.05419. This negative value reflects that the combination of input elements passed through the kernel produces a negative output for position (1, 1) on the feature map. Similar steps are performed for all 128 filters and all positions $(a, b)$ on input $X$ to obtain the entire feature map.
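The manual calculation in Equation (1) can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the input `X` and kernel `Z` below are random placeholders with a hypothetical shape of 9 timesteps × 128 channels and 5 × 128 weights for a single filter, not the trained weights reported above. Only the bias value 0.096450 is taken from the text.

```python
import numpy as np

def feature_map_element(X, Z, bias, a, b=1):
    """Evaluate Equation (1) at the 1-based position (a, b).

    X    : input of shape (timesteps, channels)
    Z    : kernel of one filter, shape (C, D)
    bias : scalar bias of that filter
    """
    C, D = Z.shape
    # Window of the input that the kernel covers at position (a, b).
    window = X[a - 1:a - 1 + C, b - 1:b - 1 + D]
    return bias + np.sum(Z * window)

rng = np.random.default_rng(0)
X = rng.normal(size=(9, 128))               # hypothetical input to the layer
Z = rng.normal(scale=0.05, size=(5, 128))   # hypothetical kernel of the first filter
bias = 0.096450                             # bias value quoted in the text

fm_11 = feature_map_element(X, Z, bias, a=1)

# Sliding the window along the sequence yields the full feature map of one filter.
feature_map = np.array([feature_map_element(X, Z, bias, a) for a in range(1, 9 - 5 + 2)])
print(fm_11, feature_map.shape)
```

If Equation (1) is read as the value before the activation, a negative result such as −181.05419 would be mapped to zero by the ReLU activation that the Conv1D layers use.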

3.5.2. Bi-LSTM Implementation

At this stage, the Bidirectional Long Short-Term Memory (Bi-LSTM) model is built and trained. Bi-LSTM is a type of recurrent neural network used to process sequential data such as text. LSTMs can remember information from long sequences of data, while the bidirectional architecture allows the network to learn from both directions, improving the model’s ability to capture long-term dependencies in text. In the context of sentiment analysis, Bi-LSTM can capture the context of the previous and next words more effectively than a regular LSTM. The architecture of Bi-LSTM can be seen in Table 4.

The model is built using Bidirectional Long Short-Term Memory (Bi-LSTM) architecture for multi-label sentiment analysis. First, an embedding layer converts words in the text into low-dimensional vectors with a dimension of sixty-four, helping the model understand the context of words in the vector space. Next, a dropout layer with a dropout rate of 10% is applied to reduce overfitting by randomly deactivating a portion of the neurons during training. Two Bidirectional LSTM layers are used to capture long-term dependencies in the text from both directions (from start to end and vice versa), each with 16 units; the first layer returns the full sequence of outputs (return_sequences = True), while the second layer returns only the last output from the sequence. After each LSTM layer, a dropout layer is applied to further mitigate overfitting. Finally, a dense layer with sigmoid activation is used to generate probabilities for each label in the multi-label task, allowing the model to handle text with more than one label. An additional dropout layer before the output helps maintain the model’s generalization and reduce the risk of overfitting.

The Bi-LSTM model was compiled with the binary_crossentropy loss function, Adam optimizer, and metrics such as accuracy, AUC, precision, and recall. An early stopping callback terminated training if val_loss did not improve after 20 epochs, reverting to the best weights. The model was trained on X_train and y_train, and validated with X_test and y_test, for up to 50 epochs with a batch size of 128. Class_weight was used to address class imbalance. This configuration allows the Bi-LSTM to capture long-term dependencies in text and perform multi-label sentiment analysis with high accuracy.

The trained Bi-LSTM model can be expressed in the general form of Equations (2)–(5). To calculate the gates manually after training, the authors take as an example the calculation at the first timestep. The relevant data for the gate calculations are as follows.

The forget gate at the first timestep is calculated using Equation (2), with forget gate weights that form a matrix of size 64 × 16. The size of this matrix is determined by the input embedding dimension of sixty-four and the sixteen LSTM units. In other words, each row in the forget-gate weight matrix (64 rows in total) corresponds to a feature of the input and the previous state, while each column (16 columns in total) corresponds to a unit in the LSTM. The forget gate ($f_{t}$) at the first timestep is given by Equation (2):

$$
f_{t} = \sigma (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f}) \tag{2}
$$

where:

$\sigma$ = Sigmoid function that produces an output between 0 and 1, determining how much information should be forgotten or retained.
$W_{f}$ = Weight matrix for forget gate.
$h_{t - 1}$ = Hidden state of the previous timestep.
$x_{t}$ = Input at the current timestep.
$b_{f}$ = Bias for forget gate.

This research uses the first part of the input $[h_{t - 1}, x_{t}]$ and the following weight and bias values:

Input data ($[h_{t - 1}, x_{t}]$): $[0, 1, 2, \dots, 2532]$
Bias ($b_{f}$): $[0.00410211; -0.0967836; -0.07344416; \dots; -0.05566359; 0.1132111; 0.03457972]$

The values of the weights $W_{f}$ for the forget gate are as follows:

$$
W_{f} = [-0.25230718; -0.6338949; -0.35867327; \dots; -0.6317019; -0.09218203]
$$

The value of $f_{t}$ is obtained by summing the products of each weight element with its corresponding input element, adding the bias, and applying the sigmoid function. In this case, the equation becomes:

$$
f_{t} = \sigma (W_{f} \times [h_{t - 1}, x_{t}] + b_{f})
$$

Entering the values of the weights and inputs, the calculation is as follows:

$$
f_{t} = \sigma ((-0.252307 \times 0) + (-0.633895 \times 1) + (-0.358673 \times 2) + \dots + (-0.631702 \times 2532) + 0.004102)
$$

The weighted sum is then passed through the sigmoid function:

$$
\sigma (x) = \frac{1}{1 + e^{-x}}
$$

The results are as follows:

$$
f_{t} = \sigma ([-3191.1803; \dots; -1597.64402; 186.3824; \dots]) \approx 0 \ \text{(very small value)}
$$

Thus, the value of $f_{t}$ for the forget gate at the first timestep is close to zero.

The input gate at the first timestep is calculated using Equation (3), with an input gate weight matrix of size 64 × 16. The size of this matrix is determined by the input embedding dimension of 64 and the 16 LSTM units. The input gate ($i_{t}$) at the first timestep is given by Equation (3):

$$
i_{t} = \sigma (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i}) \tag{3}
$$

where:

$\sigma$ = Sigmoid function.
$W_{i}$ = Weight matrix for input gate.
$b_{i}$ = Bias for input gate.

$$
W_{i} = \left[\begin{matrix}
0.478586 & -0.190812 & -0.553236 & \dots & 0.224637 \\
-0.074447 & -0.296983 & -0.778060 & \dots & -0.259077 \\
-0.455351 & -0.559245 & 0.572737 & \dots & -0.583992 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
-0.454551 & -0.050116 & 0.721157 & \dots & -0.093608
\end{matrix}\right]_{64 \times 16}
$$

This research uses the first part of the input $[h_{t - 1}, x_{t}]$ and the following weight and bias values:

Input data ($[h_{t - 1}, x_{t}]$): $[0, 1, 2, \dots, 2532]$
Bias ($b_{i}$): $[1.120849; 1.161612; 1.136956; \dots; 1.015041]$

The weight values $W_{i}$ for the input gate are as follows:

$$
W_{i} = [0.478585; -0.190812; -0.553236; \dots; 0.224637]
$$

The value of $i_{t}$ is obtained by summing the products of each weight element with its corresponding input element, adding the bias, and applying the sigmoid function. In this case, the equation becomes:

$$
i_{t} = \sigma (W_{i} \times [h_{t - 1}, x_{t}] + b_{i})
$$

Entering the values of the weights and inputs, the calculation is as follows:

$$
i_{t} = \sigma ((0.478585 \times 0) + (-0.190812 \times 1) + (-0.553236 \times 2) + \dots + (0.224637 \times 2532) + 1.120849)
$$

Passing the weighted sum through the sigmoid function gives the following result:

$$
i_{t} = \sigma ([567.381; \dots; -139.78; \dots]) \approx 1 \ \text{(very big value)}
$$

Thus, the value of $i_{t}$ for the input gate at the first timestep is close to one.

The candidate gate at the first timestep is calculated using Equation (4), with a candidate gate weight matrix of size 80 × 16, analogous to the input gate and forget gate matrices. The candidate gate (${\tilde{C}}_{t}$) at the first timestep is given by Equation (4):

$$
{\tilde{C}}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c}) \tag{4}
$$

where:

$\tanh$ = The hyperbolic tangent function, which produces an output between −1 and 1 and is used in the candidate gate to help scale the data.
$W_{c}$ = Weight matrix for candidate gate.
$b_{c}$ = Bias for candidate gate.

$$
W_{c} = \left[\begin{matrix}
-0.210637 & 0.582318 & 0.026176 & \dots & 0.050023 \\
-0.407941 & -0.418978 & -0.364767 & \dots & -0.150183 \\
0.263948 & 0.193731 & -0.001894 & \dots & 0.050437 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0.263948 & -0.218638 & -0.183733 & \dots & 0.274419
\end{matrix}\right]_{80 \times 16}
$$

This research uses the first part of the input $[h_{t - 1}, x_{t}]$ and the following weight and bias values:

Input data ($[h_{t - 1}, x_{t}]$): $[0, 1, 2, \dots, 2532]$
Bias ($b_{c}$): $[-1.025372; 6.291897; 9.400418; \dots; 1.878555]$

The values of the weights $W_{c}$ for the candidate gate are as follows:

$$
W_{c} = [-0.210637; 0.582318; 0.026176; \dots; -0.578022; 0.050022]
$$

The value of ${\tilde{C}}_{t}$ is obtained by summing the products of each weight element with its corresponding input element, adding the bias, and applying the tanh function. In this case, the equation becomes:

$$
{\tilde{C}}_{t} = \tanh (W_{c} \times [h_{t - 1}, x_{t}] + b_{c})
$$

Entering the values of the weights and inputs, the calculation is as follows:

$$
{\tilde{C}}_{t} = \tanh ((-0.210637 \times 0) + (0.582318 \times 1) + (0.026176 \times 2) + \dots + (0.050022 \times 2532) - 1.025372)
$$

The weighted sum is then passed through the tanh function:

$$
\tanh (x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

The results are as follows:

$$
{\tilde{C}}_{t} = \tanh ([105.28; \dots; -41.50; \dots]) \approx 1 \ \text{(very big value)}
$$

Thus, the value of ${\tilde{C}}_{t}$ for the candidate gate at the first timestep is close to one.

The output gate at the first timestep is calculated using Equation (5), with an output gate weight matrix of size 64 × 16, analogous to the input gate, forget gate, and candidate gate matrices. The output gate ($o_{t}$) at the first timestep is given by Equation (5):

$$
o_{t} = \sigma (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o}) \tag{5}
$$

where:

$\sigma$ = Sigmoid function.
$W_{o}$ = Weight matrix for output gate.
$b_{o}$ = Bias for output gate.
$C_{t}$ = The cell state at the current timestep, which is a combination of the old information and the new updated information.
$h_{t}$ = The hidden state at the current timestep, which is used as the output of the LSTM at this timestep.

$$
W_{o} = \left[\begin{matrix}
-0.093755 & -0.261583 & -0.629266 & \dots & 0.380140 \\
0.719980 & -0.594796 & -0.198586 & \dots & 0.176501 \\
-0.438914 & -0.094007 & 0.206499 & \dots & 0.064923 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0.167823 & 0.303261 & -0.180812 & \dots & -0.458705
\end{matrix}\right]_{64 \times 16}
$$

This research uses the first part of the input $[h_{t - 1}, x_{t}]$ and the following weight values:

$$
W_{o} = [-0.093755; -0.261583; -0.629266; \dots; -0.046914; 0.380140]
$$

The value of $o_{t}$ is obtained by summing the products of each weight element with its corresponding input element, adding the bias, and applying the sigmoid function. In this case, the equation becomes:

$$
o_{t} = \sigma (W_{o} \times [h_{t - 1}, x_{t}] + b_{o})
$$

Entering the values of the weights and inputs, the calculation is as follows:

$$
o_{t} = \sigma ((-0.093755 \times 0) + (-0.261583 \times 1) + (-0.629266 \times 2) + \dots + (0.380140 \times 2532) + 0.048839)
$$

Passing the weighted sum through the sigmoid function gives the following result:

$$
o_{t} = \sigma ([902.74; \dots; -124.31; \dots]) \approx 1 \ \text{(very big value)}
$$

Thus, the value of $o_{t}$ for the output gate at the first timestep is close to one. Similar steps are performed for all units and all timesteps of the input $[h_{t - 1}, x_{t}]$ to obtain the overall gate values.

Gate values close to 0 or 1 indicate activation or non-activation of the gate. A value close to one indicates the gate is active or open, while a value close to zero indicates the gate is inactive or closed.
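The four gate calculations in Equations (2)–(5) share the same pattern, which the following NumPy sketch makes explicit for a single timestep. The weights and inputs are random placeholders rather than the trained values listed above, and the sketch assumes that the hidden state (16 values) and the embedded input (64 values) are concatenated into one 80-dimensional vector, which is only one possible reading of the matrix sizes quoted in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_gates(h_prev, x_t, W, b):
    """Compute forget, input, candidate, and output gates for one timestep."""
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(concat @ W["f"] + b["f"])         # Equation (2)
    i_t = sigmoid(concat @ W["i"] + b["i"])         # Equation (3)
    c_tilde = np.tanh(concat @ W["c"] + b["c"])     # Equation (4)
    o_t = sigmoid(concat @ W["o"] + b["o"])         # Equation (5)
    return f_t, i_t, c_tilde, o_t

units, emb_dim = 16, 64                             # sizes stated in the text
rng = np.random.default_rng(0)

# Hypothetical parameters: one (80 x 16) matrix and one bias vector per gate.
W = {k: rng.normal(scale=0.1, size=(units + emb_dim, units)) for k in "fico"}
b = {k: np.zeros(units) for k in "fico"}

h_prev = np.zeros(units)                            # no history at the first timestep
x_t = rng.normal(size=emb_dim)                      # hypothetical embedded token

f_t, i_t, c_tilde, o_t = lstm_gates(h_prev, x_t, W, b)

# Large-magnitude weighted sums saturate the sigmoid toward 0 or 1.
saturated = sigmoid(np.array([-30.0, 30.0]))
print(f_t.shape, saturated)
```

The last two lines show why weighted sums with large magnitudes, as in the worked examples, drive the gates to values that are practically 0 or 1.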

3.5.3. CNN Bi-LSTM Implementation

At this stage, this research will build and train a combined CNN and Bi-LSTM model. The CNN Bi-LSTM model combines the advantages of Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (Bi-LSTM) for multi-label sentiment analysis. CNN is used to extract spatial features from text, while Bi-LSTM is used to capture long-term dependencies in text from both directions. The architecture of CNN Bi-LSTM can be seen in Table 5.

The model is built using a CNN-Bi-LSTM architecture for multi-label sentiment analysis. First, an embedding layer converts words in the text into low-dimensional vectors with a dimension of sixty-four, helping the model understand the context of the words in the vector space. Next, a dropout layer with a dropout rate of 10% is applied to reduce overfitting. A Conv1D layer with sixteen filters and a kernel size of five extracts spatial features from the text, followed by a MaxPooling1D layer with a pooling size of four that reduces the output dimensions of the convolutional layer. Two Bidirectional LSTM layers with sixteen units each capture long-term dependencies in the text from both directions. The first layer returns the full sequence of outputs, while the second layer returns only the last output from the sequence. Following this, dropout layers with a 10% dropout rate are applied to reduce overfitting. A dense layer with sigmoid activation is used as the output layer to produce probabilities for each label in the multi-label task.

The model used the binary_crossentropy loss function, Adam optimizer with learning rate decay, and metrics like accuracy, AUC, precision, and recall. It trained on X_train and y_train and validated on X_test and y_test. Training ran for up to 50 epochs with a batch size of 128, addressing class imbalance with class_weight. Early stopping prevented overfitting by halting if val_loss did not improve after 20 epochs, restoring the best weights.
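The early stopping rule described for all three models (stop when val_loss has not improved for 20 epochs, then restore the best weights) can be sketched in plain Python as below. This is an illustrative re-implementation of the logic, not the Keras callback itself; the function name and the validation-loss sequence are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where `patience` epochs have passed
    without a new minimum; the weights of `best_epoch` would then be restored.
    Epochs are 0-based.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch      # improvement: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch                 # patience exhausted
    return len(val_losses) - 1, best_epoch           # ran through all epochs

# Hypothetical run of up to 50 epochs: the loss improves for 3 epochs, then stalls.
val_losses = [1.0, 0.8, 0.7] + [0.75] * 47
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=20)
print(stop_epoch, best_epoch)
```

With a limit of 50 epochs and a patience of 20, the rule can only trigger when the best validation loss is reached within roughly the first 30 epochs; otherwise training simply runs to the epoch limit.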

4. Results and Discussion

The Results and Discussion chapter provides a comprehensive evaluation of the three models: CNN, Bi-LSTM, and CNN Bi-LSTM. An analysis based on evaluation metrics, such as accuracy, precision, recall, F1-score, AUC, and Hamming loss, is provided to evaluate each model’s performance in multi-label sentiment classification. This chapter discusses the strengths and weaknesses of each model, offering deeper insights into the effectiveness of deep learning approaches for sentiment analysis in the context of the Indonesian presidential election. This comparison aims to offer valuable guidance for selecting the most suitable model for similar applications in the future.

4.1. Confusion Matrix

The confusion matrix is an evaluation tool that displays the comparison between model predictions and actual labels in the form of a matrix. This matrix shows the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) for each class. The confusion matrix helps in understanding the distribution of prediction errors and how well the model classifies each class.
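In a multi-label setting, these four counts are computed separately for each label, and accuracy, precision, recall, and F1-score follow from them. The sketch below shows one way to do this with NumPy; the toy label matrices and the function names are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def per_label_confusion(y_true, y_pred):
    """Return TP, TN, FP, FN counts for each label (column)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred, axis=0)
    tn = np.sum(~y_true & ~y_pred, axis=0)
    fp = np.sum(~y_true & y_pred, axis=0)
    fn = np.sum(y_true & ~y_pred, axis=0)
    return tp, tn, fp, fn

def safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

def per_label_metrics(y_true, y_pred):
    tp, tn, fp, fn = per_label_confusion(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# Hypothetical data: 4 comments, 3 labels.
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [0, 1, 0], [1, 0, 1]]

metrics = per_label_metrics(y_true, y_pred)
for name, values in metrics.items():
    print(name, values.round(3))
```

Because True Negatives dominate for rare labels, a label can show high accuracy while its precision or recall remains low, which is the pattern reported for labels such as anticipation and surprise.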

4.1.1. CNN

The CNN confusion matrix in Figure 6 indicates that the CNN model has varying levels of performance. The model demonstrates high accuracy, precision, recall, and F1-score in detecting presidential candidate entity labels such as Anies, Prabowo, and Ganjar. However, it encounters difficulties in classifying labels such as anticipation and surprise, evidenced by a high number of False Negatives (FN) and False Positives (FP). Based on the confusion matrix in Figure 6, researchers can evaluate the performance of the CNN model in multi-label classification across various emotions and candidate labels.

Table 6 shows that the CNN model has very high accuracy for the Anies, Prabowo, and Ganjar labels, with values of 95.91%, 96.91%, and 98.65%, respectively. This shows that the CNN model is very good at classifying these three labels. The fear and sadness labels also have high accuracy above 94%, showing good performance in identifying these two emotions. Labels with lower accuracy, such as joy (77.66%) and trust (81.66%), show that the model still has room for improvement in identifying these emotions.

The highest precision can be seen in the Prabowo label at 98.27% and Anies at 97.05%, indicating that the model is very good at reducing False Positives (FP) on these two labels. The anticipation label has a very low precision of 32.27%, indicating that many False Positives (FP) occur, so the model often misclassifies the anticipation label as another label.

The highest recall is also seen for the Anies and Prabowo labels at 86.50% and 87.66%, respectively, indicating that the model is very good at detecting almost all correct instances of these two labels. The sadness and anticipation labels have low recalls of 40.65% and 48.26%, respectively, indicating that the model often misses instances of these two labels.

Overall, the CNN model shows very good performance on certain labels, especially Anies, Prabowo, and Ganjar. However, it still requires optimization on emotion labels such as anticipation and surprise, where the precision and recall are still low, in order to achieve more accurate and consistent classification across labels.

4.1.2. Bi-LSTM

The Bi-LSTM confusion matrix in Figure 7 shows that the Bi-LSTM model has a very good performance in detecting certain labels such as Anies, Prabowo, and Ganjar with a small number of errors. The Bi-LSTM model also shows good performance on emotion labels such as sadness and fear. However, the model has difficulty in detecting emotions such as disgust and anticipation.

The confusion matrix in Figure 7 allows researchers to assess the Bi-LSTM model’s performance in multi-label classification of emotions and candidate labels. Accuracy, precision, and recall results are shown in Table 7.

In Table 7, the Bi-LSTM model demonstrates exceptionally high accuracy for the labels Anies, Prabowo, and Ganjar, with values of 99.28%, 99.54%, and 99.64%, respectively. This suggests that the model is highly effective in classifying these three labels. Additionally, the model exhibits strong performance in identifying emotions such as fear and sadness, with accuracy exceeding 95%. However, there is room for improvement in recognizing emotions like joy and trust, which have lower accuracies of 81.98% and 82.92%, respectively.

The highest precision is observed in the Prabowo label at 99.18% and the Anies label at 99.66%, demonstrating the model’s effectiveness in minimizing False Positives (FP) for these two labels. In contrast, the anticipation label exhibits a lower precision of 47.76%, indicating a higher occurrence of False Positives and frequent misclassification of anticipation as another label. The labels disgust, joy, and trust have precision scores ranging from approximately 70% to 85%, suggesting that the model is reasonably effective in reducing False Positive (FP) errors for these labels.

The highest recall is also seen for the Anies label at 97.50% and Prabowo at 98.13%, indicating that the model is very good at detecting almost all correct instances of these two labels. The surprise label has a lower recall of 35.71%, indicating that the model often fails to detect this label. The sadness and anticipation labels also have relatively low recalls of 42.69% and 46.74%, respectively, indicating that the model often misses instances of these two labels.

Overall, the Bi-LSTM model shows very good performance on certain labels, especially on Anies, Prabowo, and Ganjar. However, it still needs improvement on other labels, especially on emotions such as anticipation and surprise, where precision and recall are still relatively low.

4.1.3. CNN Bi-LSTM

The confusion matrix in Figure 8 shows that the CNN Bi-LSTM model has a very good performance in detecting certain labels such as Anies, Prabowo, and Ganjar with a minimal amount of error. The model also performs well on emotions such as sadness and fear. However, the model has greater difficulty in detecting emotions such as disgust, anticipation, and joy, as shown by the high number of False Negatives (FN) and False Positives (FP).

The confusion matrix in Figure 8 allows for the evaluation of the CNN Bi-LSTM model’s performance in multi-label classification across various emotions and candidate labels. The calculation results of accuracy, precision, and recall are presented in Table 8.

In Table 8, the CNN Bi-LSTM model shows high accuracy for the Anies (93.43%), Prabowo (94.28%), and Ganjar (98.03%) labels, indicating strong classification performance. It also performs well in identifying fear and sadness, with accuracy above 94%. However, lower accuracy for joy (74.95%) and trust (79.41%) suggests room for improvement in detecting these emotions.

The highest precision is seen in the Prabowo label with a score of 97.10% and Anies with 96.50%, indicating that the model is very good at reducing False Positives (FP) on these two labels. The anticipation label has a lower precision of 41.05%, indicating that many False Positives (FP) occur, so the model often misclassifies anticipation as another label. For the labels disgust, joy, and trust, the precision is around 70–80%, indicating that the model is quite effective in reducing False Positive (FP) misclassification.

The highest recall is also seen for the Anies label with a score of 76.90% and Prabowo with 76.62%, indicating that the model detects most of the correct instances of these two labels. The surprise label has a lower recall of 23.92%, indicating that the model often fails to detect the surprise label. The sadness and anticipation labels also have low recalls of 30.61% and 40.87%, respectively, indicating that the model often misses instances of these two labels.

Overall, the CNN Bi-LSTM model shows excellent performance on certain labels, especially on Anies, Prabowo, and Ganjar. However, it requires improvement on other labels, especially on emotions such as anticipation and surprise, where precision and recall are still relatively low.

4.2. AUC

Area Under the Curve (AUC) is a metric that indicates the ability of the model to distinguish between classes. The ROC (Receiver Operating Characteristic) curve illustrates the trade-off between true positive rate (TPR) and false positive rate (FPR) at various thresholds. The AUC provides a value between 0 and 1, where values close to one indicate a model that is good at predicting classification. The performance evaluation of the CNN, Bi-LSTM, and CNN-BiLSTM models in emotion classification and entity detection of presidential candidates shows varying results, as seen in Figure 9. The CNN model achieved an average AUC value of 0.89, reflecting a good ability to classify emotions. The emotion fear recorded the highest AUC with a value of 0.91, while trust followed with an AUC of 0.88. Other emotion categories, such as anger, disgust, joy, sadness, surprise, and anticipation, fall within the AUC range of 0.84 to 0.86, showing consistent performance across categories.

Meanwhile, the Bi-LSTM model showed a significant performance improvement over CNN, with an average AUC value of 0.92. In the emotion category, fear again recorded the highest AUC value of 0.93, followed by trust with 0.91. Other emotion categories, such as anger, anticipation, disgust, joy, sadness, and surprise, have AUC values in the range of 0.87 to 0.90. This shows that the Bi-LSTM model is able to detect emotions more accurately than the CNN model. For presidential candidate entity detection, both CNN and Bi-LSTM showed excellent performance. The CNN model managed to achieve AUCs of 0.97 and 0.98 in recognizing the Anies, Prabowo, and Ganjar entities, while the Bi-LSTM model gave perfect results with an AUC of 1.00 for all three entities. This confirms that both models have a strong ability to detect candidate entities in the context of the presidential election.

The CNN-BiLSTM model, as a combination of both architectures, performed well with an average AUC value of 0.86. The fear and trust emotion categories recorded AUCs of 0.86 and 0.85, respectively, while the other emotions fell within the AUC range of 0.80 to 0.82. Although the AUC value of the emotion classification is slightly lower than that of the single Bi-LSTM model, the model still shows quite good results in classifying various emotions. In terms of presidential candidate entity detection, the CNN-BiLSTM model shows stronger performance compared to the emotion classification. The AUC for the Anies and Prabowo entities reached 0.94, while Ganjar recorded an AUC value of 0.93. This shows that the integration of CNN and Bi-LSTM provides an advantage in entity detection, although it slightly degrades the performance on emotion classification compared to a single Bi-LSTM model.
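For a single label, the AUC can be read as the probability that a randomly chosen positive comment receives a higher score than a randomly chosen negative one. The following NumPy sketch computes it directly from that definition; it is an illustrative implementation with hypothetical scores, not the routine used by the authors, and its pairwise comparison is only practical for small samples.

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC of one label via pairwise comparison of positive and negative scores.

    Ties between a positive and a negative score count as half a win.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")
    diff = pos[:, None] - neg[None, :]               # every positive vs. every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

def per_label_auc(y_true, y_prob):
    """Compute the AUC independently for each label (column)."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    return [roc_auc(y_true[:, j], y_prob[:, j]) for j in range(y_true.shape[1])]

# Hypothetical sigmoid outputs for 4 comments and 2 labels.
y_true = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y_prob = np.array([[0.10, 0.90], [0.40, 0.20], [0.35, 0.70], [0.80, 0.30]])

aucs = per_label_auc(y_true, y_prob)
print(aucs, float(np.mean(aucs)))
```

Averaging the per-label values, as in the last line, gives a macro-averaged AUC; whether the averages reported above were computed this way is not stated in the text.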

4.3. Hamming Loss

Hamming loss is a metric utilized to assess the model prediction error rate in multi-label classification problems. It measures the proportion of incorrect labels to the total, ideal for scenarios requiring individual label evaluation. This characteristic allows for an understanding of not only the overall performance but also the identification of specific labels that the model may be misclassifying. By computing the proportion of incorrectly predicted labels to the total labels, Hamming loss facilitates a detailed analysis, which is especially valuable in multi-label contexts where each label holds distinct importance.

In the realm of multi-label classification, Hamming loss is particularly pertinent as it allows for the evaluation of errors on each predicted label, providing deeper insights into areas where model performance may be deficient. The Hamming Loss vs. Threshold graph elucidates the relationship between the threshold value used in each model and the model’s performance as measured by the Hamming loss metric. The Y-axis in this graph shows the Hamming loss value, indicating how often the model mispredicts labels. The X-axis represents the threshold above which a predicted probability is counted as a positive label.

The evaluation of Hamming loss in the CNN, Bi-LSTM, and CNN-BiLSTM models shows a consistent pattern related to determining the optimal threshold. In the CNN model, as shown in Figure 10, the Hamming loss is at a high value of 0.2112 when the threshold is at a low point of 0.1. This shows that at low thresholds, the model tends to predict labels incorrectly. As the threshold increases, the Hamming loss decreases until it reaches a low of 0.1000 at a threshold of 0.6, signaling the best performance of the model. However, once the threshold exceeds this point, the Hamming loss increases again, indicating that the correct predictions start to decrease. The Bi-LSTM model, as can be seen in Figure 10, shows similar results. At a threshold of 0.1, the Hamming loss is at a high value of 0.1397. As the threshold increases, the Hamming loss decreases significantly and reaches its lowest point of 0.0816 at a threshold of 0.5. This shows the best performance of the Bi-LSTM model with the lowest prediction error at this point. As with the CNN model, once the threshold exceeds this point, the Hamming loss increases again, indicating that correct predictions start to be missed.

In the CNN-BiLSTM model, the Hamming loss evaluation shown in Figure 10 also follows a similar pattern. At a low threshold of 0.1, the Hamming loss is at 0.2283, which indicates many prediction errors. As the threshold increases, the Hamming loss decreases significantly and reaches the lowest value of 0.1107 at a threshold of 0.5. After this threshold is exceeded, the Hamming loss rises again, indicating that the correct predictions start to decrease. Based on the results of this evaluation, a threshold of around 0.5 consistently provides the lowest Hamming loss value for all three models. This indicates that it is the optimal point for the models to predict labels with the lowest error rate. If the threshold is set too low or too high, the Hamming loss increases, which means more prediction errors occur.

Overall, choosing the right threshold, which is around 0.5, is very important to minimize Hamming loss and improve accuracy in multi-label classification. All three models show their best performance in predicting labels at this threshold, although the Bi-LSTM model shows the lowest Hamming loss among the three.
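The threshold analysis described in this section can be reproduced in outline with a few lines of NumPy. The sketch below sweeps the decision threshold from 0.1 to 0.9 and records the Hamming loss at each step; the probability matrix is a hypothetical toy example, so the resulting values are unrelated to those in Figure 10.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots (sample x label) that are predicted incorrectly."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def sweep_thresholds(y_true, y_prob, thresholds):
    """Map each threshold to the Hamming loss obtained when binarizing y_prob with it."""
    return {float(t): hamming_loss(y_true, (y_prob >= t).astype(int)) for t in thresholds}

# Hypothetical sigmoid outputs for 3 comments and 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_prob = np.array([[0.90, 0.20, 0.60],
                   [0.30, 0.70, 0.40],
                   [0.55, 0.80, 0.15]])

thresholds = np.round(np.linspace(0.1, 0.9, 9), 1)   # 0.1, 0.2, ..., 0.9
losses = sweep_thresholds(y_true, y_prob, thresholds)
best_threshold = min(losses, key=losses.get)

for t, loss in losses.items():
    print(f"threshold={t:.1f}  hamming_loss={loss:.4f}")
print("best threshold:", best_threshold)
```

At very low thresholds almost every label is predicted as present, and at very high thresholds almost none is, so the loss curve is typically U-shaped with its minimum somewhere in between, as observed for all three models.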

Hamming loss is beneficial as it effectively captures the fraction of incorrect label predictions in multi-label scenarios, particularly highlighting issues with overlapping labels. However, it has limitations, such as ignoring True Positives and being sensitive to label imbalance, which can skew results. To address overlapping labels, strategies like adjusting the classification threshold or using additional metrics alongside Hamming loss can provide a more nuanced evaluation of model performance.

4.4. Comparison of Model Evaluation Results

In this subchapter, we compare the results of the three models that have been implemented, namely CNN, Bi-LSTM, and CNN Bi-LSTM. The comparison is based on several evaluation metrics, namely confusion matrix, AUC, and Hamming loss. The purpose of this comparison is to determine which model shows the best performance and the reason behind the assessment.

In Table 9, the Bi-LSTM model shows the highest accuracy results in most categories, indicating that it is more reliable in predicting the correct class compared to CNN and CNN Bi-LSTM. For instance, Bi-LSTM achieves top accuracy scores in joy (0.9538), sadness (0.9538), and trust (0.9964), highlighting its ability to make correct predictions. In comparison, CNN scores lower in sadness (0.9433) and trust (0.9591), while CNN Bi-LSTM does in joy (0.9400) and trust (0.9343). The higher accuracy of the Bi-LSTM model reflects its superior capacity for correct classification.

In Table 10, the Bi-LSTM model shows the best precision for most categories. For example, it excels in fear (0.7116) and trust (0.8548), indicating that this model is better at minimizing False Positives and is therefore more often right when it predicts that a label is present. CNN, on the other hand, records lower precision for trust (0.8088), while CNN Bi-LSTM records lower precision for trust (0.8054) and fear (0.6650). In Table 11, the Bi-LSTM model also leads in recall for most categories, with the highest scores in anger (0.5895) and joy (0.6631), suggesting that it is more sensitive in detecting the correct class and better at minimizing False Negatives than CNN and CNN Bi-LSTM, although CNN records slightly higher recall for anticipation, fear, and trust.

In Table 12, the Bi-LSTM model also achieves the best F1-score in most categories, such as fear (0.6347) and joy (0.7339). A higher F1-score indicates that this model is better at balancing precision and recall, that is, at keeping both False Positives and False Negatives low. In contrast, CNN shows a lower F1-score for sadness (0.4672) and anger (0.6109). In Table 13, Bi-LSTM shows the highest AUC values, such as for fear (0.93) and trust (0.91). A higher AUC suggests that Bi-LSTM is more reliable in distinguishing between positive and negative instances of each class. Finally, in Table 14, the Bi-LSTM model demonstrates the lowest Hamming loss, with 0.0816 at the 0.50 threshold, indicating a lower misclassification rate than CNN (0.1002) and CNN Bi-LSTM (0.1107). A lower Hamming loss further confirms the Bi-LSTM model’s reliability in making accurate predictions.
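As a worked check of how the F1-score relates to precision and recall, the short sketch below recomputes the harmonic mean for three Bi-LSTM labels from the precision and recall values in Tables 10 and 11. The helper name `f1_score` is introduced here for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Bi-LSTM (precision, recall) pairs as reported in Tables 10 and 11.
bilstm = {
    "anger": (0.7375, 0.5895),
    "fear": (0.7116, 0.5729),
    "joy": (0.8217, 0.6631),
}

f1 = {label: round(f1_score(p, r), 4) for label, (p, r) in bilstm.items()}
```

The recomputed values agree with the Bi-LSTM entries for anger, fear, and joy in Table 12 to within rounding of the inputs.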

5. Deployment

In this section, the researchers describe the deployment stage of the Bi-LSTM model, which was evaluated and shown to perform best. The deployment process involves several important steps to ensure the model can be implemented effectively. The first step is to prepare the frameworks used, such as Flask for website development and the machine learning libraries required by the Bi-LSTM model. This preparation ensures that the entire deployment process runs smoothly.

The deployment uses the Flask framework to build a web application that classifies new, unlabeled YouTube comments, such as those in Table 15, and outputs a file containing the comments along with the appropriate labels. After the user uploads the XLSX file containing the new YouTube comments through the web interface shown in Figure 11, the file is stored in a dedicated directory to facilitate further processing. The application then reads the uploaded file and extracts the comments it contains. The comments are preprocessed, including tokenization and padding, to match the input format required by the Bi-LSTM model. The Bi-LSTM model then predicts the corresponding labels for each comment.

The resulting labels are added to the original file as new columns. The labeled file, as in Table 16, is then saved in the output directory. Users can download the labeled file through the Flask web interface shown in Figure 12. Thus, the evaluated Bi-LSTM model can be deployed to process YouTube comments automatically and return the results as labeled XLSX files. This web application makes it easy for users to upload unlabeled comments and obtain labeling results quickly and efficiently. After the labeling process is complete, the dataset can be used to produce various visualizations for analyzing the sentiment of YouTube comments related to the presidential candidate debates.
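The labeling step of this pipeline can be sketched as follows. This is a minimal illustration, not the application's actual code: the whitespace tokenizer, the post-padding scheme, the 0.5 threshold, the column names, and the `stub_predict` function (a stand-in for the trained Bi-LSTM model) are all assumptions, and the Flask upload and XLSX handling are omitted.

```python
from typing import Callable, Dict, List, Sequence

# Eight emotions plus three candidates, matching the 11 model outputs.
LABELS = ["anger", "anticipation", "disgust", "fear", "joy", "sadness",
          "surprise", "trust", "anies", "prabowo", "ganjar"]

def encode(text: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Whitespace-tokenize, map tokens to ids (1 = unknown), pad with 0."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

def label_comments(
    comments: Sequence[str],
    vocab: Dict[str, int],
    predict: Callable[[List[List[int]]], List[List[float]]],
    max_len: int = 50,
    threshold: float = 0.5,
) -> List[dict]:
    """Return one row per comment with a 0/1 column for every label."""
    batch = [encode(c, vocab, max_len) for c in comments]
    probs = predict(batch)
    rows = []
    for comment, scores in zip(comments, probs):
        row = {"Comment_Text": comment}
        row.update({name: int(s >= threshold) for name, s in zip(LABELS, scores)})
        rows.append(row)
    return rows

def stub_predict(batch: List[List[int]]) -> List[List[float]]:
    """Hypothetical stand-in for the model: flags trust and anies if id 2 occurs."""
    out = []
    for ids in batch:
        hit = 0.9 if 2 in ids else 0.1
        out.append([hit if name in ("trust", "anies") else 0.1 for name in LABELS])
    return out

vocab = {"anies": 2, "debat": 3}  # toy vocabulary
rows = label_comments(["pak anies hebat", "debat malam ini"], vocab, stub_predict)
```

In a real deployment, `predict` would wrap the trained model's prediction call and the rows would be written back to the output XLSX file.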

Figure 13 shows the volume of conversation for each candidate, Anies, Prabowo, and Ganjar, based on YouTube comments from the Indonesian presidential candidate debate. The data visualization on this graph highlights the total number of comments that mention each candidate.

In Figure 13, Anies has the highest volume of conversation, with more than 10,000 mentions. This shows that Anies is the most talked-about of the three candidates. This high volume may be caused by various factors, such as popularity, controversy, or particular topics that attract public attention. Prabowo is in second place, with around 7500 mentions. Although lower than Anies, this number still shows significant interest in Prabowo. Meanwhile, Ganjar has the lowest volume of conversation, with fewer than 2000 mentions, making him the least talked-about candidate.

In Figure 14, trust is the dominant emotion in comments about Anies, with more than 6000 mentions, indicating that many commenters have a high level of trust in him. Joy is in second place, with more than 5000 mentions, indicating that Anies received many positive and joyful comments. Anticipation has around 2500 mentions, indicating the hope or expectation of the public towards Anies. Meanwhile, negative emotions such as anger, disgust, fear, sadness, and surprise have a lower number of mentions.

Prabowo also shows dominance in the emotion of trust with around 4000 mentions, indicating a high level of trust from the public. The emotion of joy is also significant with around 3000 mentions, indicating strong positive support. The emotion of anticipation appears at around 2000 mentions, indicating hope or expectation from the public towards Prabowo. Negative emotions such as anger, disgust, fear, sadness, and surprise are also present in lower numbers, indicating some criticism or concerns that need to be addressed by Prabowo’s campaign team.

Ganjar, who has a lower volume of conversation, also shows dominance in the emotion of trust, with over 1000 mentions, although this is much lower than for the other two candidates. The emotion of joy appears at around 700 mentions, signaling positive support, albeit on a smaller scale. The emotions of anticipation, anger, disgust, fear, sadness, and surprise have lower numbers of mentions but remain visible, indicating that although Ganjar received fewer comments, these emotions are still present in public discussion.

Overall, Anies has the highest number of positive mentions, with over 6000 for trust and over 5000 for joy, indicating very strong support from the public. Prabowo also has significant levels of trust and joy, but lower than Anies. Ganjar, despite a lower volume of conversation, still shows dominance in the emotions of trust and joy, signaling support, although not as great as for the other two candidates. Negative emotions are present in lower numbers for all candidates, indicating areas of concern to address in order to improve public sentiment.
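The counts behind Figures 13 and 14 can be derived from the labeled file with a simple aggregation. The sketch below assumes each labeled comment is a dictionary of 0/1 flags with lowercase column names; both the row format and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

CANDIDATES = ("anies", "prabowo", "ganjar")
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")

def summarize(rows):
    """Count comments per candidate and emotion labels per candidate."""
    volume = Counter()
    emotions = defaultdict(Counter)
    for row in rows:
        for cand in CANDIDATES:
            if row.get(cand, 0) == 1:
                volume[cand] += 1  # conversation volume (as in Figure 13)
                for emo in EMOTIONS:
                    if row.get(emo, 0) == 1:
                        emotions[cand][emo] += 1  # emotion distribution (as in Figure 14)
    return volume, emotions

# Toy labeled rows (hypothetical).
rows = [
    {"anies": 1, "joy": 1, "trust": 1},
    {"anies": 1, "anticipation": 1, "trust": 1},
    {"prabowo": 1, "anger": 1},
]
volume, emotions = summarize(rows)
```

Because the task is multi-label, one comment can contribute to several emotion counts and to more than one candidate.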

6. Conclusions

This research involved the multi-label classification of YouTube comments related to the 2024 Indonesian presidential election using three main methods, namely Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), and a combination of CNN and Bi-LSTM. This allowed the identification of various public emotions expressed in comments, such as anger, anticipation, disgust, joy, fear, sadness, surprise, and trust.

The results show that the Bi-LSTM model performs better than the CNN and CNN Bi-LSTM models in multi-label sentiment classification. The analysis shows the variation of people’s emotions related to the 2024 presidential election, where emotions such as joy and trust are more dominant towards certain candidates, while anger and disgust appear towards other candidates. The conversation volume graph shows that Anies is the most talked about candidate with more than 10,000 mentions, followed by Prabowo with around 7500 mentions, and Ganjar with less than 2000 mentions. The emotion distribution graph shows that comments related to Anies are dominated by the emotions of trust and joy, while comments related to Prabowo and Ganjar show a more balanced variation of emotions between anger, anticipation, and joy.

The Bi-LSTM model performed best with the highest accuracy, best AUC value, and lowest Hamming loss in multi-label sentiment classification, followed by CNN and CNN Bi-LSTM. This conclusion shows that although theoretically combining the CNN and Bi-LSTM models can provide additional advantages by capturing spatial and temporal features simultaneously, in practice the single Bi-LSTM model shows better performance in the context of multi-label sentiment analysis on the dataset used in this study. This research provides insight into the emotions of Indonesians regarding the 2024 presidential election and demonstrates the effectiveness of the Bi-LSTM model in multi-label sentiment classification.

The findings of this research can be used by the campaign team of a candidate pair in a general election. Using the proposed model, the campaign team can analyze controversial issues raised in a video. The team can also adjust its campaign strategy in real time once it knows the public's emotional reactions and digital behavior, and then develop a more effective campaign narrative.

Author Contributions

Conceptualization, A.N.M., D.P. and H.F.; methodology, A.N.M., D.P. and H.F.; software, A.N.M. and A.D.F.; validation, D.P. and H.F.; formal analysis, D.P. and H.F.; investigation, A.N.M. and A.D.F.; resources, A.N.M. and A.D.F.; data curation, A.N.M. and A.D.F.; writing—original draft preparation, A.N.M. and A.D.F.; writing—review and editing, D.P. and H.F.; visualization, A.N.M. and A.D.F.; supervision, D.P. and H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets analyzed or generated during the research can be found at Telkom University Dataverse at the following link: https://doi.org/10.34820/FK2/X7FPFW (accessed on 24 September 2024).

Acknowledgments

We are grateful for the publication grant from Telkom University.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Shevtsov, A.; Oikonomidou, M.; Antonakaki, D.; Pratikakis, P.; Ioannidis, S. What Tweets and YouTube Comments Have in Common? Sentiment and Graph Analysis on Data Related to US Elections 2020. PLoS ONE 2023, 18, e0270542.
2. Budiharto, W.; Meiliana, M. Prediction and Analysis of Indonesia Presidential Election from Twitter Using Sentiment Analysis. J. Big Data 2018, 5, 51.
3. Widayat, R.M.; Nurmandi, A.; Rosilawati, Y.; Natshir, H.; Syamsurrijal, M.; Baharuddin, T. Bibliometric Analysis and Visualization Articles on Presidential Election in Social Media Indexed in Scopus by Indonesian Authors. In Proceedings of the 1st World Conference on Social and Humanities Research (W-SHARE 2021), Makassar, Indonesia, 7–8 December 2021.
4. Medhat, W.; Hassan, A.; Korashy, H. Sentiment Analysis Algorithms and Applications: A Survey. Ain Shams Eng. J. 2014, 5, 1093–1113.
5. Yadollahi, A.; Shahraki, A.G.; Zaiane, O.R. Current State of Text Sentiment Analysis from Opinion to Emotion Mining. ACM Comput. Surv. 2018, 50, 1–33.
6. Eaton, J. From the Comments Section: Analyzing Online Public Discourse on the First 2020 Presidential Debate. Res. Politics 2024, 11, 20531680241271758.
7. Fathurahman, A.D.; Pramesti, D.; Fakhrurroja, H. Sentiment Analysis of Presidential Debate Videos on YouTube in the 2024 Indonesian Presidential Elections. In Proceedings of the 2024 International Conference on Data Science and Its Applications (ICoDSA), Kuta, Indonesia, 10–11 July 2024; pp. 545–550.
8. Bouazizi, M.; Ohtsuki, T. Multi-Class Sentiment Analysis on Twitter: Classification Performance and Challenges. Big Data Min. Anal. 2019, 2, 181–194.
9. Asghar, M.Z.; Khan, A.; Bibi, A.; Kundi, F.M.; Ahmad, H. Sentence-Level Emotion Detection Framework Using Rule-Based Classification. Cognit. Comput. 2017, 9, 868–894.
10. Storey, V.C.; Park, E.H. An Ontology of Emotion Process to Support Sentiment Analysis. J. Assoc. Inf. Syst. 2022, 23, 999–1036.
11. Ma’Aly, A.N.; Pramesti, D.; Fakhrurroja, H. Comparative Analysis of Deep Learning Models for Multi-Label Sentiment Classification of 2024 Presidential Election Comments. In Proceedings of the 2024 7th International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 17–18 July 2024; pp. 502–507.
12. Gargiulo, F.; Silvestri, S.; Ciampi, M. Deep Convolution Neural Network for Extreme Multi-Label Text Classification. In Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, Funchal, Portugal, 19–21 January 2018; pp. 641–650.
13. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
14. Cahyadi, A.; Khodra, M.L. Aspect-Based Sentiment Analysis Using Convolutional Neural Network and Bidirectional Long Short-Term Memory. In Proceedings of the 2018 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA), Krabi, Thailand, 14–17 August 2018; pp. 124–129.
15. Ameer, I.; Bölücü, N.; Siddiqui, M.H.F.; Can, B.; Sidorov, G.; Gelbukh, A. Multi-Label Emotion Classification in Texts Using Transfer Learning. Expert Syst. Appl. 2023, 213, 118534.
16. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:1412.3555.
17. Jang, B.; Kim, M.; Harerimana, G.; Kang, S.; Kim, J.W. Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism. Appl. Sci. 2020, 10, 5841.
18. Pati, D.; Lorusso, L.N. How to Write a Systematic Review of the Literature. HERD Health Environ. Res. Des. J. 2018, 11, 15–30.
19. Wisnubroto, A.S.; Saifunas, A.; Santoso, A.B.; Putra, P.K.; Budi, I. Opinion-Based Sentiment Analysis Related to 2024 Indonesian Presidential Election on YouTube. In Proceedings of the 2022 5th International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Virtual, 8–9 December 2022; pp. 318–323.
20. Mandhasiya, D.G.; Murfi, H.; Bustamam, A.; Anki, P. Evaluation of Machine Learning Performance Based on BERT Data Representation with LSTM Model to Conduct Sentiment Analysis in Indonesian for Predicting Voices of Social Media Users in the 2024 Indonesia Presidential Election. In Proceedings of the 2022 5th International Conference on Information and Communications Technology (ICOIACT), Online, 24–25 August 2022; pp. 441–446.
21. Jabreel, M.; Moreno, A. A Deep Learning-Based Approach for Multi-Label Emotion Classification in Tweets. Appl. Sci. 2019, 9, 1123.
22. Macrohon, J.J.E.; Villavicencio, C.N.; Inbaraj, X.A.; Jeng, J.-H. A Semi-Supervised Approach to Sentiment Analysis of Tweets during the 2022 Philippine Presidential Election. Information 2022, 13, 484.
23. He, H.; Xia, R. Joint Binary Neural Network for Multi-Label Learning with Applications to Emotion Classification. In Natural Language Processing and Chinese Computing: 7th CCF International Conference, NLPCC 2018, Hohhot, China, August 26–30, 2018, Proceedings, Part I 7; Springer: Cham, Switzerland, 2018; pp. 250–259.
24. Irtiza Tripto, N.; Eunus Ali, M. Detecting Multilabel Sentiment and Emotions from Bangla YouTube Comments. In Proceedings of the 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), Sylhet, Bangladesh, 21–22 September 2018; pp. 1–6.
25. Samy, A.E.; El-Beltagy, S.R.; Hassanien, E. A Context Integrated Model for Multi-Label Emotion Detection. Procedia Comput. Sci. 2018, 142, 61–71.
26. Firmansyah, F.; Zulfikar, W.B.; Maylawati, D.S.; Arianti, N.D.; Muliawaty, L.; Septiadi, M.A.; Ramdhani, M.A. Comparing Sentiment Analysis of Indonesian Presidential Election 2019 with Support Vector Machine and K-Nearest Neighbor Algorithm. In Proceedings of the 2020 6th International Conference on Computing Engineering and Design (ICCED), Sukabumi, Indonesia, 15–16 October 2020; pp. 1–6.
27. Manik, L.P.; Febri Mustika, H.; Akbar, Z.; Kartika, Y.A.; Ridwan Saleh, D.; Setiawan, F.A.; Atman Satya, I. Aspect-Based Sentiment Analysis on Candidate Character Traits in Indonesian Presidential Election. In Proceedings of the 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), Virtual Conference, 18–20 November 2020; pp. 224–228.
28. Mohammad, S.M.; Zhu, X.; Kiritchenko, S.; Martin, J. Sentiment, Emotion, Purpose, and Style in Electoral Tweets. Inf. Process. Manag. 2015, 51, 480–499.
29. Schröer, C.; Kruse, F.; Gómez, J.M. A Systematic Literature Review on Applying CRISP-DM Process Model. Procedia Comput. Sci. 2021, 181, 526–534.
30. Schwartz, H.A.; Ungar, L.H. Data-Driven Content Analysis of Social Media. Ann. Am. Acad. Pol. Soc. Sci. 2015, 659, 78–94.
31. Baziotis, C.; Nikolaos, A.; Chronopoulou, A.; Kolovou, A.; Paraskevopoulos, G.; Ellinas, N.; Narayanan, S.; Potamianos, A. NTUA-SLP at SemEval-2018 Task 1: Predicting Affective Content in Tweets with Deep Attentive RNNs and Transfer Learning. In Proceedings of the 12th International Workshop on Semantic Evaluation; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 245–255.
32. Zhu, M.; Xia, J.; Jin, X.; Yan, M.; Cai, G.; Yan, J.; Ning, G. Class Weights Random Forest Algorithm for Processing Class Imbalanced Medical Data. IEEE Access 2018, 6, 4641–4652.
33. Irawaty, I.; Andreswari, R.; Pramesti, D. Vectorizer Comparison for Sentiment Analysis on Social Media Youtube: A Case Study. In Proceedings of the 2020 3rd International Conference on Computer and Informatics Engineering (IC2IE), Yogyakarta, Indonesia, 15–16 September 2022; pp. 69–74.

Figure 1. YouTube searches for “Election” in Indonesia.

Figure 2. Process of the Systematic Literature Review (SLR) illustrated using the PRISMA diagram.

Figure 3. CRISP-DM methodology.

Figure 4. Distribution of comments grouped by video source.

Figure 5. GPT labelling process.

Figure 6. CNN confusion matrix.

Figure 7. Bi-LSTM confusion matrix.

Figure 8. CNN Bi-LSTM confusion matrix.

Figure 9. Area Under Curve results.

Figure 10. Hamming loss results.

Figure 11. Deployment process classification.

Figure 12. Deployment classification result.

Figure 13. Volume of conversation for each candidate graph.

Figure 14. Emotional distribution for each candidate.

Table 1. Final systematic literature review extraction.

Reference	Case Study	Algorithm	Multi-Label	Multi-Class	Accuracy	Key Findings
[19]	Opinion-based sentiment analysis related to 2024 presidential election on YouTube	LR, VOT, RF, DT, KNN, XGB, LGBM, AB	N/A	N/A	0.77 (LR), 0.76 (VOT), 0.77 (RF), 0.65 (DT), 0.57 (KNN), 0.76 (XGB), 0.69 (LGBM), 0.70 (AB)	This study found that most opinions regarding the 2024 Indonesian presidential candidates on YouTube are negative, with user interactions showing a pattern of connected counter-responses.
[20]	Sentiment analysis of social media users’ voices regarding the 2024 Indonesian presidential election	LSTM and BERT	N/A	N/A	0.83 (BERT) and 0.85 (BERT-LSTM)	The key finding is that the BERT-LSTM model outperformed the BERT model with an accuracy of 0.8783 in the sentiment analysis of YouTube comments related to the 2024 Indonesian presidential election.
[21]	SemEval-2018 Task 1: Affect in Tweets	Binary Neural Network (BNet)	Yes	N/A	0.59	The study developed a new classification method for multi-label data on Twitter. The method transforms data into binary classification problems and uses deep learning to solve them. The study’s method achieved an accuracy score of 0.59 on SemEval-2018 Task 1.
[22]	Sentiment analysis of tweets during the 2022 Philippine presidential election	Naïve Bayes	N/A	Yes	0.84	The key finding is that the model achieved 84.83% accuracy in classifying English and Tagalog tweets during the Philippines’ presidential race.
[23]	Ren-CECps corpus	Joint Binary Neural Network (JBNN)	Yes	N/A	N/A	The study presents a Joint Binary Neural Network (JBNN) that enhances multi-label emotion classification by synchronously performing binary classifications and capturing label relationships, outperforming existing methods.
[24]	YouTube video comments in the Bangla language	LSTM and CNN	Yes	N/A	0.59 (LSTM) and 0.54 (CNN)	The study successfully developed a deep learning model to analyze sentiment and extract emotions from Bangla text, demonstrating significant potential in a less-explored language context.
[25]	SemEval-2017 Task 4: Topic-Based Sentiment Classification and SemEval-2018	Context-Aware Gated Recurrent Unit (C-GRU)	Yes	N/A	0.53	The C-GRU (Context-Aware Gated Recurrent Units) model effectively enhances the sentiment analysis of tweets by incorporating contextual information (topics) and demonstrating superior performance in Arabic multi-label emotion classification compared to the highest reported results on the SemEval-2018 dataset.
[2]	Tweets related to the 2019 Indonesian presidential election	TextBlob	N/A	Yes	N/A	The study developed a new approach that utilized Twitter to predict election results, which in this case was the 2019 Indonesian presidential election. The method itself used the R language to predict the result.
[26]	Tweets related to the 2019 Indonesian presidential election	Support Vector Machine (SVM) and K-Nearest Neighbor (KNN)	N/A	N/A	0.69 (SVM average) and 0.61 (KNN average)	Automatic content analysis of social media reveals insights into well-being, health, gender differences, and personality, offering advantages over traditional survey methods.
[27]	Tweets related to the 2019 Indonesian presidential election	Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbor (KNN)	N/A	N/A	0.89 (SVM), 0.78 (NB), 0.82 (KNN)	The study finds that Support Vector Machine outperforms Naïve Bayes and K-Nearest Neighbor in the aspect-based sentiment analysis of Indonesian presidential candidates.
[28]	Tweets related to 2012 US presidential election	Support Vector Machine (SVM)	N/A	Yes (11-Class)	0.435 (SVM)	The study annotates 2012 US presidential election tweets for sentiment and emotions, finding that purpose differs from emotions, and establishes baseline results for automatic classifiers.
This Work	YouTube video comments about Indonesia presidential election	CNN, Bi-LSTM, CNN-Bi-LSTM	Yes (8 emotions, 3 candidates)	No	0.984 (Bi-LSTM)	The Bi-LSTM model outperforms others in the multi-label sentiment analysis of YouTube comments on the 2024 Indonesian presidential election, achieving an AUC of 0.91.

Table 2. Preprocessing process.

No.	Before	After
1	kpu di isi orang orng baik. Pasti Amanah… Dan insyllah bpk anis b menang satu putaran…❤ ¹#perubahan (The KPU is filled with good people. It will surely be trustworthy… And God willing, Mr. Anies will win in one round… ❤ ¹ #change)	kpu di isi orang baik pasti amanah dan insyallah bapak anies menang satu putaran perubahan (the kpu is filled with good people it will surely be trustworthy and god willing mr anies will win in one round change)
2	Ketika debat terlihat sekali bahwa pasangan pak Anies dan cak Imin ini sangat pintar dalam memaparkan penjelasan sangat cerdas dan berwibawa. 😮 ² (During the debate, it was very clear that Mr. Anies and Cak Imin’s pair were very intelligent in presenting their explanations—very smart and authoritative. 😮 ²)	ketika debat terlihat bahwa pasangan pak anies dan cak imin ini sangat pintar dalam memaparkan penjelasan sangat cerdas dan berwibawa (during the debate it was very clear that mr anies and cak imins pair were very intelligent in presenting their explanations very smart and authoritative)
3	fans prabowo brisik banget di sosmed, nanti kalah lagi yang di salahin KPU. WKWKWKWKWKW Ganjar is the best (Fans of prabowo are so noisy on social media; if they lose again, they will blame the KPU. Haha, Ganjar is the best!)	fans prabowo brisik banget di sosial media nanti kalah lagi yang di salahin kpu ganjar is the best (fans of prabowo are so noisy on social media if they lose again they will blame the kpu haha ganjar is the best)

¹ Represents love or affection. ² Indicates a surprised or amazed reaction.

Table 3. CNN layers.

Layer	Type	Output Shape	Parameter
1	Embedding	(None, None, 64)	1,946,944
2	Conv1D	(None, None, 128)	41,088
3	Max_Pooling1D	(None, None, 128)	0
4	Conv1D_1	(None, None, 128)	82,048
5	Max_Pooling1D	(None, None, 128)	0
6	Global_Max_Pooling1D	(None, 128)	0
7	Dropout	(None, 128)	0
8	Dense	(None, 128)	16,512
9	Dropout_1	(None, 128)	0
10	Dense_1	(None, 11)	1419

Table 4. Bi-LSTM layers.

Layers	Type	Output Shape	Parameter
1	embedding (Embedding)	(None, None, 64)	1,946,944
2	dropout (Dropout)	(None, None, 64)	0
3	Bidirectional (Bidirectional)	(None, None, 32)	10,368
4	dropout_1 (Dropout)	(None, None, 32)	0
5	Bidirectional_1(Bidirectional)	(None, 32)	6272
6	dropout_2 (Dropout)	(None, 32)	0
7	dense (Dense)	(None, 11)	363
8	dropout_3 (Dropout)	(None, 11)	0
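
The parameter counts in Table 4 can be reproduced with the standard formulas for embedding, LSTM, and dense layers. In the sketch below, the vocabulary size (30,421) and the number of LSTM units per direction (16) are assumptions inferred from the reported counts and output shapes, not values stated in the text.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim: int, units: int) -> int:
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """Forward and backward LSTMs have separate weights."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim: int, outputs: int) -> int:
    """Weight matrix plus one bias per output."""
    return input_dim * outputs + outputs

VOCAB_SIZE = 30_421  # assumption: 1,946,944 / 64
UNITS = 16           # assumption: half of the 32-wide bidirectional output

counts = {
    "embedding": embedding_params(VOCAB_SIZE, 64),
    "bidirectional": bidirectional_lstm_params(64, UNITS),
    "bidirectional_1": bidirectional_lstm_params(2 * UNITS, UNITS),
    "dense": dense_params(2 * UNITS, 11),
}
```

Under these assumptions the four counts come out to 1,946,944, 10,368, 6272, and 363, matching the Parameter column of Table 4.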

Table 5. CNN Bi-LSTM layers.

Layers	Type	Output Shape	Parameter
1	embedding (Embedding)	(None, None, 64)	1,946,944
2	dropout (Dropout)	(None, None, 64)	0
3	conv1d (Conv1D)	(None, None, 16)	5136
4	max_pooling1d (MaxPooling1D)	(None, None, 16)	0
5	bidirectional (Bidirectional)	(None, None, 32)	4224
6	dropout_1 (Dropout)	(None, None, 32)	0
7	bidirectional_1 (Bidirectional)	(None, 32)	6272
8	dropout_2 (Dropout)	(None, 32)	0
9	dense (Dense)	(None, 11)	363
10	dropout_3 (Dropout)	(None, 11)	0

Table 6. CNN confusion matrix calculation results.

Label	Accuracy	Precision	Recall	F1-Score
Anger	0.8336	0.6839	0.5521	0.6109
Anticipation	0.9267	0.3227	0.4826	0.3867
Disgust	0.8185	0.7006	0.5398	0.6097
Fear	0.9459	0.6210	0.5804	0.6000
Joy	0.7766	0.7625	0.5867	0.6631
Sadness	0.9433	0.5494	0.4065	0.4672
Surprise	0.9224	0.6273	0.3221	0.4256
Trust	0.8166	0.8088	0.6800	0.7388
Anies	0.9591	0.9705	0.8650	0.9147
Prabowo	0.9691	0.9827	0.8766	0.9266
Ganjar	0.9964	0.9904	0.9626	0.9763

Table 7. Bi-LSTM confusion matrix calculation results.

Label	Accuracy	Precision	Recall	F1 Score
Anger	0.8532	0.7375	0.5895	0.6552
Anticipation	0.9499	0.4767	0.4674	0.4720
Disgust	0.8328	0.7350	0.5684	0.6410
Fear	0.9539	0.7116	0.5729	0.6347
Joy	0.8198	0.8217	0.6631	0.7339
Sadness	0.9538	0.7011	0.4269	0.5306
Surprise	0.9258	0.6538	0.3571	0.4619
Trust	0.8292	0.8548	0.6651	0.7481
Anies	0.9928	0.9966	0.9750	0.9856
Prabowo	0.9954	0.9981	0.9813	0.9896
Ganjar	0.9964	0.9904	0.9626	0.9763

Table 8. CNN Bi-LSTM confusion matrix calculation results.

Label	Accuracy	Precision	Recall	F1 Score
Anger	0.8292	0.7071	0.4747	0.5680
Anticipation	0.9436	0.4105	0.4087	0.4095
Disgust	0.8085	0.7007	0.4732	0.5649
Fear	0.9438	0.6650	0.3958	0.4962
Joy	0.7495	0.7649	0.4789	0.5890
Sadness	0.9400	0.5172	0.3061	0.3845
Surprise	0.9164	0.5758	0.2392	0.3379
Trust	0.7941	0.8054	0.6067	0.6920
Anies	0.9343	0.9650	0.7690	0.8559
Prabowo	0.9428	0.9710	0.7662	0.8565
Ganjar	0.9803	0.9828	0.7610	0.8578

Table 9. Comparison of accuracy.

Model	Accuracy
Model	Anger	Anticipation	Disgust	Fear	Joy	Sadness	Surprise	Trust	Anies	Prabowo	Ganjar
CNN	0.8336	0.9267	0.8185	0.9459	0.7766	0.9433	0.9224	0.8166	0.9591	0.9691	0.9865
Bi-LSTM	0.8532	0.9499	0.8328	0.9539	0.8198	0.9538	0.9258	0.8292	0.9928	0.9954	0.9964
CNN Bi-LSTM	0.8292	0.9436	0.8085	0.9438	0.7495	0.9400	0.9164	0.7941	0.9343	0.9428	0.9803

Table 10. Comparison of precision.

Model	Precision
Model	Anger	Anticipation	Disgust	Fear	Joy	Sadness	Surprise	Trust	Anies	Prabowo	Ganjar
CNN	0.6839	0.3227	0.7006	0.6210	0.7625	0.5494	0.6273	0.8088	0.9705	0.9827	0.9390
Bi-LSTM	0.7375	0.4767	0.7350	0.7116	0.8217	0.7011	0.6538	0.8548	0.9966	0.9981	0.9904
CNN Bi-LSTM	0.7071	0.4105	0.7007	0.6650	0.7649	0.5172	0.5758	0.8054	0.9650	0.9710	0.9828

Table 11. Comparison of recall.

Model	Recall
Model	Anger	Anticipation	Disgust	Fear	Joy	Sadness	Surprise	Trust	Anies	Prabowo	Ganjar
CNN	0.5521	0.4826	0.5398	0.5804	0.5867	0.4065	0.3221	0.6800	0.8650	0.8766	0.8838
Bi-LSTM	0.5895	0.4674	0.5684	0.5729	0.6631	0.4269	0.3571	0.6651	0.9750	0.9813	0.9626
CNN Bi-LSTM	0.4747	0.4087	0.4732	0.3958	0.4789	0.3061	0.2392	0.6067	0.7690	0.7662	0.7610

Table 12. Comparison of F1-Score.

Model	F1-Score
Model	Anger	Anticipation	Disgust	Fear	Joy	Sadness	Surprise	Trust	Anies	Prabowo	Ganjar
CNN	0.6109	0.3867	0.6097	0.6000	0.6631	0.4672	0.4256	0.7388	0.9147	0.9266	0.9763
Bi-LSTM	0.6552	0.4720	0.6410	0.6347	0.7339	0.5306	0.4619	0.7481	0.9856	0.9896	0.9763
CNN Bi-LSTM	0.5680	0.4095	0.5649	0.4962	0.5890	0.3845	0.3379	0.6920	0.8559	0.8565	0.8578

Table 13. Comparison of AUC.

Model	AUC
Model	Anger	Anticipation	Disgust	Fear	Joy	Sadness	Surprise	Trust	Anies	Prabowo	Ganjar	Avg
CNN	0.86	0.84	0.85	0.91	0.85	0.85	0.85	0.88	0.97	0.97	0.98	0.89
Bi-LSTM	0.89	0.89	0.88	0.93	0.90	0.90	0.88	0.91	0.99	1.00	1.00	0.92
CNN Bi-LSTM	0.82	0.82	0.82	0.86	0.82	0.82	0.80	0.85	0.94	0.94	0.93	0.86

Table 14. Comparison of Hamming loss.

Hamming Loss
Model	Threshold
Model	0.10	0.20	0.30	0.40	0.50	0.60	0.70	0.80	0.90
CNN	0.2112	0.1314	0.1124	0.1038	0.1002	0.1000	0.1019	0.1080	0.1186
Bi-LSTM	0.1397	0.1024	0.0877	0.0821	0.0816	0.0843	0.0915	0.1060	0.1305
CNN Bi-LSTM	0.2283	0.1363	0.1203	0.1127	0.1107	0.1116	0.1157	0.1256	0.1423

Table 15. Original YouTube comments.

CommentID	Author	Comment_Text
7	@wilson.57	Keren.NAMANYA USAHA.usaha menjatuhkan.Anis bela HTI 😂😂😂 ¹
8	@ntiiofficial.799	Salfok sm ada bapak “tiba” ngomong bacot pas prabowo jelasin HAM
9	@KeseharianNasibArawana	Yang bener pahlawan Tanah air itu yah prabowo. Yang pahlawan siang bolong itu yah Anies 😂😂 ¹
10	@adityaindrahadi5091	Pinter banget anjir ngomongnya si anis
11	@elocopilot6545	Kenapa judulnya harus “DEBAT”? Kenapa nggak dikasi judul "ADU GAGASAN & WAWASAN" atau apa gitu? yang konotasi lebih positive drpd "DEBAT"?
12	@monkeydluffy2806	Hahahah ketar ketir kalian Terbukti All in prabowo😂😂😂 ¹
6	@KeseharianNasibArawana	Yang bener pahlawan Tanah air itu yah prabowo. Yang pahlawan siang bolong itu yah Anies 😂😂 ¹

¹ Denotes laughter or amusement.

Table 16. Dataset classification results.

ID	Cleaned_Text	Anticipation	Joy	Trust	Anies
137	setelah melihat debat kali ini keluarga besar saya dukung amin	0	1	1	1
138	pak anies baswedan dan muhaimin iskandar memang pasangan yang saling melengkapi	1	1	1	1
139	saya pastikan keluarga besar saya dukung amin untuk indonesia makmur	1	0	1	1
140	pak anies baswedan adalah pemimpin masa depan indonesia jakarta adalah bukti nyata	0	1	1	1
141	luar biasa kecerdasan pak anies baswedan memaparkan program kerjanya	0	1	1	1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
