1. Introduction
The rapid expansion of the internet has increased accessibility to vast numbers of user-generated text data, thereby propelling the field of Natural Language Processing (NLP). In particular, opinion mining and sentiment analysis have garnered significant attention due to their applicability in diverse fields such as social network analysis and customer service [1]. Sentiment analysis classifies the sentimental state expressed in sentences or documents as positive, negative, or neutral [2,3,4]. Emotion analysis classifies emotions into predefined categories such as joy, anger, and sadness, providing a more detailed spectrum [5]. Sentiment analysis and emotion analysis have evolved into Aspect-Based Sentiment Analysis (ABSA) [6,7,8] and Aspect-Based Emotion Analysis (ABEA) [9,10]. ABSA tasks primarily address review data, aiming to identify varying sentiment polarities associated with specific aspects (targets). For instance, in the review “The cafe’s atmosphere is good, but the coffee is terrible”, the aspects are “atmosphere” and “coffee”, with positive polarity for “atmosphere” and negative polarity for “coffee”.
In the ABSA task, how to handle targets is an important factor. Since the appearance of Bidirectional Encoder Representations from Transformers (BERT), research on Targeted ABSA (TABSA), or Aspect Sentiment Classification (ASC), has made significant progress. The focus is on how to construct the input as a sentence pair classification problem [11,12]. In other words, the sentence provides the context, and the target functions as a question whose answer is the sentiment. This methodology has spurred research that emphasizes the target using BERT’s context-aware representations. Research in [13] applied max pooling to the target vector to extract the most significant information. Another study [14] introduced multi-head attention between the sentence and target to obtain a target-specific context representation.
Compared to ABSA, ABEA requires a more complex model that focuses on targets in order to clarify the emotional spectrum. Research on emotion analysis and ABEA has been relatively delayed, partly due to the inherent challenges in categorizing human emotion, which has led to considerable debate within the academic community [15,16]. Contrary to sentiment analysis, which employs mutually exclusive binary or ternary classification schemes, emotion analysis is addressed as multi-label classification. This task must accommodate label imbalance and requires a more elaborate model.
The lack of research on emotion analysis and ABEA is more pronounced for the Korean language. While sentiment analysis and ABSA have been extensively explored [17,18,19], the progression into emotion analysis and ABEA has been more gradual. The recent proliferation of social media has significantly accelerated the creation of Korean-specific datasets [20]. However, research that uses these datasets to develop emotion analysis models for Korean texts is still emerging.
The unique features of the Korean language are shown in Table 1. These linguistic characteristics have made research on Korean models challenging, and employing an attention mechanism to concentrate on targets within sentences might help identify emotions embedded in complex linguistic structures. First, Korean tends to follow a spiral thought pattern, preferring to deliver information comprehensively across sentences or paragraphs [21]. This structural preference necessitates clear target information to understand the accurate meaning of a sentence. Second, Korean is an agglutinative language, and its morphological structure often leads to ambiguous boundaries around targets because various morphemes, such as Josa (postpositions) and Eomi (verb endings), attach before and after a target [22]. By focusing on specific targets, models can clearly separate and understand the target. Third, Korean frequently employs ellipsis: sentences often rely on context for meaning, with subjects or objects sometimes omitted [23]. Attention mechanisms can enhance the model’s capability to retain information relevant to the targets.
Target-based analysis might also address challenges arising from the unrefined nature of social media data and the complexity of its broad spectrum of content areas. Social media encompass a variety of sources, such as entertainment, sports, and literature, and are characterized by the frequent emergence of unseen words. To address this dynamic and qualitative data expansion, domain-specific Pre-Trained Language Models (PLMs) have emerged [24,25]. By leveraging these domain-sensitive embeddings and focusing on targets within diverse and evolving data, models can quickly adapt to new terms and expressions. This approach helps models capture relevant information and filter out irrelevant noise despite the informal and varied language use.
The objective of this paper is to propose a target-attention-based emotion classifier. The model applies an attention mechanism to vectors encoded by a PLM, attending intently to the target vector for Targeted ABEA (TABEA) in Korean. We explore various methods for formulating the target vector and implementing attention. To mitigate the potential issue of overlooking broader sentence context due to overconcentration on the target vector, we consider strategies to regulate the attention output. In addition to traditional sentence pair classification methods, we also attempt a method that represents the target solely through target attention, without including it explicitly in the input. In the experiment section, we design an evaluation method to measure the model’s ability to identify contrasting emotional states within a single text. We collect evaluation samples, which we call Multi-Target Multi-Emotion (MTME), to assess this ability. Through these approaches, we aim to implement a target-attention-based model that addresses the TABEA task for data characterized by the linguistic features of the Korean language and the unique properties of social media.
The contributions of this paper can be summarized as follows:
We propose the Korean Target-Attention-Based Emotion Classifier (KOTAC), designed to address a TABEA task. This model uniquely caters to data influenced by the linguistic features of the Korean language and the dynamics of social media. By applying an attention mechanism to targets within sentences, KOTAC aids in unveiling emotions buried within intricate language patterns. The proposed KOTAC outperforms a model that relies solely on a sentence pair approach;
To the best of our knowledge, the present study is the first attempt to systematically deal with sentence targets and to explore various ways of formulating target vectors. This effort contributes to developing a Korean language model that reflects syntactic features;
This study not only investigates methods to obtain representations focused on the target but also explores how to utilize them. Based on experimental results, this study further analyzes the relationships between methods, demonstrating how our findings can be interconnected and applied;
To evaluate performance in distinguishing the contrasting affective states connected to separate targets in one text, which is key to ABSA and ABEA tasks, this study proposes a new evaluation dataset named MTME. The proposed KOTAC model achieves particularly outstanding results on it.
The outline of this paper is as follows: Section 2 presents a review of related research; Section 3 details the methodology of the proposed model; Section 4 describes experimental settings and scenarios; Section 5 discusses the findings from our experiments; and Section 6 draws conclusions.
2. Related Work
2.1. Long Short-Term Memory
Long Short-Term Memory (LSTM) networks [26] are utilized in NLP to handle sequences with long-range dependencies and capture sequential characteristics effectively. An LSTM unit computes its state through a series of gates: an input gate $i_t$, a forget gate $f_t$, and a cell input modulation gate $g_t$. These gates collectively determine the flow of information and allow the network to retain or discard information across time steps, which is essential for handling long sequences. The core LSTM equations are as follows:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$$
At each time step, the memory cell $c_t$ is updated by combining the previous cell state $c_{t-1}$, which is modulated by the forget gate, with the current candidate values produced by the input gate and the cell input modulation gate. This update is captured by the following:

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
The hidden state $h_t$, which serves as the output at each step, is then computed by filtering the updated cell state through the output gate $o_t$, modulated by the non-linearity of the hyperbolic tangent function as follows:

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
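As a minimal sketch of these gate computations (assuming stacked weight matrices $W$, $U$ and bias $b$ with the gate order input, forget, cell-input, output; not the authors' implementation), a single LSTM step can be written in PyTorch as follows:

```python
import torch

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with stacked gate parameters (hypothetical layout).

    W: (4*hidden, input_dim), U: (4*hidden, hidden), b: (4*hidden,).
    Assumed gate order: input, forget, cell-input (g), output.
    """
    hidden = h_prev.size(-1)
    z = x_t @ W.T + h_prev @ U.T + b              # all gate pre-activations at once
    i, f, g, o = z.split(hidden, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)
    c_t = f * c_prev + i * g                      # cell state update
    h_t = o * torch.tanh(c_t)                     # hidden state via the output gate
    return h_t, c_t
```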
2.2. Attention
The advent of attention mechanisms [27,28] has led to significant performance improvements across various tasks in NLP. Among these, scaled dot-product attention has become a pivotal component, enhancing models’ ability to focus on relevant parts of the input data [29]. The mechanism computes the attention scores by taking the dot product of the query $Q$ with all keys $K$, scales these scores by the square root of the key dimension $d_k$ to prevent extremely large values, and applies a softmax function to obtain the final weights on the values $V$. Mathematically, it is represented as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{7}$$
A special case of this mechanism is self-attention, where Q, K, and V are all the same matrix derived from the input data. This self-referential approach enables models to assess and assign importance to each part of the input relative to the rest, facilitating the extraction of more nuanced and context-rich information.
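A minimal sketch of this computation in PyTorch (an illustration of the standard formulation, not the authors' code):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q: (..., n_q, d_k), K: (..., n_k, d_k), V: (..., n_k, d_v).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:                          # e.g., to ignore padding positions
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

# Self-attention is the special case where Q, K, and V come from the same matrix:
# x = torch.randn(2, 10, 64); out, w = scaled_dot_product_attention(x, x, x)
```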
2.3. Transformer and Pre-Trained Language Model
The self-attention mechanism is a foundational component of Transformer models, which power many of the current state-of-the-art PLMs. The Transformer enables PLMs to understand and generate human language with remarkable accuracy. In a standard Transformer, for each token $x_i$ in the input sequence, the initial embedding $h_i^{0}$ is computed by combining token and positional embeddings as follows:

$$h_i^{0} = W_E x_i + P_i$$

Here, $W_E$ represents the embedding matrix that transforms token indices into embeddings, and $P_i$ represents the positional embedding that provides information about the position of each token in the sentence. These embeddings collectively contribute to a comprehensive representation of each token, considering both its intrinsic value and its contextual positioning within the sequence.
Each head $j$ in a multi-head self-attention layer independently computes a distinct attention output as follows:

$$\mathrm{head}_j = \mathrm{Attention}\big(QW_j^{Q},\, KW_j^{K},\, VW_j^{V}\big)$$

This multi-head architecture allows the model to attend to information from different representation sub-spaces at different positions.
Following the attention mechanisms, each Transformer layer includes a position-wise Feed-Forward Network (FFN), which applies further transformations to the attended representations as follows:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$
To stabilize the learning process and enhance model training, a layer normalization step is employed:

$$h^{l} = \mathrm{LayerNorm}\big(h^{l-1} + \mathrm{Sublayer}(h^{l-1})\big)$$

Here, $h^{l}$ represents the hidden state for each input token, effectively encapsulating the information learned by the model through the processing layers for each token.
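The pieces above can be composed into a tiny encoder sketch using PyTorch's built-in Transformer layer; the vocabulary size, model width, and head count below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Token + positional embeddings followed by one Transformer encoder layer."""
    def __init__(self, vocab_size=30000, d_model=128, max_len=512, nhead=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)   # token embedding matrix
        self.pos = nn.Embedding(max_len, d_model)      # learned positional embeddings
        self.layer = nn.TransformerEncoderLayer(       # multi-head attention + FFN + LayerNorm
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h0 = self.tok(token_ids) + self.pos(positions) # h^0 = token + positional embedding
        return self.layer(h0)                          # contextualized hidden states
```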
BERT revolutionizes NLP by pre-training on extensive datasets using bidirectional training, capturing nuanced contexts [30]. It introduces special tokens. The classification token ([CLS]) represents the start of inputs and encapsulates the entire sentence’s information. [SEP] separates sentence pairs, facilitating tasks like question answering and natural language inference. This structure has enabled BERT to achieve remarkable performance across a variety of sentence pair classification tasks.
BERT has laid the foundation for advanced language models, yet it requires substantial computational resources for pre-training. Subsequently, ELECTRA, introduced in [31], presents a more sample-efficient pre-training approach known as replaced token detection. Unlike BERT’s Masked Language Model (MLM) strategy, which masks words and predicts their original values, ELECTRA transforms the pre-training task by generating and distinguishing between “real” and “replaced” tokens in the input text. This pre-training method is more efficient than the MLM because the model learns from all input tokens.
Fine-tuning PLMs on task-specific datasets allows them to adapt the general pre-training to specialized applications, significantly improving their performance on downstream tasks. The parameters are refined by adjusting the initial parameters through a series of updates. The equation below illustrates this process:

$$\theta' = \theta - \eta \nabla_{\theta} \mathcal{L}(\theta; D)$$

Here, $\theta$ represents the parameters pre-trained on a general dataset, and $\theta'$ denotes the parameters after fine-tuning. The term $\eta$ is the learning rate, which controls the step size during the update. The function $\mathcal{L}(\theta; D)$ is the loss function, computed using the task-specific dataset $D$. The gradient $\nabla_{\theta} \mathcal{L}(\theta; D)$ guides the parameter updates to minimize the loss, thus refining the model’s ability to perform specific tasks more effectively.
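A hedged sketch of one such update step in PyTorch (the rule above is plain gradient descent; in practice an optimizer such as AdamW applies a closely related update, and `model`, `batch`, `labels`, and `loss_fn` below are placeholders for a task-specific setup):

```python
import torch

def fine_tune_step(model, optimizer, batch, labels, loss_fn):
    """One gradient update: theta' = theta - eta * grad L(theta; D).

    The learning rate (eta) lives inside `optimizer`,
    e.g. torch.optim.AdamW(model.parameters(), lr=2e-5).
    """
    optimizer.zero_grad()
    logits = model(**batch)         # forward pass on task-specific data D
    loss = loss_fn(logits, labels)  # loss L(theta; D)
    loss.backward()                 # gradients w.r.t. the pre-trained parameters
    optimizer.step()                # apply the update
    return loss.item()
```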
2.4. Aspect-Based Sentiment Analysis and Aspect-Based Emotion Analysis
ABSA is a sub-field of sentiment analysis that identifies the sentiment with respect to specific aspects of a subject of concern [6,32]. This task involves four elements: aspect term, aspect category, opinion term, and sentiment polarity [33]. For example, in the sentence “The coffee is terrible”, the elements are coffee, food, terrible, and negative. Targeted ABSA (TABSA) focuses on determining the sentiment polarity given an aspect term. Research in ABSA has predominantly utilized sentence pair classification since the advent of BERT [11,12,34]. Notable research has been conducted using context-aware representations of BERT to emphasize targets. Ref. [14] composed sequence inputs as “[CLS] + context + [SEP] + target + [SEP]” for the vanilla BERT, and ref. [35] tested reversing the order of context and target. These input forms are regarded as the most basic structure for ABSA. Ref. [14] employed multi-head attention, opting for separate inputs for context and target, formatted as “[CLS] + context + [SEP]” and “[CLS] + target + [SEP]”, respectively, using padding to manage varying lengths. This interactive approach could capture target-specific context representation. However, they did not explore combining sentence pair and attention approaches. Ref. [13] designed a method where max pooling is applied to vectors corresponding to the target words, taking the maximum value across each dimension of these vectors, and then this result is concatenated with the [CLS] token before proceeding to the fully connected layer for classification. They contrasted this method with a model where the output at [CLS] is directly followed by a fully connected layer, not incorporating any target information in the input. This comparison suggests the importance of awareness of the target information. Ref. [36] proposed that relying solely on the final [CLS] for classification can ignore rich semantic knowledge contained in the intermediate layers, thereby utilizing the [CLS] tokens of all intermediate layers for LSTM or attention pooling.
Emotion analysis explores deeper affective states beyond merely classifying sentiments as positive or negative. Figure 1 shows the differences between sentiment analysis and emotion analysis. One key distinction is that not all positive or negative sentiments are equal [5]. This task involves categorizing emotions into predefined categories such as joy, anger, fear, and sadness using multi-label classification. This approach enables the identification of multiple emotions that may coexist within a single text. Because emotion has a complex spectrum, emotion analysis has also been studied in ways that use the correlation between emotions [37,38]. Transitioning from sentiment to emotion in the context of ABSA leads to ABEA. ABEA shifts the focus from determining sentiment polarity to classifying the emotions related to aspects [9,10]. Table 2 shows the primary datasets of the sentiment analysis and emotion analysis tasks.
3. Methods
In this section, we explain our KOTAC model, which uses scaled dot-product attention [29] to attend to the target, where the query is the target and the key and value are the sentences. Unlike multi-head attention, which addresses the entire context and relationships within the data, we opted for scaled dot-product attention to emphasize specific segments, particularly the target. We made this decision to enhance computational efficiency and reduce the risk of overfitting. By concentrating on specific parts of the input data, the model assigns weights to target words. This approach is expected to more effectively capture emotional states based on the relevance between the sentence and the target.
Figure 2 shows the entire process of our methods, and Table 3 shows all considerations for our model. Section 3.1 and Section 3.3 describe the process of constructing target vectors. Section 3.2, Section 3.4, and Section 3.5 explore the application of the attention output.
3.1. Target Selection for Query
The designation of the target that becomes the query is a pivotal factor in our attention mechanism. To clarify the target included in the sentence, we construct the input in the form of a sentence pair. For example, the input for the sentence “The cafe’s atmosphere is good, but the coffee is terrible” with the target “coffee” is formed as “[CLS] The cafe’s atmosphere is good, but the coffee is terrible. [SEP] coffee [SEP]”. The target (coffee) appears twice: before the first [SEP] token as part of the sentence, denoted as the Internal Target (InT), and after the first [SEP] token, standing alone, denoted as the External Target (ExT). This distinction matters for how the attention mechanism processes the target and its relationship with the context. InT is computed in proximity to adjacent words because it is influenced by its context; ExT is isolated between the two [SEP] tokens, devoid of surrounding context.
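As a hedged sketch of this input construction (using a Hugging Face tokenizer with an illustrative multilingual checkpoint rather than the Korean PLM actually used; subword mismatches between the in-sentence and standalone target may require extra care):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the actual Korean PLM used in the paper may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentence = "The cafe's atmosphere is good, but the coffee is terrible."
target = "coffee"

# Sentence-pair encoding: [CLS] sentence [SEP] target [SEP]
enc = tokenizer(sentence, target, return_tensors="pt")
ids = enc["input_ids"][0].tolist()

# Token ids of the target without special tokens, used to locate its occurrences.
t_ids = tokenizer(target, add_special_tokens=False)["input_ids"]

def find_spans(ids, t_ids):
    """Return start indices of every occurrence of the target's token ids."""
    return [i for i in range(len(ids) - len(t_ids) + 1)
            if ids[i:i + len(t_ids)] == t_ids]

spans = find_spans(ids, t_ids)
internal_target_pos = spans[0]   # InT: occurrence inside the sentence (contextual)
external_target_pos = spans[-1]  # ExT: occurrence after the first [SEP] (isolated)
```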
3.2. Source for Query, Key, and Value
Before applying attention, the input, combined with the target and sentences, is processed through the model to obtain the query, key, and value vectors. We tried two methods to derive the vectors: directly using the contextualized embeddings of an encoder-based PLM and enhancing these embeddings through an LSTM layer [26]. The PLM efficiently captures relationships among all words in a sentence. For the embeddings generated by the PLM, we consider the input sequence $X = (x_1, \dots, x_n)$, where each input token $x_i$ is transformed across multiple layers of the Transformer architecture. The representation of these tokens at any layer $l$ is updated based on all other tokens’ representations as follows:

$$H^{l} = \mathrm{TransformerLayer}\big(H^{l-1}\big), \quad l = 1, \dots, L$$

$H^{0}$ is the initial embedding matrix, $H^{l}$ are the hidden states at layer $l$, and $E = H^{L}$ is the final embedding matrix output by the PLM.
Bidirectional LSTM (BiLSTM) excels at capturing sequential characteristics by processing information in both forward and backward directions [40,41]. Figure 3 shows the processing flow within the BiLSTM layer, illustrating how it enhances the PLM-derived embeddings. For the BiLSTM-enhanced output, the encoded vectors $E$ obtained from the PLM are processed through a BiLSTM layer to capture both contextual information and sequential characteristics [42,43]. The BiLSTM processes the embeddings as follows:

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\big(e_t, \overrightarrow{h}_{t-1}\big), \quad \overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(e_t, \overleftarrow{h}_{t+1}\big), \quad h_t = \big[\overrightarrow{h}_t; \overleftarrow{h}_t\big]$$
By concatenating the forward and backward LSTM outputs, BiLSTM ensures a richer representation that incorporates both past and future contexts within the sequence.
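A minimal PyTorch sketch of these two sources of Q, K, and V, i.e., raw PLM embeddings and BiLSTM-enhanced embeddings (the checkpoint name and hidden size are illustrative assumptions):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class PLMWithBiLSTM(nn.Module):
    """PLM token embeddings optionally enhanced by a BiLSTM layer (sketch)."""
    def __init__(self, plm_name="bert-base-multilingual-cased", hidden=768):
        super().__init__()
        self.plm = AutoModel.from_pretrained(plm_name)
        self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True,
                              bidirectional=True)   # forward + backward halves

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.plm(input_ids=input_ids, attention_mask=attention_mask,
                       token_type_ids=token_type_ids)
        E = out.last_hidden_state            # contextualized embeddings from the PLM
        H, _ = self.bilstm(E)                # sequence-enhanced representations
        return E, H                          # either source can supply Q, K, V
```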
3.3. Query Vector Formulation
After selecting the target used as the query and obtaining its token vectors, we considered three methods for formulating the query vector: padding, average pooling, and max pooling. These methods aim to accommodate varying lengths and condense information, thereby enhancing the ability to process and interpret the target effectively.
Padding is commonly used to standardize sequence lengths for parallel processing. Padding tokens, typically zeros, are dynamically appended within the batch to ensure that all sequences align with the longest sequence length.
Average pooling summarizes the overall information of the sequence to a single vector by taking the mean of the word embeddings. It treats every word equally without distinguishing the importance of each word.
Max pooling selects the maximum value across the embeddings in each dimension. It is particularly beneficial when certain features of a sequence are more important.
Following the formulation of query vectors, we applied the attention as delineated in Equation (7).
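The three query formulations can be sketched as follows (a hypothetical helper, assuming the target's sub-word vectors have already been extracted from the encoder output):

```python
import torch

def formulate_query(target_vectors, method="max"):
    """Condense the target's token vectors (t, d) into a query vector (sketch).

    "avg" and "max" pool to a single (1, d) query; "pad" keeps all target
    tokens, which would be zero-padded to the batch's longest target elsewhere.
    """
    if method == "avg":
        return target_vectors.mean(dim=0, keepdim=True)        # every token weighted equally
    if method == "max":
        return target_vectors.max(dim=0, keepdim=True).values  # strongest feature per dimension
    return target_vectors                                       # "pad": leave as-is for dynamic padding

# Example: three target sub-word vectors of dimension 768.
q = formulate_query(torch.randn(3, 768), method="avg")          # shape (1, 768)
```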
3.4. Sentence Representation
While the attention mechanism’s output offers a refined focus on the target, we tried different strategies to supplement and modulate it. The attention output might not always be sufficient to capture the overall expression, such as a consistent emotional state across a sentence; relying solely on it can lead to overconcentration on specific parts of the text, potentially neglecting the broader context or holistic features. Our approach to this issue is to combine a sentence representation with the attention output. We used the [CLS] token of the PLM and the final hidden state ($h_n$) of the BiLSTM as the sentence representations.
3.5. Combining Method
This subsection explores how we integrated the attention output obtained from Equation (7), denoted as $A$, with the additional sentence representation derived in Section 3.4, denoted as $S$.
We refer to the configuration that uses only the attention output $A$ as the final input before the loss function as X.
Element-Wise Addition (Plus) merges vectors by adding corresponding elements. This method requires both sources to have the same dimension and creates a new vector of the same dimension as the input. This approach is useful when the attention output and the sentence representation are equally informative and complementary, allowing for a direct blend of the two factors.
Concatenation (Cat) joins elements from one vector with another to form a single, extended vector. This method is advantageous for retaining the full information content without any loss.
A Gating Mechanism (Gate) learns the relative contributions of two sources dynamically. The combined input for the gate is formed by concatenating the attention output with the chosen sentence representation.
A linear transformation is then applied to this combined input, followed by a sigmoid activation function.
The resulting values, between 0 and 1, function as the gate, modulating the extent to which each source contributes to the final output.
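A hedged PyTorch sketch of these combining options (the class and mode names are illustrative, and the gated blend $g \odot A + (1-g) \odot S$ is one plausible reading of the gating description above, not necessarily the authors' exact formulation):

```python
import torch
import torch.nn as nn

class Combiner(nn.Module):
    """Combine attention output A with a sentence representation S (sketch).

    A and S are assumed to share the same dimension d.
    """
    def __init__(self, d, mode="gate"):
        super().__init__()
        self.mode = mode
        self.gate_proj = nn.Linear(2 * d, d)    # used only by the gating mechanism

    def forward(self, A, S):
        if self.mode == "x":                    # X: attention output only
            return A
        if self.mode == "plus":                 # element-wise addition
            return A + S
        if self.mode == "cat":                  # concatenation, dimension 2d
            return torch.cat([A, S], dim=-1)
        g = torch.sigmoid(self.gate_proj(torch.cat([A, S], dim=-1)))  # gate values in (0, 1)
        return g * A + (1 - g) * S              # gated blend of the two sources
```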
The obtained final output $O$ is transformed into a prediction space suitable for multi-label classification: $O$ is projected through a linear layer to produce logits $z = W_c O + b_c \in \mathbb{R}^{N}$, where $N$ represents the number of emotion labels.
Then, the logits are fed into a BCEWithLogitsLoss function, which combines a sigmoid layer and binary cross-entropy loss (BCELoss) in a single class. BCEWithLogitsLoss is more numerically stable than applying a plain sigmoid followed by BCELoss.
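A minimal sketch of this classification head and loss in PyTorch (the output dimension and label count below are illustrative):

```python
import torch
import torch.nn as nn

d, N = 768, 7                       # illustrative final output dimension and number of emotion labels
classifier = nn.Linear(d, N)        # projects the final output O to N logits
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy in one numerically stable op

O = torch.randn(16, d)                          # batch of final outputs
targets = torch.randint(0, 2, (16, N)).float()  # multi-label targets (each label independent)
loss = criterion(classifier(O), targets)
```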
6. Conclusions
In this paper, we proposed the Korean Target-Attention-Based Emotion Classifier (KOTAC) to solve the Korean Targeted Aspect-Based Emotion Analysis (TABEA) task. This model applies scaled dot-product attention to focus on the target, with the target serving as the query and the sentence as both key and value. We explored various methods to generate and apply the target vector. Given the importance of extracting the target, which serves as the query, we structured the input in the form of a sentence pair. We considered two approaches: using an internal target within the sentence as the query and using an external target outside the sentence as the query. To obtain the query, key, and value vectors, we used either the raw output of a PLM or BiLSTM-enhanced output. While the attention mechanism provides a focused analysis of the target, we also employed diverse methods using sentence representations, recognizing the potential risk that the attention output alone might not adequately capture the sentence’s overall context.
Our proposed KOTAC model effectively identified and focused on affective parts related to the target, even given the complex characteristics of the Korean language and social media data. Through comprehensive analysis and comparison of various methods, we identified and presented the most suitable configurations for the TABEA task, as detailed in Table 5. Using these optimal configurations, we then conducted additional experiments involving MTME and non-sentence pair formats, presented in Table 6. We confirmed that indirect target representation via target attention led to improvement even without direct target inclusion. Specifically, in the context of Multi-Target Multi-Emotion (MTME) scenarios, where contrasting affective states are connected to separate targets within one text, our model yielded a performance enhancement of 0.72% in F1 micro score over the baseline, demonstrating its effectiveness in complex emotional analysis.
Despite our attempts with various model configurations, simpler models often exhibited superior performance. This indicates the possibility of further analysis and optimization. Additionally, the MTME dataset was too small, potentially affecting the generalizability and robustness of the findings. For future work, obtaining a larger dataset is crucial as it will enable a more precise and comprehensive analysis of the models’ performance. In another approach, we propose a dual strategy: employing the configuration that uses only the attention output for MTME scenarios to spotlight target-focused analysis and including sentence representation for texts exhibiting consistent emotional states, thereby tailoring our analysis to the specific demands of each scenario. In future research, our model’s methodology might be applied to other tasks, such as entity recognition and relation extraction, where the designation of the target is important just as in the TABEA task.