Article

Multi-Label Classification of Complaint Texts: Civil Aviation Service Quality Case Study

1
China Academy of Civil Aviation Science and Technology, Chaoyang District, Beijing 100028, China
2
School of Economics and Management, China University of Petroleum, Beijing 102249, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(3), 434; https://doi.org/10.3390/electronics14030434
Submission received: 19 November 2024 / Revised: 11 January 2025 / Accepted: 20 January 2025 / Published: 22 January 2025

Abstract

Customer complaints play an important role in the adjustment of business operations and improvement of services, particularly in the aviation industry. However, extracting adequate textual features to perform a multi-label classification of complaints remains a difficult problem. Current multi-label classification methods applied to complaint texts have not been able to fully utilize complaint information, and little research has been performed on complaint classification in the aviation industry. Therefore, to solve the problems of insufficient text feature extraction and the insufficient learning of inter-feature relationships, we constructed a multi-label classification model (MAG, or multi-feature attention gradient boosting decision tree classifier) for civil aviation service quality complaint texts. This model incorporates multiple features and attention mechanisms to improve the classification accuracy. First, the BERT (Bidirectional Encoder Representations from Transformers) model and attention mechanisms are used to represent the semantic and label features of the text. Then, the Text-CNN (a convolutional neural network) and BiLSTM (bidirectional long short-term memory) multi-channel feature extraction networks are used to extract the local and global features of the complaint text, respectively. Subsequently, a co-attention mechanism is used to learn the relationship between the local and global features. Finally, the travelers’ complaint texts are accurately classified by integrating the base classifiers. The results show that our proposed model improves the multi-label classification accuracy, outperforming other modern algorithms. We demonstrate how the label feature representation based on association rules and the multi-channel feature extraction network can enrich textual information and more fully extract features. Overall, the co-attention mechanism can effectively learn the relationships between text features, thereby improving the classification accuracy of the model and enabling better identification of travelers’ complaints. This study not only effectively extracted text features by integrating multiple features and attention mechanisms, but also constructed a targeted feature word set for complaint texts based on the domain-specific characteristics of the civil aviation industry. Furthermore, by iterating the basic classifier using a multi-label classification model, a classifier with higher accuracy was successfully obtained, providing strong technical support and new practical paths for improving the civil aviation service quality and complaint management.

1. Introduction

Complaint forms are an important feedback channel for air travel customers. Complaint information often exposes problems in the operations of enterprises and can play an important role in guiding changes in business operations and improving services based on customers’ needs [1]. Therefore, quickly and accurately categorizing customer complaints, particularly by converting unstructured data into structured forms, is crucial for the handling of complaints and the exploration of customer needs [2]. Structured complaint data also give the party being complained about a clearer view of the overall complaint situation, so that similar issues can be handled systematically. A complaint text can consist of multiple complaint topics, meaning each instance corresponds to more than one kind of label, and the number of labels is uncertain. Multi-label classification has recently become a popular research topic. However, current multi-label classification methods for complaint texts are not able to fully describe travelers’ complaint information, and extracting robust textual features remains challenging.
With the development of machine learning and deep learning techniques, increasing numbers of researchers have applied such methods to complaint text classification. Traditional machine learning methods such as support vector machines (SVMs), K-nearest neighbors (KNNs), and naive Bayes methods have been introduced to solve text classification problems. With regard to deep learning models, recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their variants are widely used. Recently, pre-trained language models (PLMs) such as BERT (Bidirectional Encoder Representations from Transformers) have achieved state-of-the-art results on a number of natural language processing (NLP) tasks and have seen widespread use. However, these approaches are based on generalized data and often ignore the importance of domain-specific features in classifying complaint texts. Moreover, when applied directly to data from civil aviation, these models often achieve lower performance compared to generalized domains.
In civil aviation, travelers’ complaints are often directed at different service categories, such as ticketing, flight irregularities, baggage, and air service. Complaints falling under different categories reflect the service quality from different perspectives and to different degrees. To address gaps in the research on multi-label classification for improving the civil aviation service quality and improve upon the performance of generalized methods, we propose a textual multi-label classification model for complaints that incorporates multiple features and attention mechanisms. Specifically, the method accounts for service quality feature words, accurately extracts the textual features of the complaints, and successfully learns the relationships between features, thus improving the multi-label classification model’s performance.
The main contributions of this study can be summarized as follows: (1) The proposed model can extract textual features more fully by integrating multiple features and attention mechanisms; this remedies the common problem of insufficient textual feature extraction. (2) The domain-specific characteristics of the civil aviation industry were accounted for by constructing a feature word set of complaint text labels, improving the effectiveness of the multi-label classification model. (3) A classifier with higher accuracy was obtained by iterating the base classifier using GBDTs (gradient boosting decision trees), enabling the automatic selection of important features and further improving the performance of the multi-label classification model. Experiments showed that our constructed model performs well at its task of the multi-label classification of travelers’ complaint texts. Compared with the traditional BR-GBDT model, the F1 value increased by 3.08 percentage points, which is also better than other modern algorithms.
We begin by introducing related research and theoretical methods of text classification and multi-label classification, analyzing their respective advantages and disadvantages. Next, the multi-label classification model involving multiple features and attention mechanisms is introduced in detail. Third, we describe an experimental evaluation we conducted of the proposed model and analyze its performance. Finally, applications of the model are discussed and improvements upon previous work are summarized; moreover, avenues for future research are proposed.

2. Related Work

With the rapid development of the internet and social media, the volume of customer complaint text data has increased dramatically. Extracting useful information from these complaint texts has become an important research direction in the field of natural language processing. In particular, multi-label classification tasks are crucial in customer complaint text classification, as a single complaint may involve multiple issues or topics. This section reviews recent research progress in this field, with a particular focus on classification studies for complaint texts and the latest developments in multi-label classification methods.

2.1. Classification of Complaint Texts

Text classification refers to the automatic categorization of text information according to set labels using natural language processing techniques. Text classification methods are mainly divided into two categories: machine learning and deep learning methods. The most commonly used machine learning algorithms for customer complaint text classification are support vector machines (SVMs) [3,4,5], K-nearest neighbors (KNNs) [6,7,8], Naive Bayes [9,10], and others. However, many of these algorithms suffer from the problems of high dimensionality and data sparsity when performing text classification, and their performance is generally lacking. In recent years, with the rise of deep learning, more researchers have used neural networks for text classification. This shift towards deep learning has improved the performance of text classification models, particularly when handling complex or large datasets.
Most of these algorithms focus on a certain subtask to improve the performance of the overall model. For instance, some algorithms focus on more comprehensive representations of the text. Ref. [11] proposed the Vector Space Model (VSM), which treats texts as a collection of words and considers elements such as the TF (term frequency) or TF-IDF (term frequency–inverse document frequency); this is the most widely used type of feature representation for text classification tasks. For example, Zhu [12] used the VSM to calculate the information gain of mutual information and determine the characteristics of news text data. However, this type of model does not consider the positional relationship between words. To address the high dimensionality of text feature vectors and data sparsity, Blei et al. [13] proposed the Latent Dirichlet Allocation (LDA) topic model, which maps the high-dimensional word item space to a low-dimensional implicit semantic space, effectively reducing the dimensionality of text vectors. This method has been widely used in the field of text classification. For example, Pavlinek et al. [14] presented the ST LDA method for text classification. Although the probabilistic topic model considers the correlations between words, it cannot utilize the contextual information constituted by neighboring words. Therefore, Word2Vec, as proposed by Mikolov et al. [15] at Google, became one of the most widely used word embedding models. The Word2Vec word vector model can solve high-dimensional sparseness and can also introduce semantic features in combination with the context [16]. However, Word2Vec is a relatively shallow method because it only places pre-trained information in the first layer of the model. Compared with the above traditional text representation models, the text vector representation model BERT reached a leading level in tasks of text classification and information extraction. It can effectively alleviate the issues of the multiple meanings of words, repetition of first words, and long-term dependency.
Some studies have focused on optimizing feature extraction methods. Deep learning algorithms commonly used for text classification are recurrent neural networks (RNNs), convolutional neural networks (CNNs), and their variants. Hu et al. [17] added long short-term memory (LSTM) and an attention model to an independently recurrent neural network (IndRNN) to avoid the exploding and vanishing gradient problems encountered when training RNNs. Chen et al. [18] used a CNN for encoding and an RNN for decoding (i.e., an encoder–decoder framework) to capture global and local text semantics and improve the performance of a multi-label classification model. Text-CNNs employ different sizes of convolutional kernels to extract local features shared between words within the range of the kernels, but cannot reveal long-distance dependencies due to the limitations of the size of the kernels [19]. RNNs and their variants, long short-term memory (LSTM) and bidirectional long short-term memory (BiLSTM) models, are a class of neural networks designed to model sequences. They are inherently suitable for modeling text and can capture its global structural information, but can also face the issues of gradient dispersion and gradient explosion [20]. Thus, combining these approaches is expected to lead to more robust and adaptable text classification systems.
An attention mechanism has been applied in many natural language processing tasks. Unlike maximum pooling, which only selects the most important information, attention mechanisms learn the proportion of the contribution of each part of the text to the overall semantic information. Important words or phrases are assigned higher weights so that the key patterns are identified; however, word order information is ignored, resulting in an inability to use the text’s global structural information.
It is clear that individual classification algorithms have their respective limitations, and combining different deep learning networks to obtain better text classification results is a promising research topic. Deep learning fusion methods can be divided into two categories according to their network structure: single-channel and multi-channel. Single-channel networks generally process text vectors sequentially, and the feature extraction process is usually expressed as a serial structure. Ma et al. [21] combined the advantages of BERT and multi-scale CNNs to construct a multi-label classification model, which first captures the local features of the text and then combines utterance features of different scales in the convolutional layers to obtain richer semantic feature information, thus improving the performance in multi-label classification. Gu et al. [22] combined the basic principles and network structure of a CNN in deep learning to construct a community security risk prediction model based on the improved single-channel CNN and decision tree, which extended the single channel to a multi-channel network, greatly expanding the receptive field of the CNN and improving the accuracy of the prediction model. Rhanoui et al. [23] proposed a CNN-BiLSTM model with an attention mechanism, which can extract both the local semantic and multi-scale contextual features of words in sentences, compensating for the shortcomings of CNNs in global feature extraction. Kavianpour et al. [24] employed a deep learning model based on convolutional neural networks (CNNs), bidirectional long short-term memory (BiLSTM), and an attention mechanism, as well as a zero-order hold (ZOH) preprocessing methodology to improve the performance of prediction results. The single-channel network structure means that the feature extraction process is not independent, and the structure of sequential processing can lead to feature extraction issues at the ends of texts.
Multi-channel feature extraction networks can use different methods to extract features independently from the input sequence, which helps alleviate the problem of short and sparse features in complaint texts. Xu et al. [25] used a Text-CNN to capture local features and improved the effect of multi-label text classification by using a dual-channel aggregation of historical text and labels through Bi-LSTM. Traditional multi-channel customer complaint text classification methods directly use concatenation or the dot product for the fusion of feature vectors output from different extraction networks; this does not account for inter-feature relationships and interactions, which cause these relationships not to be learned. As a resource allocation method, an attention mechanism can assign different weights to different parts of the encoder input, thus solving the problem of the lack of differentiation of input features [26]. Yan et al. [27] proposed R-Transformer-BiLSTM, a model that combines R-Transformer, BiLSTM+CRF, and self-attention to enhance the multi-label text classification efficiency and accuracy. Zhao et al. [28] proposed a multi-label text classification model that uses keyword-based label representation and self- and interactive attention mechanisms to extract features, resulting in improved performance. Wang et al. [29] proposed SemFA, a large-scale multi-label text classification model based on semantic features and associative attention, which establishes potential associations between labeled features and text features through an associative attention mechanism. By synthesizing global information and local semantic features at different granularities, the model outperforms most existing large-scale multi-label classification algorithms.
A co-attention mechanism involves feeding multiple feature vectors into a network at the same time and jointly learning their respective attention weights. Thus, it is often used for the mutual extraction of key information shared between image and text features in visual question and answer tasks [30]. Hu et al. [31] designed an image–text matching-aware co-attention mechanism which captures the alignment of images and text for better multimodal fusion to improve fake news detection. The multi-head attention mechanism is a combination of attention mechanisms proposed by Vaswani et al. [32], which maps the input features to different subspaces for representation. In this way, the attention part can jointly focus on the information from different representation subspaces, which improves the learning effect. Khan et al. [33] proposed a novel MSER model that uses a multi-headed cross-attention mechanism for the deep feature fusion of audio and text cues, achieving SOTA results with a 4.5% improved recognition rate on the IEMOCAP and MELD datasets. Wang et al. [2] used dual-channel word embedding to enrich text representation and inputted features extracted by a Text-CNN into a corresponding BiLSTM network to extract the word order information. They then fused the extraction results of each channel by using the multi-head collaborative attention mechanism, thus improving the text classification results.

2.2. Multi-Label Text Classification

Many methods have been proposed for multi-label text classification, with traditional methods belonging to two main categories: Problem Transformation (PT) and Algorithm Adaptation (AA). The first type is the most classical method, whose strategy is to convert a multi-label classification problem into a series of single-label classification problems, so that existing single-label learning algorithms can be applied to solve the problem. Representative PT methods include binary relevance (BR) [34,35,36], label powerset (LP) decomposition [37,38], and classifier chains (CCs) [39,40,41]. The BR method decomposes the multi-label classification problem into multiple binary classification problems for processing; the LP method transforms the multi-label classification problem into a multi-classification problem by viewing label combinations as classification categories; and the CC method transforms the multi-label classification task into a chain of binary classification problems, and the subsequent chain of binary classifiers makes predictions based on the previous ones.
The strategy of the second class, algorithmic adaptation methods, is to improve and extend current single-label learning algorithms so that they can be used in multi-label classification tasks. Representative methods include the ML-DT (multi-label decision tree), Rank-SVM (ranking support vector machine), and ML-KNNs (multi-label K-nearest neighbors). The MLDT method performs classification operations by constructing a decision tree; the Rank-SVM method handles the multi-label classification problem by means of an SVM; and the ML-KNN method improves upon the KNN algorithm to handle multi-labeled data.
Compared with individual classifiers, the performance of integrated classifiers is more robust. Therefore, some researchers have focused on integration algorithms, which combine multiple weak classifiers to obtain a new classifier with a better generalization ability. Common integration methods mainly include bagging and boosting. Bagging [42] is the most famous example of a parallel integration learning method, which creates an ensemble of classifiers over bootstraps through learning and then generates diverse classifiers [43]. Boosting [44,45,46] improves the accuracy of base classifiers by iterating them to detect misclassified data more finely, thus improving the classification accuracy. Freund et al. [47] proposed the adaptive boosting (AdaBoost) integration algorithm, which can effectively avoid overfitting compared to other classifiers. In 2000, Schapire et al. [48] applied the boosting algorithm to text categorization, using experiments to demonstrate the superiority of the boosting integration algorithm, with better performance compared to that of a single classifier. Winata et al. [49] applied the boosting algorithm to text categorization and demonstrated that adaptive boosting (AdaBoost) has high potential in boosting the performance of text categorization. Subsequently, Friedman [50] proposed the GBDT algorithm: by iteratively training weak decision trees and optimizing the loss function using gradient descent, a classifier with high accuracy was obtained. The GBDT algorithm is able to deal with high-dimensional, nonlinear, and noisy data and can automatically select important features with high accuracy. Moreover, it has strong generalization and anti-interference abilities.
Based on the current state of research, we propose a multi-channel feature extraction model that leverages semantic features and label features, specifically a BR-GBDT multi-classification integration model. First, the BERT model is used to obtain the contextual representation of the text, and the association rule is used to obtain the label features of the text. Then, local and global features are extracted by a Text-CNN and Bi-LSTM, respectively, and an attention mechanism is used to achieve synergy between the features; in this way, the features can be successfully harnessed by the classification model to improve the category prediction accuracy.

3. Study Design

3.1. Modeling Framework

The multi-label classification model of civil aviation service quality complaint texts incorporating multiple features and attention mechanisms proposed in this study mainly consists of an input layer, a text vectorization representation layer, a multi-channel feature extraction layer, a feature interaction layer, a multi-label classification layer, and an output layer. The overall structure is shown in Figure 1.
In the input layer, raw text data are received and preprocessed to ensure that they enter the model in the appropriate format and quality. This includes removing irrelevant information, performing necessary text cleaning, and formatting operations.
In the text vectorization representation layer, the BERT model is used to obtain a vectorized representation of the text at a word-vector-level granularity, which is then combined with the attention weight based on the label information. The result of the vectorized representation then fuses the word embedding and the label embedding. The label information is used to guide the learning of the text feature representation.
In the multi-channel feature representation layer, the text vectors are input into the feature extraction layer involving local and global feature extraction. The local feature extraction part is a Text-CNN consisting of three convolutional kernels of different sizes, which is used to extract local features shared between word vectors within the range of the kernels. The different layers of local features obtained from the three channels are spliced together and output as local text features. The global feature extraction part is a BiLSTM network that captures the longer range dependencies. The text sequence is modeled in both forward and backward directions using the BiLSTM network to better capture bidirectional semantic dependencies; the feature outputs from the two directions are spliced together as the global text feature output.
In the feature interaction layer, local and global features are jointly input into a multi-head co-attention mechanism to extract key information in another feature vector, revealing the interactions between features. Finally, the features are spliced with the original feature vector as inter-feature relationship information to enable feature fusion.
In the multi-label classification layer, a separate binary classifier is trained for each label using the BR algorithm, with binary classifiers being constructed using SVM, KNN, NB, and DT methods among other algorithms. Then, the most effective classifier is used to predict the samples.
In the output layer, different thresholds are set for each label as needed to control the predictions, and the final output is the category labels.
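As a reading aid, the following minimal PyTorch sketch shows how the layers described above could be composed into one forward pass. The module names, interfaces, and data shapes are illustrative assumptions (the concrete layers are sketched in the corresponding subsections below), not the authors' released implementation.

```python
import torch.nn as nn

class ComplaintFeatureBackbone(nn.Module):
    """Illustrative composition of the feature-extraction layers described above."""

    def __init__(self, text_encoder, local_cnn, global_bilstm, co_attention):
        super().__init__()
        self.text_encoder = text_encoder    # BERT + label-feature attention weighting (Section 3.2)
        self.local_cnn = local_cnn          # Text-CNN with kernel heights 3/5/7 (Section 3.3.1)
        self.global_bilstm = global_bilstm  # BiLSTM over the token sequence (Section 3.3.2)
        self.co_attention = co_attention    # multi-head co-attention fusion (Section 3.4)

    def forward(self, token_ids, attention_mask):
        x = self.text_encoder(token_ids, attention_mask)  # weighted text representation X
        a = self.local_cnn(x)                             # local features A
        b = self.global_bilstm(x)                         # global features B
        f = self.co_attention(a, b)                       # fused features F
        return f  # passed to the BR-GBDT multi-label classification layer (Section 3.5)
```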

3.2. Text Vectorization Representation Layer

Text representation is the process of transforming text into a numerical matrix, which is the first step of our proposed method. Traditional text vectorization models include Word2Vec, Doc2Vec, LDA, and others, but these models cannot solve the problem of the multiple meanings of words and cannot account for labeling information. The BERT model is a pre-trained language model released by Google [51], which can solve the problem of multiple word meanings. Tagged feature words are frequent features obtained based on association rule mining, which can account for the correlations between text and tags. The model learns a text vector representation for each tag based on the tag features, which improves the classification accuracy to an extent.

3.2.1. BERT Word Embedding

Text information is intuitively understandable for human beings but cannot be directly processed by computers. Therefore, it is necessary to transform text into numerical data. Traditional text representation methods include one-hot and matrix decomposition, but they suffer from the curse of dimensionality and a high computational cost. With the development of neural networks, word embedding has emerged as a new method of word representation, which uses a continuous, low-dimensional, dense vector to represent words, referred to as a word vector. Word2Vec and Glove are static word vector representations, meaning that for any word, its vector is constant and does not change with its context. However, static word vectors still cannot solve the problem of multiple word meanings.
In order to better represent the word features of text, we adopted a pre-trained model, BERT, to calculate the contextual representation of each word, so that the same word can receive different representations in different contexts. BERT consists of a multi-layer transformer that accepts an input sequence of up to 512 tokens and outputs a representation of the sequence. Its process is shown in Figure 2. For an input document consisting of k words, $W = [w_1, w_2, \ldots, w_k]$, the corresponding word vectors $E_w = [e_1, e_2, \ldots, e_k]$ are obtained through the transformer module (shown in blue) in the BERT model.
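As an illustration, contextual word vectors of this kind can be obtained with the Hugging Face transformers library. The checkpoint name bert-base-chinese and the example sentence are assumptions made here for Chinese complaint texts, not details taken from this paper.

```python
import torch
from transformers import BertModel, BertTokenizer

# Assumed checkpoint: a Chinese BERT, since the complaint texts are mainly Chinese.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

text = "航班延误后未及时安排食宿"  # illustrative complaint fragment
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = bert(**inputs)

# E_w: one contextual vector per token, shape [1, k, 768] for bert-base models
E_w = outputs.last_hidden_state
print(E_w.shape)
```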

3.2.2. Label Feature Word Embedding

To consider the characteristics of civil aviation service quality and to improve the performance of the classification model, we independently constructed a feature word set for each label and assigned a higher weight to the feature words corresponding to the labels in the text vector representation. In this way, we made them play a greater role in the classification of each label. The steps of this workflow are described as follows:
  • Feature word set construction for civil aviation service quality complaint text labeling:
    First, the corpus corresponding to each tag was segmented and lexically labeled sentence by sentence. The noun component of the sentence was taken as the basic transaction, and single-word nouns were filtered to create the basic transaction set. Then, the frequent 1-item sets $L_1$ were obtained using the association rule Apriori algorithm, and the frequent 2-item sets $L_2$ were obtained from $L_1$.
    The Apriori algorithm for association rules is a classic algorithm widely used in the field of data mining, aimed at discovering interesting association relationships between item sets in a dataset. This algorithm adopts an iterative strategy of a layer-by-layer search, first identifying all frequent item sets, then constructing new candidate sets based on these frequent item sets and calculating their support to filter out the next order of frequent item sets. This process continues until no more frequent item sets containing more items can be found.
    Since the frequent item set mining algorithm of association rules does not consider the positional relationships of nouns in the complaint text, the two nouns in a frequent 2-item set of $L_2$ may not actually form a phrase. Therefore, $L_2$ was filtered according to dependency relations to remove non-phrase features: only item sets whose two nouns were related as a noun phrase (NP) were retained, yielding a new set of frequent 2-item sets, $L_2'$. Then, starting from the candidate feature set $L_1$, we introduced domain dictionaries such as “civil aviation” and “civil aviation thesaurus” to supplement the mining of infrequent domain features. This enabled us to collect more comprehensive and accurate domain features, which we used to obtain $L_1'$.
    The redundant features were then filtered by the minimum independent support. The minimum independent support is the support of the frequent 1-item set containing the feature word minus the absolute support of the frequent 2-item set containing the word. If the minimum independent support of a feature word was less than a threshold value, T, the word was filtered as a redundant feature, and the final set of frequent 1-item sets satisfying the independent support constraint was obtained, referred to as $L_1''$.
    Finally, the semi-automatic updating of the civil aviation domain feature dictionary was realized by manually updating the dictionary with feedback on the mined frequent features, which saves human effort. After candidate feature mining, pruning, filtering, and infrequent item set supplementation, the domain feature set was finally obtained as $F = L_2' \cup L_1' \cup L_1''$. A minimal code sketch of this mining step, together with the feature word embedding described below, is given after this list.
  • Feature word embedding:
    (1) Training the Word2Vec model:
    First, we chose the model type. Word2Vec has two main model types, Skip-gram and CBOW (continuous bag of words). Skip-gram usually performs better on small datasets, while CBOW is faster for large datasets.
    Next, we set the hyperparameters of the model, including the vector dimension, window size, number of training iterations, and others.
    Then, the model was trained. The Word2Vec model was trained using preprocessed text data. The model learns word vectors by predicting the context of a word (Skip-gram) or predicting a word using the context (CBOW).
    (2) Obtaining word vectors for the topic feature words:
    For a feature word set consisting of t words, $F = [f_1, f_2, \ldots, f_t]$, we obtained the word vectors of the label feature words from the trained Word2Vec model as $E_f = [e_{f_1}, e_{f_2}, \ldots, e_{f_t}]$.
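A minimal sketch of the two steps above is shown below: frequent 1- and 2-item sets are counted directly with an absolute support threshold (standing in for the full Apriori algorithm), and gensim's Word2Vec supplies the feature word vectors. The example transactions, the threshold, and the hyperparameters are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations
from gensim.models import Word2Vec

# Illustrative transactions: noun lists extracted per sentence of one label's corpus.
transactions = [
    ["机票", "退款"], ["机票", "改签", "手续费"], ["退款", "手续费"],
    ["机票", "退款", "客服"], ["行李", "托运"],
]
min_support = 2  # absolute support threshold (an assumption)

# Frequent 1-item sets L1.
item_counts = Counter(word for t in transactions for word in set(t))
L1 = {w for w, c in item_counts.items() if c >= min_support}

# Frequent 2-item sets L2, built only from items already in L1 (the Apriori property).
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(set(t) & L1), 2)
)
L2 = {p for p, c in pair_counts.items() if c >= min_support}
print(L1, L2)

# Word2Vec vectors for the retained feature words (Skip-gram; small illustrative settings).
w2v = Word2Vec(sentences=transactions, vector_size=100, window=5, sg=1, min_count=1)
E_f = {w: w2v.wv[w] for w in L1}
```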

3.2.3. Text Vector Representation Based on Attention Weighting of Label Features

In this subsection, we propose a text vector representation based on the attentional weighting of label features. First, we calculated the IDF value for each feature word, which was used to measure the importance of that feature word in the whole corpus:
$IDF_f = \log \dfrac{N}{DF_f}$
Here, $N$ is the total number of comment documents under this tag, and $DF_f$ is the number of documents in which the feature word $f$ appears.
Second, the attention weight of each feature word was computed using TF-IDF, which was represented as a weight vector of the feature word. For a feature word in a document, f, its TF-IDF weight is
$TFIDF_f = TF_f \times IDF_f$
where $TF_f$ is the frequency of the feature word $f$ in the current document.
Then, the computed attention weights were used as the weight coefficients of the feature word embedding matrix and multiplied with the corresponding word vectors in order to construct the label feature word embedding weight matrix $E_f^{Attention}$:
$E_f^{Attention} = E_f \odot TFIDF_f$
Here, $E_f$ is the feature word vector matrix of the label, $TFIDF_f$ is the attention weight of the feature word, and ⊙ denotes the element-wise (Hadamard) product, which multiplies the elements at corresponding positions.
Finally, the BERT word embedding matrix $E_w$ was multiplied with the label feature word embedding weight matrix $E_f^{Attention}$ to obtain the label feature attention-weighted text vector representation $X$:
$X = E_w \odot E_f^{Attention}$
In Equation (4), $E_w$ is the word vector matrix obtained from the BERT model.
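The NumPy sketch below illustrates Equations (1)–(4). All names and values are assumptions; in particular, the operator fusing the BERT embeddings with the weighted label feature embeddings is only described as a product above, so the last step here is one plausible reading (scaling each token by its mean similarity to the weighted feature words) rather than the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, t, N = 32, 768, 4, 1000          # tokens, embedding size, feature words, corpus size
E_w = rng.normal(size=(k, d))           # BERT token embeddings for one document
E_f = rng.normal(size=(t, d))           # vectors of the label feature words
df = np.array([120, 45, 300, 10])       # document frequency of each feature word
tf = np.array([2, 0, 1, 3])             # term frequency in the current document

idf = np.log(N / df)                    # Equation (1)
tfidf = tf * idf                        # Equation (2): attention weight per feature word
E_f_attention = E_f * tfidf[:, None]    # Equation (3): weighted label feature embeddings

# Equation (4), one plausible reading: scale each token embedding by (1 + its mean
# similarity to the weighted feature words), keeping the output shape [k, d].
weights = 1.0 + (E_w @ E_f_attention.T).mean(axis=1, keepdims=True) / d
X = E_w * weights
print(X.shape)  # (32, 768)
```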

3.3. Multi-Channel Feature Extraction Layer

Although the vectors obtained from the text representation layer contain contextual relationships and labeling information, they still have the problems of noise and high dimensionality. At the same time, the features of the complaint text are sparse and need to be extracted and reduced in dimension by the feature extraction network. In this section, we describe how Text-CNN and BiLSTM multi-channel feature extraction networks with parallel structures were used to extract the local and global features of the text, respectively.

3.3.1. Text-CNN-Based Localized Feature Extraction

The Text-CNN [51,52,53] is a network with a strong text feature extraction capability. In this study, a Text-CNN with three different sizes of convolutional kernels was used to extract feature information within the range of the kernels, and maximum pooling was performed to reduce the dimensionality. The Text-CNN’s structure is shown in Figure 3.
The output, $X$, of the text representation layer is first fed into the Text-CNN, and the convolutional kernel performs sequential convolution operations on the text matrix to extract local features. A local feature $c_j$ is extracted as shown in Equation (5):
$c_j = f(w \cdot y_{j:j+r-1} + b)$
Here, $f(\cdot)$ is a nonlinear activation function (ReLU), $b$ is a bias term, and $w$ is a convolutional kernel parameter matrix of dimension $r \times k$. The kernel heights $r$ for the three channels are set to 3, 5, and 7, respectively.
For all the samples after convolution, a collection of feature maps is obtained as follows:
$C = \{c_1, c_2, \ldots, c_{n-h+1}\}$
In Equation (6), $C$ denotes the set of feature maps after convolution, $c_j$ is the $j$th local feature extracted by the convolutional kernel, $j = 1, 2, \ldots, n-h+1$, and $n-h+1$ is the number of feature maps obtained by convolving an input of original feature dimension $n$ with a kernel of height $h$.
In this study, we used maximum pooling for the feature extraction and dimensionality reduction of the set of feature maps output from each channel. Maximum pooling means extracting the maximum value in each feature map, as shown in Equation (7).
$C' = \max\{C\}$
Here, $C'$ denotes the pooled feature output. $\max$ means that the largest value in the feature region, i.e., the most responsive part of the feature, is selected and passed to the next layer.
Three convolutional kernels of different sizes are used for feature extraction, and the output of each channel is pooled to obtain a feature vector of size $1 \times r$; the outputs of the three channels are then spliced together, as defined by
$A = C_1' \oplus C_2' \oplus C_3'$
where $C_1'$, $C_2'$, and $C_3'$ are the output features after convolution and maximum pooling with kernel heights of 3, 5, and 7, respectively. $A$ is the extracted local feature output, and ⊕ represents the concatenation of vectors.
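A minimal PyTorch sketch of this local feature extractor is shown below; the filter count and embedding size are assumptions, while the kernel heights 3, 5, and 7 follow the description above.

```python
import torch
import torch.nn as nn

class TextCNNLocal(nn.Module):
    """Three-channel Text-CNN: convolution (Equation (5)), max pooling (Equation (7)),
    and concatenation of the channel outputs into the local feature vector A."""

    def __init__(self, embed_dim=768, num_filters=128, kernel_heights=(3, 5, 7)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in kernel_heights
        )

    def forward(self, x):                    # x: [batch, seq_len, embed_dim]
        x = x.transpose(1, 2)                # Conv1d expects [batch, embed_dim, seq_len]
        pooled = []
        for conv in self.convs:
            c = torch.relu(conv(x))          # local feature maps for one kernel height
            c = torch.max(c, dim=2).values   # max pooling over sequence positions
            pooled.append(c)
        return torch.cat(pooled, dim=1)      # concatenated local features

# Example usage: A = TextCNNLocal()(torch.randn(8, 64, 768)) has shape [8, 384].
```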

3.3.2. BiLSTM-Based Global Feature Extraction

BiLSTM [54,55,56] is a variant of LSTM which can model text in both forward and reverse directions, obtain the dependencies between the previous and current text, and realize global feature extraction from text sequences. The “gate” structure of BiLSTM is shown in Figure 4.
At time $t$, the current hidden layer state is denoted as $h_t$, and the state of each gate is updated as follows:
$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f)$
$c_t = f_t \times c_{t-1} + i_t \times \tanh(w_c \cdot [h_{t-1}, x_t] + b_c)$
$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \times \tanh(c_t)$
Here, $x_t$ is the current input. $f_t$ is the forgetting gate at moment $t$, and $c_t$ is the memory cell. $i_t$ and $o_t$ are the input gate and output gate, respectively, at moment $t$. $w_f$, $w_c$, $w_i$, and $w_o$ are the weight matrices, and $b_f$, $b_c$, $b_i$, and $b_o$ are the biases of the forgetting gate, memory cell, input gate, and output gate, respectively. $\sigma$ is the activation function, and $h_t$ is the output of the hidden layer at time $t$. The operator × denotes the element-wise product of the two vectors on either side of it, and · denotes the product of a weight matrix with the concatenated vector.
BiLSTM models the text using LSTM in both forward and reverse directions and concatenates the outputs of the last moment of the forward and reverse networks together as the final output, as defined by
$B = h_a \oplus h_b$
where $h_a$ and $h_b$ denote the last hidden layer states of the forward and reverse directions, respectively. $B \in \mathbb{R}^{d \times 2}$ is the extracted global feature output, and ⊕ represents vector concatenation.
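A corresponding PyTorch sketch of the global feature extractor is given below; the hidden size is an assumption.

```python
import torch
import torch.nn as nn

class BiLSTMGlobal(nn.Module):
    """BiLSTM over the token sequence; the last hidden states of the forward and
    backward directions are concatenated as the global feature B = h_a ⊕ h_b."""

    def __init__(self, embed_dim=768, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: [batch, seq_len, embed_dim]
        _, (h_n, _) = self.bilstm(x)           # h_n: [2, batch, hidden_dim]
        h_a, h_b = h_n[0], h_n[1]              # forward and backward last hidden states
        return torch.cat([h_a, h_b], dim=1)    # [batch, 2 * hidden_dim]

# Example usage: B = BiLSTMGlobal()(torch.randn(8, 64, 768)) has shape [8, 512].
```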

3.4. Feature Interaction Layer

The co-attention mechanism [30] reflects the relationship between feature vectors by calculating the shared similarity matrix, and then the shared similarity matrix is applied to the original feature vectors to reveal information interactions between features. On the basis of the cooperative attention mechanism and multi-head attention mechanism [34,57], we constructed a multi-head co-attention mechanism for the interactions between the local and global features of the complaint text. In the proposed model, the inputs of the multi-head co-attention mechanism are the local feature matrix output from the multi-channel feature extraction network, $A \in \mathbb{R}^{d \times 3}$, and the global feature matrix $B \in \mathbb{R}^{d \times 2}$, and $h$ is the number of co-attention layers.
The input features are first passed through $h$ individual linear layers for dimensionality reduction, thus obtaining representations of the same feature vectors in several different vector spaces, as shown in Equations (15) and (16):
$A_1, A_2, \ldots, A_h = Linear^a_{1,2,\ldots,h}(A)$
$B_1, B_2, \ldots, B_h = Linear^b_{1,2,\ldots,h}(B)$
Here, the superscripts $a$ and $b$ indicate the linear dimensionality reductions applied to the $A$ and $B$ feature matrices, respectively. $d$ is the vector dimension of a single channel, and $h$ is the number of dimensionality reduction layers. The reduced features $A_i \in \mathbb{R}^{(d/h) \times 3}$ and $B_i \in \mathbb{R}^{(d/h) \times 2}$, obtained using the functions $Linear_{1,2,\ldots,h}(\cdot)$, are then input into multiple collaborative attention mechanisms to learn the relationships between the features. The structure of the $i$th collaborative attention mechanism is shown in Figure 5.
The shared similarity matrix $C_i \in \mathbb{R}^{3 \times 2}$ between feature vectors is first calculated within the collaborative attention mechanism, and the normalized similarity matrix $C_i$ contains the information about the similarity between local and global features. In particular, the vector inner product is performed to compute the similarity matrix, so the widths of the feature vectors must be equal. It is calculated as follows:
$C_i = \tanh(A_i^T \cdot B_i)$
where $A_i^T$ is the transpose of the local feature matrix $A_i$, and $B_i$ is the global feature matrix.
The similarity matrix is then multiplied with the original feature matrices to extract the key information from the feature vector output by the other channel, revealing the feature interactions. The extracted features $V_i^a$ and $V_i^b$ are calculated as
$V_i^a = \tanh(A_i C_i), \quad V_i^b = \tanh(B_i C_i^T)$
Here, $A_i$ is the original local feature matrix, $B_i$ is the original global feature matrix, $C_i$ is the similarity matrix of the two, and $C_i^T$ is the transpose of $C_i$.
The input and output feature dimensions of the co-attention mechanism are unchanged, and the local and global outputs of the $i$th co-attention mechanism are denoted as $V_i^a \in \mathbb{R}^{(d/h) \times 2}$ and $V_i^b \in \mathbb{R}^{(d/h) \times 3}$, respectively. Finally, the outputs of the multiple co-attention layers are stitched together as shown in Figure 6 to obtain the feature interrelationship outputs $View_{ab} \in \mathbb{R}^{d \times 2}$ and $View_{ba} \in \mathbb{R}^{d \times 3}$, calculated as shown:
$View_{ab} = concat(V_1^a, V_2^a, \ldots, V_h^a)$
$View_{ba} = concat(V_1^b, V_2^b, \ldots, V_h^b)$
In these equations, $h$ is the number of layers of the multi-head co-attention mechanism. $V_i^a$ is the result of each local feature interacting with the global features, and $V_i^b$ is the result of each global feature interacting with the local features. $concat(\cdot)$ denotes the concatenation of the vectors.
Since the width of the output matrix after dimensionality reduction is set as $d/h$, the input features of the multi-head co-attention mechanism and the dimension of the output features after concatenation remain the same.
In this study, we drew on operations similar to the residual connections in ResNet and the practice in transformers of adding the attention output back to the original features. Specifically, the interaction feature vectors $View_{ab}$ and $View_{ba}$ are concatenated with the pre-interaction local feature $A$ and global feature $B$ of the customer complaint text before being passed to the fully connected layer, so as to avoid the vanishing gradients and network degradation caused by the large number of layers in the model. Finally, a real-valued feature vector $F \in \mathbb{R}^{d \times 10}$, containing information about the relationships between the feature vectors, is obtained, defined as
$F = concat(View_{ab}, View_{ba}, A, B)$
where $concat(\cdot)$ is the vector concatenation operation. $View_{ab}$ is the local–global relational feature obtained on the basis of the local features, $View_{ba}$ is the global–local relational feature obtained on the basis of the global features, $A$ is the local feature before the interaction, and $B$ is the global feature before the interaction.
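The sketch below implements this multi-head co-attention fusion for one batch, treating the local features A as d × 3 matrices and the global features B as d × 2 matrices per sample. The number of heads and the use of a single shared projection per branch (split into h chunks) are implementation assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadCoAttention(nn.Module):
    """Multi-head co-attention between local features A (d x 3) and global features B (d x 2)."""

    def __init__(self, d, heads=4):
        super().__init__()
        assert d % heads == 0
        self.heads, self.d_h = heads, d // heads
        self.proj_a = nn.Linear(d, d)   # stands in for h per-head linear reductions of A
        self.proj_b = nn.Linear(d, d)   # stands in for h per-head linear reductions of B

    def forward(self, A, B):                                  # A: [batch, d, 3], B: [batch, d, 2]
        bsz = A.size(0)
        A_h = self.proj_a(A.transpose(1, 2)).transpose(1, 2).reshape(bsz, self.heads, self.d_h, 3)
        B_h = self.proj_b(B.transpose(1, 2)).transpose(1, 2).reshape(bsz, self.heads, self.d_h, 2)
        C = torch.tanh(A_h.transpose(-2, -1) @ B_h)           # shared similarity, [.., 3, 2]
        V_a = torch.tanh(A_h @ C)                             # local key information, [.., d/h, 2]
        V_b = torch.tanh(B_h @ C.transpose(-2, -1))           # global key information, [.., d/h, 3]
        View_ab = V_a.reshape(bsz, -1, 2)                     # [batch, d, 2]
        View_ba = V_b.reshape(bsz, -1, 3)                     # [batch, d, 3]
        return torch.cat([View_ab, View_ba, A, B], dim=2)     # F: [batch, d, 10]

# Example usage: F = MultiHeadCoAttention(d=512)(torch.randn(8, 512, 3), torch.randn(8, 512, 2))
```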

3.5. Multi-Label Classification Layer

3.5.1. Binary Relevance Model

Binary relevance [34] is the most straightforward of the multi-label classification methods. Its main idea is to decompose the multi-label classification task into q single-label supervised classification tasks. In particular, we trained individual classifiers to make predictions for individual labels and finally combined the q predictions of the individual classifiers into the final label set of the sample.
According to this basic idea, for the $j$th label $y_j$ in the label space, a corresponding binary training set is constructed by considering the correlation between each training sample, $x_i$, and the label $y_j$:
$\phi(Y_i, y_j) = \begin{cases} +1, & \text{if } y_j \in Y_i \\ -1, & \text{otherwise} \end{cases}$
$D_j = \{(x_i, \phi(Y_i, y_j)) \mid 1 \le i \le m\}$
Here, $Y_i$ is the label set of the sample $x_i$, and $m$ represents the total number of samples.
A binary learning algorithm, $B$, can then be used to train a classifier $g_j: X \rightarrow \mathbb{R}$ for the label $y_j$, i.e., $g_j \leftarrow B(D_j)$. Thus, any multi-label training sample, $(x_i, Y_i)$, is involved in training each of the $q$ individual classifiers. For a relevant label, $y_j \in Y_i$, the sample $x_i$ is regarded as a positive example for the classifier $g_j$; on the other hand, for an irrelevant label, $y_k \notin Y_i$, $x_i$ is regarded as a negative example.
For an unseen sample, $x$, binary relevance uses the $q$ individual classifiers to predict its associated label set, $Y$. Eventually, all the predicted independent binary labeling results are combined to obtain the final prediction:
$Y = \{y_j \mid g_j(x) > 0, \; 1 \le j \le q\}$
Using the binary relevance method for classification, the original problem is transformed into a number of unrelated single-label classification problems. For the samples, an empty set of predicted labels will be generated if each classifier predicts it as being a negative example. To avoid this, we can use the T-criterion rule, i.e., when every classifier predicts sample x as a negative sample, the label with the highest prediction probability among all the classifiers is selected and added to the label set.
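A minimal sketch of this multi-label classification layer is given below, with scikit-learn's GradientBoostingClassifier standing in for the GBDT base learner described in the next subsection; the default thresholds and parameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def train_binary_relevance(F, Y, **gbdt_params):
    """Train one binary GBDT classifier per label (BR decomposition).
    F: [m, n_features] fused features; Y: [m, q] binary label matrix."""
    return [GradientBoostingClassifier(**gbdt_params).fit(F, Y[:, j]) for j in range(Y.shape[1])]

def predict_binary_relevance(classifiers, F, thresholds=None):
    """Predict a label set per sample; if every classifier rejects a sample, keep the
    single most probable label (the T-criterion rule mentioned above)."""
    probs = np.column_stack([clf.predict_proba(F)[:, 1] for clf in classifiers])
    thresholds = np.full(probs.shape[1], 0.5) if thresholds is None else np.asarray(thresholds)
    pred = (probs > thresholds).astype(int)
    empty = pred.sum(axis=1) == 0
    pred[empty, probs[empty].argmax(axis=1)] = 1
    return pred
```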

3.5.2. GBDT Integration Algorithm

The GBDT algorithm is trained in an iterative manner, where each iteration uses the negative gradient to measure the performance of the previous round's weak learner. The errors made previously are corrected by fitting a new learner in the negative gradient direction of the mean square error (MSE) loss function. Eventually, a fitting function, $f(x)$, is determined that makes the prediction approach the true value.
The basic principle is to first initialize a rough prediction, $f_0(x)$, and compute the residual $r_0(x) = y - f_0(x)$ as the difference between it and the true value, $y$, of the sample. A suitable fitting function, $h_0$, is then found to fit $r_0(x)$, so that the fitting function for the next round can be expressed as $f_{i+1}(x) = f_i(x) + h_i(x)$, from which the new residual $r_{i+1}(x) = y - f_{i+1}(x)$ is calculated. This is performed iteratively; the iterative correction makes the sample prediction value $f(x)$ approach the true value $y$. The GBDT algorithm is convergent and can achieve local or global optimization.
Specifically, the weak classifier is initialized first:
$f_0(x) = \arg\min_{\rho} \sum_{i=1}^{N} L(y_i, \rho)$
$L(y, f(x)) = (y - f(x))^2$
where $y_i$ is the actual value of the sample and $\rho$ is a constant. $L(\cdot)$ is the loss function. We determined the constant $\rho$ so that the initial prediction loss was minimized, and then the total loss for all $N$ samples was
$L_{all} = \sum_{i=1}^{N} L(y_i, f_m(x_i))$
where $y_i$ is the actual value of the sample $x_i$ and $f_m(x_i)$ is its predicted value after the $m$th iteration. The purpose of the iteration is to minimize the loss value and find the direction of the steepest gradient descent, so the negative gradient computed in each iteration is
$g(x_i) = -\dfrac{\partial L(y_i, f(x_i))}{\partial f(x_i)}$
We constructed a fitting function, $h(x_i; \alpha)$, to fit the negative gradient $g(x_i)$, where $\alpha$ is the residual coefficient. The residual parameters are obtained according to the following equation:
$\alpha_m = \arg\min_{\alpha, \beta} \sum_{i=1}^{N} (g(x_i) - \beta h(x_i; \alpha))^2$
in which $g(x_i)$ is the gradient and $\beta$ is a coefficient.
The next step is to calculate the optimal $\beta_m$ value:
$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L(y_i, f_{m-1}(x_i) + \beta h(x_i; \alpha_m))$
where $\beta_m$ is the weighting factor and $f_{m-1}(x_i)$ is the fitting function after the first $m-1$ iterations.
Eventually, the results of the calculations are merged into the model to update the prediction function:
$f_m(x) = f_{m-1}(x) + \beta_m h_m(x; \alpha_m)$
Here, $h_m(x; \alpha_m)$ is the negative-gradient fitting function obtained in the $m$th iteration. When the preset number of iterations is reached, the iteration terminates.
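The following sketch condenses the procedure above into a minimal gradient boosting loop for the squared loss, where the negative gradient reduces to the residual and each round fits a small regression tree to it; the learning rate, depth, and iteration count are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_iters=100, learning_rate=0.1, max_depth=3):
    """Minimal gradient boosting with squared loss: f_0 is the mean of y, and each
    iteration fits a regression tree h_m to the current residual (the negative gradient)."""
    f0 = float(np.mean(y))                                # constant minimizing the squared loss
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_iters):
        residual = y - pred                               # negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred = pred + learning_rate * tree.predict(X)     # additive model update
    return f0, trees

def gbdt_predict(f0, trees, X, learning_rate=0.1):
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```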

3.6. Output Layer

We set different thresholds for each label to control the prediction results of the labels as needed. The final output is the category labels predicted by the model.

4. Experimental Results and Analysis

In the experimental phase, we tested whether the proposed multi-label classification model outperforms existing algorithms with regard to civil aviation complaint texts. We also verified whether the model can achieve the efficient multi-label classification of the complaint texts. The classification outcomes facilitated an in-depth analysis of the primary issues and trends in civil aviation complaints. This helped identify common complaint areas and potential areas for improvement, thereby providing valuable insights for airlines, airports, and sales platforms to enhance their services.

4.1. Experimental Data

In this study, the performance of the proposed model was tested using a complaint text dataset from civil aviation and related departments. The dataset consisted of complaint texts from between 1 and 11 January 2022, initiated by passengers and targeted to domestic airlines, domestic airports, and foreign and Hong Kong, Macao, and Taiwan airlines, as well as aviation sales network platforms. These texts were mainly in Chinese, consisting of 13 categories and totaling 5597 samples. The data were divided into training, testing, and validation sets according to the ratio of 7:2:1. During the data processing, we gave special attention to the segmentation of Chinese text and the construction of domain-specific feature word sets, aiming to ensure that the model accurately captures the unique characteristics of Chinese text. Notably, despite the test data being Chinese text, the proposed model is theoretically applicable to any text. This was achieved by simply considering the linguistic conventions of different languages during text segmentation and the construction of domain-specific feature word sets. The statistical results of the data utilized in this study are presented in Table 1.

4.2. Measurement Indicators

The algorithm performance was evaluated using the classical multi-label classification performance metrics: the F1 value, recall, and Hamming loss. The quantities used in these metrics are defined as follows: $T$ is the number of positive samples predicted as positive (true positives), $F$ is the number of positive samples predicted as negative (false negatives), and $A$ is the number of negative samples predicted as positive (false positives).
(1) Recall:
$R = \dfrac{T}{T + F}$
The recall, $R$, was used to evaluate the proportion of positive samples whose category was successfully predicted.
(2) $F_1$ value:
$P = \dfrac{T}{T + A}$
$F_1 = \dfrac{2 P R}{P + R}$
The precision rate $P$ denotes the proportion of samples predicted as positive that are correctly categorized, and $R$ is the recall of the predicted results. $F_1$ is the harmonic mean of the precision rate and the recall, which is a more comprehensive metric than either individually.
(3) Hamming loss:
$H = \dfrac{1}{p} \sum_{i=1}^{p} \dfrac{1}{q} \left| h(x_i) \, \Delta \, Y_i \right|$
The Hamming loss was used to assess the proportion of misclassified labels. Here, $p$ denotes the number of samples, $q$ denotes the number of labels, $h(x_i)$ is the predicted label set of the sample $x_i$, and $Y_i$ is the true label set of sample $x_i$. $\Delta$ denotes the symmetric difference between the two sets, and thus $|h(x_i) \, \Delta \, Y_i|$ counts the labels predicted incorrectly. When each sample in the dataset is associated with only a single label in the label set, the Hamming loss degenerates to $2/q$ times the traditional misclassification rate.
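A small NumPy sketch of these metrics for binary label matrices is given below; it uses micro-averaging over all label positions, which is one common choice and an assumption here.

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Micro-averaged recall, precision, F1, and Hamming loss for [p, q] binary matrices."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    T = np.sum((Y_pred == 1) & (Y_true == 1))   # positives predicted as positive
    F = np.sum((Y_pred == 0) & (Y_true == 1))   # positives predicted as negative
    A = np.sum((Y_pred == 1) & (Y_true == 0))   # negatives predicted as positive
    recall = T / (T + F)
    precision = T / (T + A)
    f1 = 2 * precision * recall / (precision + recall)
    hamming = np.mean(Y_true != Y_pred)         # fraction of label positions predicted wrongly
    return recall, precision, f1, hamming

# Example: two samples, three labels.
print(multilabel_metrics([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]))
```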

4.3. Result Analysis

4.3.1. Model Performance Comparison

In order to verify the effectiveness of combining multi-feature co-attention mechanisms and integration algorithms for classification, the following multi-label classification methods were chosen for comparative experiments:
(1) BR-GBDT. This is a multi-label classification method based on the integrated learning of the binary relevance (BR) and gradient boosting decision trees (GBDTs) which uses maximum pooling to extract features. It also combines the BR and GBDTs to achieve automated, high-accuracy multi-label classification.
(2) Multi-Head Co-attention Classifier. This is a multi-label classification model based on the multi-head co-attention mechanism. First, the BERT model is used to realize text vectorization representation, then the Text-CNN and BiLSTM multi-channel feature extraction networks are used to extract the local and global features of the complaint text, respectively. Finally, a co-attention mechanism is used to learn the relationship between the local and global features and realize the accurate classification of the customer complaint text.
(3) Multi-Feature Attention GBDT Classifier (MAG). This is a text classification algorithm that fuses multiple features and attention mechanisms and is the method proposed in this study. It considers multiple features, including label features, in the text vectorization representation stage, fuses the extracted global and local features of the complaint text using a multi-head co-attention mechanism in the feature extraction stage, and uses an integration method to obtain optimal multi-label classification. The model can fully extract text features as well as achieve efficient integration, thus enhancing the performance of the classifier.
As shown in Table 2, the proposed MAG model significantly outperforms the traditional BR-GBDT and Multi-Head Co-attention Classifier models in terms of both the F1 value and recall. Compared with the other two types of models, the MAG model improved the F1 value by 3.08 and 0.60 percentage points, respectively. Also, the Hamming loss of the MAG model was significantly lower than that of the other two methods. This means that the MAG can predict labels more accurately in multi-label classification tasks, reducing the risk of misclassification.
The combined application of multi-feature fusion, the multi-head co-attention mechanism, and GBDT integration enables the MAG model to capture the relationships between features and labels more accurately. Therefore, the MAG model has great potential to be used for the multi-label classification of complaint texts, improving the classification effect and enhancing the feasibility of application. The results of this research provide support for applying multi-feature co-attention mechanisms and integration algorithms in the field of multi-label text classification.

4.3.2. Analysis of Ablation Experiments

To verify the effectiveness of the prior extraction of labeled feature words, the global feature extraction, and the local feature extraction, five sets of comparative ablation experiments were performed. They involved a model with no label information (Model 1), a model with no global features (Model 2), a model with no local features (Model 3), a model with no fusion of features using the multi-head attention mechanism (Model 4), and the MAG model proposed in this paper. The results are shown in Table 3. It can be seen that the MAG model showed different degrees of improvement in each index compared to the four ablation models; this suggests the effectiveness of the model in considering the label information, global features, and local features.
As shown in Table 3, the proposed method achieved the best F1 value, recall, and Hamming loss index compared with feature extraction methods that do not use label features, global features, local features, or multi-attention fusion. Thus, the label features are able to enrich the textual information, and the fusion of the multi-channel feature extraction network improves the effect significantly compared with the single-channel feature extraction network. Moreover, the combination of the global and local features enables complementarity, which can facilitate the more comprehensive extraction of the complaint text information and alleviate the issues with short and sparse complaint texts. The results of this ablation experiment emphasize the importance of these elements in the original model, which play a key role in improving the performance for multi-label classification.

4.3.3. Label Classification Effect Analysis

Using the same dataset, the MAG model was applied for multi-label classification, and a comparison of the recognition effect of each label with the number of texts was obtained, as shown in Figure 7.
According to the sample proportion situation shown in Table 1, it is clear that ticket service and irregular flight service are the two types of problems with the highest degree of concern for civil aviation passengers: the number of texts on ticket service accounted for 53.39% of all texts, and the number of texts on irregular flight service accounted for 24.66% of all texts. An analysis of possible reasons for this is given as follows:
(1) High-frequency incidents: Complaints in the categories of ticketing services and irregular flight services are usually closely related to passengers’ daily travel. Problems in these two categories are common and therefore have a higher chance of giving rise to complaints.
(2) Direct impact on the passenger experience: Ticketing and flight services have a direct impact on the overall passenger experience and travel plans. Any problems related to ticketing errors, itinerary changes, flight delays, cancellations, etc., may lead to traveler complaints, as they can significantly affect passenger travel.
(3) Measurability: Problems in the ticket service and irregular flight service categories are usually easily measurable and recordable and therefore more likely to be recorded in complaint data. In contrast, other types of problems may be less easily captured and recorded.
(4) Cumulative effect: Problems with ticketing or flight services that occur during a single trip may result in multiple complaints, as they may affect multiple stages and aspects of the trip. This can lead to an increase in the percentage of complaints in both categories.
In summary, the larger share of complaints in the ticket service and irregular flight service categories is likely because these issues are more prevalent, directly affect the traveler’s experience, are easily documented and measured, and may generate multiple complaints during a single trip. These issues constitute important areas that airlines, airports, sales platforms, and other related institutions need to pay attention to and address.
According to the classification results, the majority-class samples appeared more frequently in the training data, so the model is more likely to learn their features and patterns, and its classification performance on these samples tends to be more accurate. As shown in Figure 7, ticket service has the highest multi-label classification F1 value and the best classification effect, followed by irregular flight service. In contrast, minority-class labels are recognized less effectively because they appear less often in the training data and their textual features are therefore learned less well.
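The per-label comparison summarized in Figure 7 can be reproduced in outline as follows; the label names and their order are assumed to match Table 1, and the snippet is a sketch rather than the exact evaluation script used in our experiments.

```python
import numpy as np
from sklearn.metrics import f1_score

def per_label_effect(y_true, y_pred, label_names):
    """Print each label's F1 value next to its number of texts (support)."""
    f1_per_label = f1_score(y_true, y_pred, average=None)  # one score per label
    support = y_true.sum(axis=0)                            # number of texts per label
    for i in np.argsort(-support):                          # most frequent labels first
        print(f"{label_names[i]:<32s} texts={int(support[i]):5d}  F1={f1_per_label[i]:.4f}")
```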

5. Conclusions and Future Work

To better realize the multi-label text classification of traveler complaint data in the civil aviation field, we proposed a novel multi-label classification model for complaint texts. The model fuses label information and attention mechanisms: it combines the semantic features extracted by BERT with the label features extracted based on association rules to represent text vectors, extracts local and global features using a Text-CNN and a BiLSTM, respectively, fuses these features using a multi-head co-attention mechanism that reduces feature loss by combining information at different granularities, and finally constructs a GBDT integrated classifier for each label. Experiments showed that, compared with other models, the proposed model achieved better results on the civil aviation complaint text dataset, effectively improving the classification performance.
Through the analysis of the classification results, we can provide the following recommendations for airlines, airports, sales platforms, and other related customer service teams:
First, as the two main areas where passenger complaints are most concentrated are ticketing services and irregular flight services, service providers should focus on improving these areas. Airlines can optimize ticketing processes to reduce errors and disputes, while airports can enhance real-time flight information updates and passenger services to minimize the inconvenience caused by flight delays or cancellations. Sales platforms can provide more accurate ticketing information and after-sales services to boost passengers’ confidence and satisfaction in their purchases.
Second, in response to the differences in classification performance, service providers can make targeted improvements in data collection and labeling. For categories such as ticketing services and irregular flight services, which exhibit better classification results, valuable information can be further mined from the data to support service improvements and decision-making. For categories with poorer classification results, particularly those with few labeled samples, data collection and labeling efforts should be increased to enhance the model’s generalization ability and accuracy.
Additionally, our research findings suggest that service providers should pay attention to the cumulative effect of passenger complaints. Issues related to ticketing or flight services that arise during a single trip may trigger a chain reaction, leading to multiple complaints. Therefore, service providers should establish effective complaint handling mechanisms, respond promptly to passengers’ issues, and prevent the escalation and worsening of problems.
In conclusion, our research not only reveals the primary issues and distribution patterns of civil aviation passenger complaints but also provides actionable directions and measures for airlines, airports, sales platforms, and other customer service teams. By targeting service process optimization, strengthening data collection and labeling, and establishing effective complaint handling mechanisms, service providers can further enhance service quality, increasing passenger satisfaction and loyalty.
In future work, hierarchical label features will be considered in multi-label text classification to further refine and extract the feature relationships of labels, potentially yielding more accurate category prediction. Simultaneously, we will explore innovations in natural language processing methods and tools to optimize the model performance, providing more precise and efficient solutions for the processing and analysis of passenger complaint data in the civil aviation industry.

Author Contributions

Conceptualization, H.C., P.Z. and H.L.; methodology, H.C. and P.Z.; validation, H.C., P.Z. and X.S.; formal analysis, H.C. and X.S.; data curation, H.C., P.Z. and H.L.; writing—original draft preparation, H.C. and X.S.; writing—review and editing, H.C. and X.S.; visualization, H.C. and X.S.; supervision, H.C. and H.L.; project administration, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds of CASTC (Grants No. x222060302544 and No. x232060302115) and the project titled “Application scenarios and applicability analysis of AI technology in the civil aviation field”.

Data Availability Statement

The data used in this paper can be requested from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Figure 1. Multi-label classification model for civil aviation service quality complaint texts incorporating multiple features and attention mechanisms.
Figure 2. A flowchart of the BERT model.
Figure 3. An overview of the Text-CNN model, where different colors correspond to different sizes of convolutional kernels.
Figure 4. An overview of the BiLSTM model.
Figure 5. Schematic diagram of feature fusion.
Figure 6. A schematic diagram of the multi-head co-attention mechanism.
Figure 7. A comparison of the number of texts for each label with the trend in the classification effect.
Table 1. Experimental data description.

| Form | Number of Texts | Proportion of Samples |
|---|---|---|
| Ticketing service | 2988 | 53.39% |
| Irregular flight service | 1380 | 24.66% |
| Check-in and boarding | 468 | 8.36% |
| Baggage services | 248 | 4.43% |
| Air services | 127 | 2.27% |
| Member services | 108 | 1.93% |
| First aid | 99 | 1.77% |
| Notice of information | 75 | 1.34% |
| Special passenger services | 35 | 0.63% |
| Terminal basic services | 29 | 0.52% |
| Overselling | 25 | 0.45% |
| Airport merchant services | 10 | 0.18% |
| Ground transportation services | 5 | 0.09% |
Table 2. Comparison of model performance results. The bold values represent the best performances.

| Model | F1 Value | Recall | Hamming Loss |
|---|---|---|---|
| BR-GBDT | 0.7200 | 0.7504 | 0.0856 |
| Multi-Head Co-attention | 0.6952 | 0.7086 | 0.0905 |
| MAG (Ours) | **0.7260** | **0.7656** | **0.0847** |
Table 3. Results of ablation experiments. The bold values represent the best performances.

| Ablation Experiment | F1 Value | Recall | Hamming Loss |
|---|---|---|---|
| Model 1 | 0.7130 | 0.7549 | 0.0876 |
| Model 2 | 0.7164 | **0.7656** | 0.0876 |
| Model 3 | 0.7073 | 0.7362 | 0.0876 |
| Model 4 | 0.7182 | **0.7656** | 0.0868 |
| MAG (Ours) | **0.7260** | **0.7656** | **0.0847** |