4.1. Dataset and Training Details
The financial customer-service corpus was provided by a financial company in China and contains 2977 texts about loan services.
Figure 5 illustrates the distribution of the original corpus data, with label indices ranging from 0 to 48. The distribution varies somewhat across categories, the ratio of the largest category size to the smallest being approximately 3:1. Moreover, as shown in Figure 6, the original corpus texts are mostly 3 to 20 words long. The samples also mix Chinese and English, which easily leads to semantic confusion. To train the text classification network well, the semi-supervised data augmentation method described in Section 3.2 is used to expand the original corpus to 3904 texts, which are split into training/validation/test sets of 2688/384/832 samples. The 2688 training texts are then further augmented by several traditional methods, namely random deletion (twice), synonym replacement (twice), and back-translation (once), to construct the final training set. In total, 16,128 texts augmented via semi-supervised and traditional data augmentation were used for training, while the 384 validation and 832 test texts were obtained via semi-supervised data augmentation alone.
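The traditional augmentation recipe above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the synonym dictionary are hypothetical, and back-translation is omitted since it requires an external machine-translation system.

```python
import random

def random_deletion(tokens, p=0.1, seed=None):
    """Drop each token with probability p, keeping at least one token."""
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

def synonym_replacement(tokens, synonyms, n=1, seed=None):
    """Replace up to n tokens that have an entry in the synonym dictionary."""
    rng = random.Random(seed)
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

def augment(tokens, synonyms, seed=0):
    """Per training text: random deletion twice and synonym replacement twice.
    (Back-translation, applied once in the paper, is not sketched here.)"""
    variants = [random_deletion(tokens, seed=seed + k) for k in range(2)]
    variants += [synonym_replacement(tokens, synonyms, seed=seed + k) for k in range(2)]
    return variants
```

With one original text kept alongside five augmented variants, each of the 2688 training texts yields six copies, matching the 16,128 training texts reported above.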
All experiments were conducted on a computer with an Intel(R) Xeon(R) E5-2630 v3 CPU @ 2.40 GHz, 64 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti GPU. Python 3 was used as the programming language, with PyTorch as the deep learning framework due to its widespread use in the field. The pre-trained model used in the experiments was bert-base-chinese, with the following settings: a batch_size of 8, a max_len of 30 for samples and 400 for labels, the cross-entropy loss function, and the Adam optimizer with a learning rate of 2 × 10−5 multiplied by an attenuation factor of 0.1 every 10 epochs, for 30 epochs in total. Four commonly used metrics were used for a comprehensive evaluation of intention recognition: accuracy (Acc), precision (Prec), recall (Recall), and F1-score (F1).
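For reference, the four metrics can be computed as below. This is a minimal pure-Python sketch that assumes macro averaging over classes; the paper does not state which averaging scheme was used.

```python
from collections import defaultdict

def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted p, but true class differs
            fn[t] += 1          # true class t was missed
    precs, recs, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(labels)
    acc = sum(tp.values()) / len(y_true)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n
```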
4.2. Comparisons with Other Deep Learning Methods
To validate the proposed hybrid intention recognition framework, we compared it against eight deep learning-based text classification methods: two CNN-based models (TextRCNN [26] and DPCNN [27]) and six Transformer-based models (FNet [30], BERT [14], Xiong et al. [35], FinBERT [44], BERT-DBRU [53], and CBLMA-B [54]). Five of these models are designed for general text classification: the two CNN-based models, a Transformer-structured model (FNet), and two BERT-based models (BERT [14] and Xiong et al. [35]). Two BERT-based models, BERT-DBRU and CBLMA-B, are designed specifically for intention recognition. FinBERT is a BERT model fine-tuned for the financial domain, demonstrating strong sentiment classification capabilities on financial texts. During training, the two CNN-based models were trained directly on our training set. FNet and FinBERT were fine-tuned on our training set from their own pre-trained models, while the other four BERT-based models were fine-tuned on our training set from the pre-trained Google BERT model [14]. For fair comparison, all models were trained according to the configurations described in their respective studies and tested on our test set. The results achieved by the different methods are shown in Table 4.
The DPCNN [27] enlarges its receptive field through multiple down-sampling operations to capture long-term dependencies in the text. However, learning the associations between phrases requires sufficient data, and in real scenarios customers express the same intention category in different ways. Consequently, DPCNN does not generalize well to new text expressions.
The TextRCNN [26] uses an RNN to capture the contextual information of the text, which yields much better intention recognition performance than DPCNN. However, TextRCNN also relies on one-hot-style encodings, which limits its ability to capture word-to-word associations. Moreover, the short sentences typical of financial service scenarios provide only limited contextual information.
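For concreteness, TextRCNN's contextualised word representation, a left context and a right context built by simple recurrences and concatenated with the word embedding, followed by max-pooling, can be sketched as below. This is a simplified NumPy illustration of the recurrent structure with random placeholder weights, not the trained model.

```python
import numpy as np

def textrcnn_representation(emb, W_l, W_sl, W_r, W_sr):
    """emb: (n, d) word embeddings for a sentence of n words.
    Each word becomes [left context; embedding; right context];
    the sentence vector is an element-wise max over positions."""
    n, d = emb.shape
    h = W_l.shape[0]                       # context vector size
    left = np.zeros((n, h))
    right = np.zeros((n, h))
    for i in range(1, n):                  # left context: forward scan
        left[i] = np.tanh(W_l @ left[i - 1] + W_sl @ emb[i - 1])
    for i in range(n - 2, -1, -1):         # right context: backward scan
        right[i] = np.tanh(W_r @ right[i + 1] + W_sr @ emb[i + 1])
    x = np.concatenate([left, emb, right], axis=1)   # (n, 2h + d)
    return x.max(axis=0)                   # max-pooling over positions
```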
FNet [30] replaces the self-attention layer of the traditional Transformer with a Fourier transform, capturing feature information by mapping the features into the frequency domain. As a result, it achieves much better intention recognition performance than the two CNN-based models and several BERT models. However, the Fourier transform can lose some long-term dependencies, which limits its recognition ability to some extent.
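The Fourier mixing idea can be sketched in a few lines. This is a simplified NumPy illustration of FNet's parameter-free mixing sublayer, not the full block, which also includes a feed-forward sublayer and learned layer norms.

```python
import numpy as np

def fourier_token_mixing(x):
    """FNet-style mixing: a 2-D DFT over the sequence and hidden dimensions,
    keeping only the real part. It mixes information across all tokens with
    no learned parameters, standing in for self-attention."""
    return np.fft.fft2(x).real

def fnet_mixing_block(x, eps=1e-6):
    """Mixing plus residual connection and a (non-learned) layer normalisation."""
    h = x + fourier_token_mixing(x)
    mu = h.mean(axis=-1, keepdims=True)
    sd = h.std(axis=-1, keepdims=True)
    return (h - mu) / (sd + eps)
```

A unit impulse at one position spreads to every output position after the transform, showing that each output token depends on all input tokens even though no attention weights are learned.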
Due to BERT’s ability to effectively capture global contextual semantic representations, the five existing BERT-based methods (BERT [14], Xiong et al. [35], FinBERT [44], BERT-DBRU [53], and CBLMA-B [54]) perform significantly better than the two CNN-based models. Xiong et al. [35] cascade the BERT structure with bi-GRUs to improve upon the pre-trained Google BERT model, while BERT-DBRU [53] concatenates texts with their corresponding text labels as input to the pre-trained Google BERT model; both therefore perform somewhat better intention recognition than Google BERT [14]. Since FinBERT [44] directly fine-tunes the Google BERT model on a large amount of financial text data, it achieves fairly good intention recognition performance on our dataset, with an accuracy of 86.78%, a precision of 87.78%, a recall of 88.29%, and an F1 score of 87.68%. Comparatively, CBLMA-B [54] uses a relatively complex dual-branch feature extraction structure incorporating BERT, CNN, BiLSTM, and self-attention, and thus achieves the second-best performance, with an accuracy of 87.50%, a precision of 88.68%, a recall of 88.63%, and an F1 score of 88.19%. However, its simple concatenation of the dual-branch features cannot reveal the inherent relationship between texts and their labels, which limits its intention recognition ability.
Our model performs best in the intention recognition task for financial customer service, achieving an accuracy of 89.06%, a precision of 90.27%, a recall of 90.40%, and an F1 of 90.07%. This is primarily attributed to four aspects. Firstly, the label semantic inference scheme provides effective label information for datasets lacking real label semantics and serves as a constraint for subsequent text classification. Secondly, the text classification network seamlessly integrates shallow and deep features of the corpus data and their label semantics at different levels for intention recognition. Thirdly, the StD mechanism enriches the semantic features in the deep network, alleviating the issue of feature collapse. Additionally, the augmented dataset provides more trainable data, enabling the model to learn more variations in the intention categories.
Table 4 also reports the inference time for the 832 test texts for each model. The two CNN-based models have a clear advantage in inference speed due to their simple structures and small parameter counts. The Fourier transforms in FNet are time-consuming, giving FNet the longest inference time at 12.65 s. Since the three BERT-based models (i.e., Xiong et al. [35], FinBERT [44], and BERT-DBRU [53]) use relatively simple schemes to improve the classification ability of the Google BERT model [14], their inference times are similar, ranging from 0.8 s to 1.7 s. In comparison, the dual-branch structure of CBLMA-B [54] makes it time-consuming, with an inference time of 4.65 s. Due to the feature interactions in its dual-branch structure, our model consumes more inference time than CBLMA-B [54], at 11.87 s.
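Inference-time comparisons of this kind can be measured with a simple harness like the following. This is a generic sketch: actual GPU timing would additionally require device synchronisation (e.g., torch.cuda.synchronize) before reading the clock.

```python
import time

def time_inference(model_fn, inputs, warmup=3):
    """Wall-clock time for one pass over a test set.

    model_fn: callable mapping one input to a prediction.
    A few warm-up calls are made first so that lazy initialisation
    (weight loading, kernel compilation) is not counted.
    """
    for x in inputs[:warmup]:
        model_fn(x)
    start = time.perf_counter()
    preds = [model_fn(x) for x in inputs]
    elapsed = time.perf_counter() - start
    return preds, elapsed
```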
4.3. Ablation Experiments
Here, we conducted ablation experiments to examine the influence of semi-supervised data augmentation (SDA) and label semantic inference (LSI) on the hybrid framework, and the influence of the network structures on the text classification network. Since SDA and LSI generate the augmented corpus data and the corresponding label semantics, respectively, which serve as inputs to the text classification network, they do not participate in its joint training. We therefore conducted two separate ablation experiments: one on SDA and LSI, and one on the text classification network structures, as shown in Table 5 and Table 6, respectively.
4.3.1. Framework Ablation Study
To validate SDA and LSI in our hybrid framework, we conducted an ablation experiment on the text classification network (baseline model) with and without SDA and/or LSI. In the hybrid framework with only SDA, the baseline model takes the augmented corpus data as input, without label semantics; in the framework with only LSI, it takes the original corpus data together with their corresponding label semantics.
As shown in Table 5, the hybrid framework with only LSI performs better than the baseline model, indicating that incorporating label semantic knowledge into the corpus data effectively helps the text classification network learn label-related semantic information and alleviates the influence of ambiguous expressions. However, insufficient corpus data prevent LSI from reliably finding the approximately central text that represents each label's semantics. Thus, the hybrid framework with only LSI is inferior to the one with only SDA (87.21% vs. 87.59% F1), demonstrating the importance of sufficient corpus data for intention recognition. When both SDA and LSI are combined with the baseline model, the hybrid framework achieves the best intention recognition performance.
4.3.2. Classification Module Ablation Study
To validate SDF and StD in our designed text classification network, we conducted an ablation experiment on the hybrid framework with and without SDF and/or StD. Specifically, the baseline framework in this ablation study retains SDA and LSI, which supply the augmented corpus data and the corresponding label semantics, but uses the dual-branch baseline text classification model without SDF and StD.
Since StD can, to some extent, alleviate feature collapse as the network depth increases and thereby enrich the representation of deep features, the hybrid framework with only StD performs better than the baseline framework (89.39% vs. 88.95% F1). However, without effective interactions between label semantics and corpus semantics, the text classification network cannot effectively integrate label semantics into the corpus information. Therefore, the network using only StD is less effective than the one using only SDF, with F1 scores of 89.39% and 89.48%, respectively. This indicates the importance of label semantic fusion for intention recognition. When the baseline framework is combined with both SDF and StD, the intention recognition performance reaches its best value.