Article

Multi-Task Learning for Sentiment Analysis with Hard-Sharing and Task Recognition Mechanisms

1 Key Laboratory of Electromagnetic Wave Information Technology and Metrology of Zhejiang Province, College of Information Engineering, China Jiliang University, Hangzhou 310018, China
2 National University of Singapore, 4 Architecture Drive, Singapore 117566, Singapore
3 Fujian Province University Key Laboratory of Computational Science, School of Mathematical Sciences, Huaqiao University, Quanzhou 362021, China
* Author to whom correspondence should be addressed.
Information 2021, 12(5), 207; https://doi.org/10.3390/info12050207
Submission received: 19 March 2021 / Revised: 9 May 2021 / Accepted: 10 May 2021 / Published: 12 May 2021

Abstract

In the era of big data, multi-task learning has become one of the crucial technologies for sentiment analysis and classification. Most existing multi-task learning models for sentiment analysis are based on the soft-sharing mechanism, which has less interference between different tasks than the hard-sharing mechanism. However, the model can also extract fewer essential features with the soft-sharing method, resulting in unsatisfactory classification performance. In this paper, we propose a multi-task learning framework based on a hard-sharing mechanism for sentiment analysis in various fields. The hard-sharing mechanism is achieved by a shared layer that builds the interrelationship among multiple tasks. We then design a task recognition mechanism to reduce the interference in the hard-shared feature space and to enhance the correlation between multiple tasks. Experiments on two real-world sentiment classification datasets show that our approach achieves the best results and significantly improves the classification accuracy over existing methods. The task recognition training process enables a unique representation of the features of different tasks in the shared feature space, providing a new way to reduce interference in the shared feature space for sentiment analysis.

1. Introduction

With the fast development of e-commerce, automated sentiment classification (ASC) of reviews on various products is in demand in the field of natural language processing (NLP) [1]. ASC methods classify reviews into positive and negative sentiment classes with satisfactory efficiency and accuracy [2]. More specifically, ASC aims to extract the attitudes and perceptions (such as positive or negative) that users express in a text.
Recently, many forms of neural networks (NN) have been proposed for ASC [3,4,5]. Inspired by the human ability to handle multiple tasks simultaneously, the multi-task learning neural network (MTL-NN) has been proposed, extending the NN with a more sophisticated internal structure. The MTL-NN is a hierarchical NN structure that performs sentiment analysis on input data containing multiple tasks [1]. For example, an online shopping website contains review comments associated with various products, such as books, televisions, handphones, etc. A traditional single-task learning (STL) NN has difficulty analyzing text that mixes different product types, whereas an MTL-NN handles the entire text, including comments on different products. There are in general two main mechanisms for multi-task learning methods: (a) the soft-sharing mechanism, which applies a task-specific layer to different tasks [6,7,8]; (b) the hard-sharing mechanism, which utilizes a powerful shared feature space to extract features for different tasks [9,10,11]. Both sharing mechanisms have advantages and limitations.
With the continuous development of different versions of MTL, the soft-sharing mechanism has been widely adopted for ASC under different situations. However, there exist problems for the soft-sharing mechanisms, such as handling the interference between tasks and insufficient feature representations [12]. To address the above-mentioned issues, this paper proposes a sentiment analysis model that is based on the hard-sharing mechanism. A task recognition mechanism is proposed, which allows each task to obtain a unique representation in the hard-sharing feature space. The implemented model consists of three main steps. The first step consists of a lexicon encoder, which is used to encode the input data. It adds position and segment embedding to the word embedding. The second step contains a shared encoder, which is used to extract features from the data of several different tasks. These features form a shared feature space that provides supportive features for the subsequent private layers. The third step employs a private encoder, which consists of two layers: one is the task-specific layer for recognizing sentiment information, and the other is the task recognition layer.
The main contributions of our study can be summarized as follows:
  • The proposed model addresses the issue of interference and generalization of the shared feature space during multi-task learning.
  • The proposed model comprises three encoders, including a lexicon encoder, a shared encoder, and a private encoder, to improve the quality of extracted features.
  • We propose a task recognition mechanism that gives each task a unique representation in the shared feature space.

2. Related Works

Sentiment classification is one of the popular fields of natural language processing (NLP) [13,14], and various sentiment classification methods have been proposed in recent years. For example, the Word2vec [15] technique, proposed by Google in 2013, significantly improves on traditional feature engineering methods for text classification. Word2vec maps characters into low-dimensional vectors, representing the intrinsic connections between words [16,17]. Word2vec has accelerated the development of deep learning techniques in the field of sentiment classification.
More recently, various deep learning algorithms were proposed for sentiment analysis, such as TextCNN [18], TextRNN [7], HAN [19], etc. These algorithms use different neural networks to process the text of different lengths. For example, convolutional neural networks [20,21] are used to extract features from sentences. Recurrent neural networks are used to extract features from paragraphs [22,23], and attention mechanisms are used to extract features from articles [19,24]. However, these algorithms cannot be directly applied to multi-task sentiment analysis.
The MTL approach allows the model to extract features from multiple tasks simultaneously. The MTL technique was first used in the field of computer vision [25,26]. Numerous experiments have demonstrated that MTL is better than single-task learning methods on multi-task sentiment analysis [27,28]. The latent correlations among similar tasks that MTL can extract are potentially helpful in improving the classification results.
Based on the neural network structure, MTL can be divided into soft-sharing MTL and hard-sharing MTL [29]. The soft-sharing mechanism divides the features into shared features and private features, which reduces the interference between multiple tasks [30,31]. However, it requires learning separate features for each task as private features. These private features are not shared. Thus, the parameters are not used effectively [32]. The hard-sharing mechanism allows the shared layer to extract features from all tasks, which can be used by all tasks [6]. Multiple tasks interfere with each other in the shared layer, but the interference between tasks is exploited to improve generalizability. To reduce the interference, the hard-sharing mechanism provides a private layer for each task [33].

3. Methodology

The overall structure of the model is shown in Figure 1, where the lexicon encoder is used to encode the data for each task into a lexicon embedding. The shared encoder extracts the semantic features from the embedding. A shared feature space is formed by the semantic features extracted by the shared encoder. The private encoder consists of two parts: one is task-specific layers, which are used to learn semantic features related to the source of the review, and the other is the task recognition layer.
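The data flow described above can be sketched as a toy forward pass. This is a minimal illustration rather than the authors' implementation: single linear layers stand in for the BERT-based shared encoder and for the multi-scale CNN heads, and the class name `HardSharingSketch`, the weight names, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class HardSharingSketch:
    """Toy model: one shared encoder, one private head per task, one task recognizer."""

    def __init__(self, d_in, d_shared, num_tasks, num_classes=2, seed=0):
        rng = np.random.default_rng(seed)
        # Shared encoder weights, used by every task (stand-in for BERT).
        self.W_shared = rng.normal(scale=0.1, size=(d_in, d_shared))
        # One private sentiment head per task (stand-in for a multi-scale CNN).
        self.W_task = [rng.normal(scale=0.1, size=(d_shared, num_classes))
                       for _ in range(num_tasks)]
        # Task recognition head: predicts which task a sample came from.
        self.W_rec = rng.normal(scale=0.1, size=(d_shared, num_tasks))

    def forward(self, x, task_id):
        # All tasks pass through the same shared feature space.
        shared = np.tanh(x @ self.W_shared)
        # Only the head that belongs to task_id produces the sentiment prediction.
        sentiment_probs = softmax(shared @ self.W_task[task_id])
        # The recognition head sees the same shared features.
        task_probs = softmax(shared @ self.W_rec)
        return sentiment_probs, task_probs
```

Because both heads read the same shared features, gradients from the task recognition output would also reach the shared encoder, which is the effect the task recognition mechanism relies on.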

3.1. The Lexicon Encoder

The lexicon encoder converts the input text into word vectors. Its input can be a sentence or a paragraph. Its output is the sum of the corresponding token, segment, and position embeddings. The position embedding is calculated from the positions of the input vectors, as shown in Equation (1).
$$PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right),$$
$$PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right),$$
where $pos$ is the position of the input vector, $i$ is the dimension index, and $d_{\text{model}}$ is the dimension of the word vector. The lexicon encoder converts the input $X$ into $d_{\text{model}}$-dimensional embeddings for the shared encoder (Section 3.2).
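As a concrete illustration of the position embedding formula, the following NumPy sketch fills the even columns with the sine term and the odd columns with the cosine term. The function name and the restriction to an even embedding dimension are assumptions made for this example.

```python
import numpy as np


def sinusoidal_position_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Return a (num_positions, d_model) matrix of sinusoidal position encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    pos = np.arange(num_positions)[:, None]       # column of positions
    two_i = np.arange(0, d_model, 2)[None, :]     # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i + 1)
    return pe
```

The resulting matrix would be added element-wise to the token and segment embeddings of the same shape.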

3.2. Shared Encoder

The shared encoder is used to extract the common sentence features in multiple tasks and places them into a shared feature space. To make the shared feature space contain richer semantic features, the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model [34,35] is introduced into our shared encoder. The pre-trained BERT model consists of multiple Transformer Encoders, which can be used to encode sentences. The structure of the shared encoder is shown in Figure 2.
As shown in Figure 2, the transformer encoder is composed of a stack of N = 6 identical layers and specifically addresses the issue of learning long-term dependencies. Each layer is composed of two sub-layers: a multi-head attention mechanism and a position-wise feed-forward network. The two sub-layers are connected by a residual connection [20] and layer normalization [21].
Multi-head attention allows the model to attend to information at different positions. It is composed of several self-attention heads, which linearly project the queries, keys, and values $n$ times. Self-attention takes queries and keys of dimension $d_k$ and values of dimension $d_v$. We compute the self-attention output with Equation (3):
$$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where $Q$ is the queries matrix, $K$ is the keys matrix, $V$ is the values matrix, and $d_k$ is the dimension of the queries and keys. The softmax function yields the weight of every input token.
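The self-attention computation can be sketched in a few lines of NumPy. This is a minimal example for a single unbatched sequence; the function names are chosen for illustration, and the attention weights are returned alongside the output only to make them inspectable.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention.

    Q: (num_queries, d_k), K: (num_keys, d_k), V: (num_keys, d_v).
    Returns the attended values and the attention weights.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of every query to every key
    weights = softmax(scores, axis=-1)   # one weight per input token, rows sum to 1
    return weights @ V, weights
```

Dividing by the square root of the key dimension keeps the dot products from growing with the dimension, which would otherwise push the softmax towards very peaked weights.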
On each of the projected versions, the self-attention is computed in parallel. The outputs are concatenated. The final output can be calculated by projecting them again.
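$$H(Q, K, V) = (h_1 \oplus h_2 \oplus \cdots \oplus h_n)\,W^{O},$$
$$h_i = \mathrm{Att}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$
where $\oplus$ is the concatenation operator, $n$ is the number of heads, and $h_i$ is the $i$-th attention representation of multi-head attention. The projections are parameter matrices $W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^{O} \in \mathbb{R}^{n d_v \times d_{\text{model}}}$.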
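To make the projection, concatenation, and output projection steps concrete, here is a small NumPy sketch of multi-head self-attention for one sequence. The stacked weight layout and the output projection shape `(n * d_v, d_model)` are assumptions chosen so that the matrix products are well defined; the function names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V


def multi_head_attention(X, W_q, W_k, W_v, W_o):
    """Multi-head self-attention, where queries, keys, and values all come from X.

    X:   (seq_len, d_model)
    W_q: (n, d_model, d_k), W_k: (n, d_model, d_k), W_v: (n, d_model, d_v)
    W_o: (n * d_v, d_model)
    """
    num_heads = W_q.shape[0]
    # Each head attends in its own projected subspace.
    heads = [attention(X @ W_q[i], X @ W_k[i], X @ W_v[i])
             for i in range(num_heads)]
    # Concatenate the heads along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_o
```

In a trained model the heads would be computed in parallel with batched tensor operations; the explicit loop is used here only for readability.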
Position-wise feed-forward networks are composed of two linear transformations and a nonlinear ReLU activation function:
$$\mathrm{FFN}(x) = W_2 \cdot \mathrm{ReLU}(W_1 x + b_1) + b_2$$
We utilize residual connections and layer normalization to connect the input layer, multi-head attention mechanisms, and position-wise feed-forward networks:
$$\mathrm{LN}(x) = \sigma(x + F(x)),$$
where $F(x)$ represents the output of a sub-layer, $\sigma$ represents layer normalization, and $\mathrm{LN}(x)$ represents the output of the layer normalization.
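The feed-forward network and the residual connection with layer normalization can be sketched as follows. This is a simplified example: the layer normalization omits the learned gain and bias used in practice, the code uses a row-vector convention (`x @ W`) rather than the column-vector form of the formula, and the hidden size `d_ff` is a hypothetical parameter.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: linear, ReLU, linear.

    x: (seq_len, d_model), W1: (d_model, d_ff), W2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    return hidden @ W2 + b2


def sublayer(x, F):
    """Residual connection around sub-layer F, followed by layer normalization."""
    return layer_norm(x + F(x))
```

For example, `sublayer(x, lambda t: ffn(t, W1, b1, W2, b2))` wraps the feed-forward network; the same wrapper would be applied around the multi-head attention sub-layer.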

3.3. Private Encoder

The private encoder is composed of a task-specific layer and a task recognition layer. The task-specific layer extracts sentiment features separately for each task. It therefore contains multiple multi-scale CNN layers, one designed for each task. The task recognition layer is used to learn task-recognized features. The overall structure of the private encoder is shown in Figure 3.
From Figure 3, the multi-scale CNN is composed of multiple convolution layers. Each convolution layer is composed of multiple convolution kernels of different sizes that are used to extract text features of different scales in the shared feature space.
Let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$. A convolution operation applies a filter $w_h \in \mathbb{R}^{hk}$ to a window of $h$ words to generate a new feature, where $k$ is the dimension of each word representation. For example, a feature $c_i$ is generated from the window of words $x_{i:i+h-1}$ by
$$c_i = f(w_h \cdot x_{i:i+h-1} + b),$$
where $b$ is a bias term and $f$ is the ReLU activation function.
We apply the convolution filter to all possible windows of words $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$. A feature map can be generated by:
$$c^{h} = [c_1, c_2, \ldots, c_{n-h+1}],$$
where $c^{h} \in \mathbb{R}^{n-h+1}$. We apply the max-pooling operation [3] to further process the feature map $c^{h}$, taking its maximum value as the feature:
$$\hat{c}^{h} = \max\{c^{h}\}.$$
Multiple features $\hat{c}^{h}$ for different window sizes $h$ are extracted by convolution filters of different sizes, which represent token information of different lengths. The final feature $\hat{c}$ is the concatenation of the features $\hat{c}^{h}$ extracted by convolution kernels of different sizes.
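The convolution, max-pooling, and concatenation steps can be sketched in NumPy as below. The example uses one filter per window size for brevity, whereas a real layer would typically hold many filters of each size; the function names and the `(h, k)` filter layout are assumptions for illustration.

```python
import numpy as np


def conv_max_feature(X, w, b):
    """Slide one filter over the sequence, apply ReLU, and max-pool.

    X: (n, k) matrix of n token representations of dimension k.
    w: (h, k) filter covering a window of h tokens.
    b: scalar bias.
    """
    n, h = X.shape[0], w.shape[0]
    # One feature c_i per window x_{i:i+h-1}; there are n - h + 1 windows.
    c = np.array([np.sum(w * X[i:i + h]) + b for i in range(n - h + 1)])
    c = np.maximum(0.0, c)   # ReLU activation
    return c.max()           # max-pooling over the feature map


def multi_scale_features(X, filters):
    """Concatenate the pooled features of filters with different window sizes."""
    return np.array([conv_max_feature(X, w, b) for w, b in filters])
```

Each entry of `filters` is a `(w, b)` pair, so passing filters with window sizes 3, 4, and 5 yields a three-dimensional feature vector, one pooled value per scale.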

3.4. The Task Recognition Mechanism

Inspired by adversarial training [32], we propose a task recognition mechanism that uses the three encoders to learn the different features between each task while performing sentiment classification.
In the training process, for a text dataset containing $N$ samples $\{x_i, y_i\}$, we utilize the cross-entropy function as the loss function. The cross-entropy between the true and the predicted distributions is computed over all the tasks, and the model is optimized to minimize it.
$$L(\hat{y}, y) = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_i^{j} \log(\hat{y}_i^{j}),$$
where $y_i^{j}$ is the ground-truth label, $\hat{y}_i^{j}$ is the predicted probability, and $C$ is the number of classes.
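A minimal NumPy sketch of this loss is given below, written as the standard cross-entropy with its conventional leading minus sign and summed (not averaged) over samples. The clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy summed over N samples and C classes.

    y_true: (N, C) one-hot ground-truth labels.
    y_pred: (N, C) predicted class probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))
```

With one-hot labels, only the log-probability assigned to the correct class of each sample contributes to the sum.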
Task Discriminator. The task discriminator is used to map the shared representation of sentences into a probability distribution, estimating the probabilities of the original task for the encoded sentences.
During the task recognition training process, a separate multi-scale CNN layer is designed for each task. The multi-scale CNN layers have independent parameters, so the interference between different tasks can be relieved. Suppose that the input sample belongs to task $k$ and that the corresponding multi-scale CNN is $\mathrm{MCNN}^{(k)}$. The output is:
$$\hat{y}^{(k)} = \mathrm{MCNN}^{(k)}(x^{(k)}),$$
where $x^{(k)}$ is a sample of task $k$ and $\hat{y}^{(k)}$ is the vector of predicted probabilities for task $k$. For the data of multiple tasks, we calculate the weighted sum of the loss for each task.
$$L_{task} = \sum_{k=1}^{K} \alpha_k L(\hat{y}^{(k)}, y^{(k)}),$$
where $\alpha_k$ is the weight for task $k$ and $K$ is the number of tasks.
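The weighted sum over tasks is simple to express in code. In this sketch the per-task losses are assumed to have been computed already (for example with a cross-entropy function), and the function name is hypothetical.

```python
def weighted_task_loss(task_losses, alphas):
    """Weighted sum of per-task losses: sum over k of alpha_k * L_k."""
    if len(task_losses) != len(alphas):
        raise ValueError("need exactly one weight per task")
    return sum(alpha * loss for alpha, loss in zip(alphas, task_losses))
```

Setting every weight to the same value treats all tasks equally, while unequal weights let some tasks contribute more to the gradient of the shared encoder.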
A task recognition training process is designed to learn different features from among tasks and influence the representation in the shared feature space by backpropagation. The schematic diagram of task recognition training is shown in Figure 4.
Recognition Loss. Unlike most existing multi-task learning algorithms, we add an extra recognition loss $L_{rec}$ to add task-recognized features to the shared feature space. The recognition loss is used to train the model to produce task-recognized features such that a classifier can reliably predict the task based on these features. The original loss of the task recognition training process is limited since it can only be used in binary situations. To overcome this, we extend it to a multi-class form, which allows our model to be trained together with multiple tasks:
$$L_{rec} = -\frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{N} \sum_{j=1}^{C} p_i^{j} \log(\hat{p}_i^{j})$$
where $K$ is the total number of tasks, $N$ is the total number of samples, and $C$ is the number of samples for task $i$. For each $i$, there are samples $j \in (1, C)$. $p_i^{j}$ is the task label, that is, the task that sample $j$ belongs to, and $\hat{p}_i^{j}$ is the predicted probability of $p_i^{j}$. Note that $L_{rec}$ requires only the input sentence $x$ and does not require the corresponding label $y$. The final loss function of the model can be written as:
$$L = L_{task} + \lambda L_{rec},$$
where $\lambda$ is a constant coefficient.
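One possible reading of the recognition loss and the combined objective is sketched below: a multi-class cross-entropy between one-hot task labels and the predicted task distribution, summed over all samples and divided by the number of tasks. The function names, the array layout, and this interpretation of the indices are assumptions; the default `lam=0.5` mirrors the λ value reported in the experimental settings.

```python
import numpy as np


def recognition_loss(task_onehot, task_probs, num_tasks, eps=1e-12):
    """Multi-class task recognition loss, averaged over the number of tasks.

    task_onehot: (num_samples, num_tasks) one-hot task labels.
    task_probs:  (num_samples, num_tasks) predicted task probabilities.
    Only the task identity of each sample is needed, not its sentiment label.
    """
    task_probs = np.clip(task_probs, eps, 1.0)   # guard against log(0)
    return -np.sum(task_onehot * np.log(task_probs)) / num_tasks


def total_loss(l_task, l_rec, lam=0.5):
    """Final objective: task loss plus lambda times the recognition loss."""
    return l_task + lam * l_rec
```

Because the recognition loss is added rather than subtracted, minimizing the total loss encourages the shared features to stay distinguishable by task.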

4. Experimental Process and Results

4.1. Dataset and Metrics

As shown in Table 1, the first dataset used in our experiments (dataset I) contains 16 different datasets from several popular review corpora, including books, electronics, DVD, kitchen, apparel, camera, health, music, toys, video, baby, magazines, software, sports, IMDB, and MR. The first 14 datasets are product reviews, which were collected by Blitzer et al. [33]. The remaining two datasets are movie reviews, which are from the IMDB dataset [34] and the MR dataset [18]. There are about 2000 reviews for each commodity, for a total of about 32,000 reviews. The goal is to classify a review as either positive or negative. The data of each task are partitioned randomly into a training set, a validation set, and a test set in the proportions 70%, 20%, and 10%, respectively. The detailed statistics of all the datasets are listed in Table 2.
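A random 70%/20%/10% partition of one task's reviews could be produced as in the sketch below. The function name, the fixed seed, and the use of rounding to determine the split sizes are assumptions for illustration, not details taken from the released code.

```python
import random


def split_dataset(items, train=0.7, valid=0.2, seed=0):
    """Randomly partition items into training, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded shuffle for reproducibility
    n_train = round(len(items) * train)
    n_valid = round(len(items) * valid)
    train_set = items[:n_train]
    valid_set = items[n_train:n_train + n_valid]
    test_set = items[n_train + n_valid:]  # whatever remains (about 10%)
    return train_set, valid_set, test_set
```

For a task with 2000 reviews this yields 1400 training, 400 validation, and 200 test samples, with every review assigned to exactly one set.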
In addition, we collected four different types of commodity review datasets of daily necessities, literature, entertainment, and media from the raw data provided by Blitzer et al. [33] and formed dataset II. Each item in dataset II has more entries compared to dataset I. We also divided the training set, validation set, and test set for dataset II, and ensured that the number of positive and negative samples in each set did not differ much. Instances and statistics of dataset II are shown in Table 3 and Table 4.
In the experiment, we use the same evaluation criteria for each commodity review data set and each method, which are accuracy and F1-score.
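For reference, the two metrics can be computed for binary sentiment labels as follows. This sketch assumes that the F1-score is the binary F1 of the positive class, which the text does not specify, and the function name is hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two equal-length label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1
```

Accuracy counts all correct predictions, whereas F1 balances precision and recall on the positive class, so the two can diverge when the classes are unbalanced.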

4.2. Compared with Other Sentiment Classification Methods

In order to verify the effectiveness of our proposed MTL-REC sentiment classification model, we select seven existing sentiment classification models for comparative study, including CNN [16], LSTM [35], bidirectional LSTM (Bi-LSTM) [36], LSTM with Attention (LSTM_Att) [17], MTL-CNN [7], MTL-GRU [37], MTL-ASP [12]. The initial hyper-parameter settings for all deep learning models include: number of hidden layers: 3; hidden layer size: 64; convolutional kernel sizes of the three hidden layers: 3, 4, and 5, respectively (only for CNN); optimizer: Ranger; learning rate: 0.2; dropout rate 0.5; epoch: 5; λ = 0.5. The source code of all models is available at: http://www.github.com/zhang1546/Multi-Task-Learning-for-Sentiment-Analysis.git (accessed on 12 May 2021).
The experimental results are shown in Table 5 and Table 6.
Table 5 shows the accuracy of 16 sentiment classification tasks. Table 6 shows the accuracy of four sentiment classification tasks. The Avg. column shows the average accuracy of the previous four single models. The highest accuracy rates are bolded in Table 5 and Table 6. From Table 5, we can see that multi-task learning models work better than single-task models in most tasks. From Table 6, we can see that our proposed MTL-REC model outperforms all compared existing methods in all the cases.
The classification accuracy improvements are visualized in Figure 5, which shows that the proposed method improves the classification accuracy over all compared methods by 2% to 7%. This significant improvement is mainly achieved by the hard-sharing mechanism and the task recognition training process. The task sharing layer reduces the interference between multiple tasks. Table 5 also shows that the CNN extracts text features similar to those of the GRU and LSTM encoders and takes less time. In Table 6, the average accuracy of the multi-task learning model is almost the same as that of the single-task learning model.
Table 7 shows a statistical test over the results shown in Table 5. The difference between the proposed method and each compared method is evaluated using the Wilcoxon signed-rank test. The p-values show that the proposed method is significantly different from the compared methods. Table 8 shows the overall time and memory used by different methods on datasets I and II. The proposed MTL-REC encoder improves the sentiment classification performance significantly, but requires more running time and memory. The time complexity of sentiment analysis is usually not the main concern, since the feature extraction part can always be performed offline.

4.3. Model Self-Comparison

To demonstrate the effectiveness of the proposed method, a comparative experiment is conducted. Table 9 and Table 10 reflect the fact that the BERT and task recognition training process are helpful in sentiment classification tasks.
In Table 9 and Table 10, the highest accuracy rates and F1 scores are highlighted using bold font. According to Table 9 and Table 10, the sentiment classification performance is further improved with BERT and the proposed task recognition mechanism.

5. Conclusions

In this paper, we propose a multi-task learning framework for sentiment classification with a novel task recognition mechanism. We introduce the pre-trained BERT as our shared encoder to further improve the performance of the shared encoder. In addition, we propose a task recognition training process, which enhances the shared feature space to obtain more task-recognized features. We designed a series of experiments to validate our proposed method. The experimental results show that the sentiment classification results of our proposed model are superior to those of existing state-of-the-art methods. Both semantic features and task-recognized features are extracted, enhancing the overall classification performance.
We note that introducing the pre-trained BERT model reduces the efficiency of the algorithm and leads to longer computation times. In return, the proposed method shows a significant improvement in the accuracy of sentiment classification.
In future work, we will improve the shared encoder to reduce the time complexity of the proposed multi-task learning algorithm. In addition, the proposed method will be tested on more challenging datasets, such as unbalanced datasets, noisy datasets, and datasets in different languages.

Author Contributions

Conceptualization, K.Y.; methodology, K.Y.; software, J.Z.; validation, K.Y.; formal analysis, K.Y.; investigation, Y.M.; resources, Y.M.; data curation, Y.M.; writing—original draft preparation, K.Y.; writing—review and editing, K.Y.; visualization, J.Z.; supervision, K.Y. and Y.M.; project administration, K.Y. and Y.M.; funding acquisition, K.Y. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. LY19F020016 (K.Y.), in part by the National Natural Science Foundation of China under Grant 61972156, and Program for Innovative Research Team in Science and Technology in Fujian Province University (Y.M.).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and data used in this study are freely available at an open-source version control website: http://www.github.com/zhang1546/Multi-Task-Learning-for-Sentiment-Analysis.git (accessed on 12 May 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gómez-Adorno, H.; Fuentes-Alba, R.; Markov, I.; Sidorov, G.; Gelbukh, A. A convolutional neural network approach for gender and language variety identification. J. Intell. Fuzzy Syst. 2019, 36, 4845–4855.
  2. Dejun, Z.; Mingbo, H.; Lu, Z.; Fei, H.; Fazhi, H.; Zhigang, T.; Yafeng, R. Attention Pooling-Based Bidirectional Gated Recurrent Units Model for Sentimental Classification. Int. J. Comput. Intell. Syst. 2019, 12, 723–732.
  3. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537.
  4. Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-Stitch Networks for Multi-task Learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
  5. Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, New York, NY, USA, 9–15 July 2016; AAAI Press: Palo Alto, CA, USA; pp. 2873–2879.
  6. Ruder, S.; Bingel, J.; Augenstein, I.; Søgaard, A. Latent Multi-Task Architecture Learning. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 4822–4829.
  7. Collobert, R.; Weston, J. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 160–167.
  8. Subramanian, S.; Trischler, A.; Bengio, Y.; Pal, C.J. Learning General Purpose Distributed Sentence Representations via Large Scale Multi-Task Learning. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  9. Liu, X.; He, P.; Chen, W.; Gao, J. Multi-Task Deep Neural Networks for Natural Language Understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4487–4496.
  10. Bengio, Y.; Ducharme, R.; Vincent, P.; Janvin, C. A Neural Probabilistic Language Model. J. Mach. Learn. Res. 2003, 3, 1137–1155.
  11. Bing, L. Sentiment Analysis and Opinion Mining; Morgan & Claypool: San Rafael, CA, USA, 2012.
  12. Liu, P.; Qiu, X.; Huang, X. Adversarial Multi-task Learning for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2017; pp. 1–10.
  13. Sun, S.; Luo, C.; Chen, J. A review of natural language processing techniques for opinion mining systems. Inf. Fusion 2017, 36, 10–25.
  14. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, CA, USA, 5–10 December 2013; Curran Associates Inc.: Red Hook, NY, USA, 2013; Volume 2, pp. 3111–3119.
  15. Nyberg, K.; Raiko, T.; Tiinanen, T.; Hyvönen, E. Document Classification Utilising Ontologies and Relations between Documents. In Proceedings of the Eighth Workshop on Mining and Learning with Graphs, Washington, DC, USA, 5 August 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 86–93.
  16. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2014; pp. 1746–1751.
  17. Yang, Z.; Yang, D.; Dyer, C.; He, X.; Smola, A.; Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1480–1489.
  18. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; Association for Computational Linguistics: Stroudsburg, PA, USA, 2005; pp. 115–124.
  19. Yanmei, L.; Yuda, C. Research on Chinese Micro-Blog Sentiment Analysis Based on Deep Learning. 2015 8th Int. Symp. Comput. Intell. Des. 2015, 1, 358–361.
  20. Graves, A.; Jaitly, N.; Mohamed, A.-R. Hybrid speech recognition with Deep Bidirectional LSTM. In Proceedings of the 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, 8–13 December 2013; pp. 273–278.
  21. Wen, S.; Wei, H.; Yang, Y.; Guo, Z.; Zeng, Z.; Huang, T.; Chen, Y. Memristive LSTM Network for Sentiment Analysis. IEEE Trans. Syst. Man, Cybern. Syst. 2021, 51, 1794–1804.
  22. Zhang, S.; Xu, X.; Pang, Y.; Han, J. Multi-layer Attention Based CNN for Target-Dependent Sentiment Classification. Neural Process. Lett. 2020, 51, 2089–2103.
  23. Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75.
  24. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 2016, 17, 1–35.
  25. Zhang, Z.; Luo, P.; Loy, C.C.; Tang, X. Facial Landmark Detection by Deep Multi-Task Learning. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 94–108.
  26. Daumé, H. Bayesian Multitask Learning with Latent Hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 18–21 June 2009; AUAI Press: Arlington, VA, USA, 2009; pp. 135–142.
  27. Sun, T.; Shao, Y.; Li, X.; Liu, P.; Yan, H.; Qiu, X.; Huang, X. Learning Sparse Sharing Architectures for Multiple Tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Association for the Advancement of Artificial Intelligence (AAAI): Palo Alto, CA, USA, 2020; Volume 34, pp. 8936–8943.
  28. Liu, P.; Qiu, X.; Huang, X. Deep Multi-Task Learning with Shared Memory for Text Classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–4 November 2016; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2016; pp. 118–127.
  29. Hessel, M.; Soyer, H.; Espeholt, L.; Czarnecki, W.; Schmitt, S.; Van Hasselt, H. Multi-Task Deep Reinforcement Learning with PopArt. In Proceedings of the 2019 AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Association for the Advancement of Artificial Intelligence (AAAI): Palo Alto, CA, USA, 2019; Volume 33, pp. 3796–3803.
  30. Liu, X.; Gao, J.; He, X.; Deng, L.; Duh, K.; Wang, Y.-Y. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2015; pp. 912–921.
  31. Li, J.; Monroe, W.; Shi, T.; Jean, S.; Ritter, A.; Jurafsky, D. Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics (ACL): Stroudsburg, PA, USA, 2017; pp. 2157–2169.
  32. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Bangkok, Thailand, 18–22 November 2020; MIT Press: Cambridge, MA, USA, 2014; Volume 2, pp. 2672–2680.
  33. Blitzer, J.; Dredze, M.; Pereira, F. Biographies, Bollywood, Boom-Boxes and Blenders: Domain Adaptation for Sentiment Classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, 25–27 June 2007; Association for Computational Linguistics: Stroudsburg, PA, USA, 2007; pp. 440–447.
  34. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Stroudsburg, PA, USA, 2011; pp. 142–150.
  35. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  36. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; Institute of Electrical and Electronics Engineers (IEEE): Piscataway, NJ, USA, 2013; pp. 6645–6649.
  37. Lu, G.; Gan, J.; Yin, J.; Luo, Z.; Li, B.; Zhao, X. Multi-task learning using a hybrid representation for text classification. Neural Comput. Appl. 2020, 32, 6467–6480.
Figure 1. The overall framework of the proposed MTL-REC model.
Figure 2. Structure of the shared encoder.
Figure 3. Structure of the private encoder.
Figure 4. The task recognition training process.
Figure 5. Visualization of the maximum, minimum, and averaged classification accuracy of all compared methods.

Table 1. Instances of the testing dataset I.

| Commodity Type | Example | Label |
|---|---|---|
| Books | this is a resource used by all nps i have talked to. great addition to your library. | 1 |
| | it was a mistake to buy it. only few pages were interestin | 0 |
| Electronics | great product but is only $ 30 at iriver.com’s stor | 1 |
| | i dont like this mouse, i brought, and never work, its useles | 0 |
| DVD | an awesome film with some suspense and raunchiness all rolled in to one | 1 |
| | i love pablo’s act on comedy central. this one does n’t even touch it | 0 |
| Kitchen | it is very light and worm. i love it. definitely worth the price! | 1 |
| | for the price, you get what you pay for. they are not the best quality | 0 |
| Apparel | recipient was very satisfied with this blanket as pb are his initials | 1 |
| | a red star !?!? i bet this wo n’t sell well in eastern europe. | 0 |
| Camera | everything was excellent. the digital camera, the delivery. thank you a lot !!!! | 1 |
| | have had it for a few weeks and glad i brought it great procuc | 0 |
| Health | great tasting bar. nice and soft make it easy to eat | 1 |
| | it does n’t get hot enough, nor does it stay hot for more than 10 min | 0 |
| Music | i just love lynch mixed with dooms production. it is what real is | 1 |
| | this cd isnt real good if you like compilations than get the ruff ryders c | 0 |
| Toys | these make meals a lot more fun for children... i know my son loves them | 1 |
| | fisher price is selling the same item for only $ 33. $ 139.99 has to be a mistake | 0 |
| Video | this is an excellent documentary of shangri-la and its elusive transcendental nature | 1 |
| | i love norm macdonald and this is the dumbest movie of all tim | 0 |
| Baby | great product—i heard from other mommies that this was the pump to get; i agree | 1 |
| | rent a hospital grade medalia pump. you wont be sorr | 0 |
| Magazines | the magazine was shipped in a timely manner, i would use this vendor again | 1 |
| | i still have not received this magazine, what is taking so long !! | 0 |
| Software | my husband is using the rosetta stone spanish program and loves it | 1 |
| | the “bad serial number” routine as the first reviewer. | 0 |
| Sports | excellent quality; much easier to put on than the cap i used before | 1 |
| | this pillow is too small and it is not comfortable at all | 0 |
| IMDB | this is a truly magnificent and heartwrenching film !!! | 1 |
| | argh! this film hurts my head. and not in a good way. | 0 |
| MR | it’s a feel-good movie about which you can actually feel good. | 1 |
| | a decidedly mixed bag. | 0 |

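Table 2. Dataset I statistics.

| Commodity Type | Training Set: Positive | Training Set: Negative | Validation Set: Positive | Validation Set: Negative | Test Set: Positive | Test Set: Negative | Total |
|---|---|---|---|---|---|---|---|
| Books | 798 | 802 | 105 | 95 | 97 | 103 | 2000 |
| Electronics | 705 | 693 | 97 | 103 | 198 | 202 | 1998 |
| DVD | 802 | 798 | 95 | 105 | 102 | 98 | 2000 |
| Kitchen | 706 | 694 | 102 | 98 | 192 | 208 | 2000 |
| Apparel | 690 | 710 | 95 | 105 | 215 | 185 | 2000 |
| Camera | 706 | 692 | 99 | 100 | 194 | 206 | 1997 |
| Health | 812 | 788 | 98 | 102 | 90 | 110 | 2000 |
| Music | 698 | 702 | 103 | 97 | 199 | 201 | 2000 |
| Toys | 794 | 806 | 99 | 101 | 107 | 93 | 2000 |
| Video | 694 | 706 | 93 | 107 | 213 | 187 | 2000 |
| Baby | 800 | 700 | 103 | 97 | 97 | 103 | 1900 |
| Magazines | 682 | 688 | 101 | 99 | 217 | 183 | 1970 |
| Software | 788 | 727 | 102 | 98 | 110 | 90 | 1915 |
| Sports | 712 | 687 | 98 | 102 | 190 | 210 | 1999 |
| IMDB | 795 | 805 | 98 | 102 | 101 | 99 | 2000 |
| MR | 778 | 822 | 102 | 98 | 106 | 94 | 2000 |
| Total | 11,960 | 11,820 | 1590 | 1609 | 2428 | 2372 | 31,779 |
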
Table 3. Instances of the testing dataset II.

| Commodity Type | Example | Label |
|---|---|---|
| Daily Necessities | great product—i heard from other mommies that this was the pump to get; i agree | 1 |
| | rent a hospital grade medalia pump. you wont be sorr | 0 |
| Literature | an excellent book for anyone that barbecues | 1 |
| | imposible to do so with no item received | 0 |
| Entertainment | thank you, i like this program and it does what i need it to do | 1 |
| | i would not buy it ! hard to use. my machine runs slower since the install. | 0 |
| Media | i received “the piano” promptly, and in pristine, excellent condition. | 1 |
| | if this is n’t worst dead album then in the dark is | 0 |

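Table 4. Dataset II statistics.

| Commodity Type | Training Set: Positive | Training Set: Negative | Validation Set: Positive | Validation Set: Negative | Test Set: Positive | Test Set: Negative | Total |
|---|---|---|---|---|---|---|---|
| Daily Necessities | 1609 | 1486 | 199 | 199 | 187 | 213 | 3893 |
| Literature | 2257 | 2305 | 308 | 292 | 420 | 380 | 5962 |
| Entertainment | 2285 | 2219 | 299 | 301 | 407 | 393 | 5904 |
| Media | 2978 | 3007 | 389 | 411 | 613 | 584 | 7982 |
| Total | 9129 | 9017 | 1195 | 1203 | 1627 | 1570 | 23,741 |
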
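Table 5. Classification accuracy of single-task and multi-task models on multi-task dataset I. CNN, LSTM, Bi-LSTM, and LSTM_Att are single-task models, and Avg. is their mean; MTL-CNN, MTL-GRU, MTL-ASP, and Proposed are multi-task models.

| Task | CNN | LSTM | Bi-LSTM | LSTM_Att | Avg. | MTL-CNN | MTL-GRU | MTL-ASP | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Books | 0.87 | 0.865 | 0.9 | 0.9 | 0.884 | 0.89 | 0.88 | 0.84 | 0.915 |
| Electronics | 0.825 | 0.84 | 0.848 | 0.852 | 0.841 | 0.862 | 0.842 | 0.868 | 0.885 |
| DVD | 0.8 | 0.835 | 0.87 | 0.855 | 0.840 | 0.85 | 0.82 | 0.855 | 0.875 |
| Kitchen | 0.848 | 0.878 | 0.85 | 0.855 | 0.858 | 0.86 | 0.872 | 0.862 | 0.865 |
| Apparel | 0.875 | 0.865 | 0.872 | 0.86 | 0.868 | 0.855 | 0.872 | 0.87 | 0.895 |
| Camera | 0.855 | 0.878 | 0.865 | 0.85 | 0.862 | 0.88 | 0.892 | 0.892 | 0.888 |
| Health | 0.845 | 0.855 | 0.865 | 0.83 | 0.849 | 0.885 | 0.875 | 0.882 | 0.865 |
| Music | 0.825 | 0.838 | 0.825 | 0.812 | 0.825 | 0.842 | 0.83 | 0.825 | 0.845 |
| Toys | 0.845 | 0.875 | 0.89 | 0.88 | 0.873 | 0.855 | 0.865 | 0.88 | 0.875 |
| Video | 0.872 | 0.88 | 0.882 | 0.875 | 0.877 | 0.878 | 0.885 | 0.845 | 0.91 |
| Baby | 0.885 | 0.875 | 0.875 | 0.855 | 0.873 | 0.89 | 0.9 | 0.882 | 0.865 |
| Magazines | 0.852 | 0.85 | 0.865 | 0.855 | 0.856 | 0.882 | 0.9 | 0.922 | 0.9 |
| Software | 0.89 | 0.905 | 0.885 | 0.885 | 0.891 | 0.905 | 0.895 | 0.872 | 0.91 |
| Sports | 0.858 | 0.858 | 0.85 | 0.84 | 0.852 | 0.875 | 0.862 | 0.857 | 0.908 |
| IMDB | 0.84 | 0.86 | 0.89 | 0.875 | 0.866 | 0.865 | 0.855 | 0.855 | 0.925 |
| MR | 0.73 | 0.74 | 0.715 | 0.755 | 0.735 | 0.72 | 0.7 | 0.767 | 0.79 |
| AVG | 0.845 | 0.856 | 0.859 | 0.852 | 0.853 | 0.862 | 0.859 | 0.861 | 0.882 |
| STD | 0.0373 | 0.0348 | 0.0416 | 0.0326 | 0.0347 | 0.0402 | 0.0471 | 0.0327 | 0.0321 |
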
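
The AVG and STD rows of Table 5 can be checked directly from the per-task values. The sketch below does this for the Proposed column only. It assumes that STD is a population standard deviation (dividing by the number of tasks); the source does not state this, but it is the variant that matches the reported figure. The variable names are illustrative.

```python
from statistics import mean, pstdev

# Accuracy of the proposed model on the 16 tasks of dataset I
# (values copied from the Proposed column of Table 5, Books through MR).
proposed_acc = [
    0.915, 0.885, 0.875, 0.865, 0.895, 0.888, 0.865, 0.845,
    0.875, 0.910, 0.865, 0.900, 0.910, 0.908, 0.925, 0.790,
]

avg = mean(proposed_acc)
# Assumption: population standard deviation (divide by N, not N - 1).
std = pstdev(proposed_acc)

print(f"AVG = {avg:.3f}, STD = {std:.4f}")  # AVG = 0.882, STD = 0.0321
```

Using the sample standard deviation (`statistics.stdev`) instead would give a slightly larger value that no longer matches the table.
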
Table 6. Classification accuracy of single-task and multi-task models on multi-task dataset II. CNN, LSTM, Bi-LSTM, and LSTM_Att are single-task models, and Avg. is their mean; MTL-CNN, MTL-GRU, MTL-ASP, and Proposed are multi-task models.

| Task | CNN | LSTM | Bi-LSTM | LSTM_Att | Avg. | MTL-CNN | MTL-GRU | MTL-ASP | Proposed |
|---|---|---|---|---|---|---|---|---|---|
| Daily Necessities | 0.850 | 0.850 | 0.865 | 0.852 | 0.854 | 0.855 | 0.848 | 0.865 | 0.878 |
| Literature | 0.860 | 0.834 | 0.845 | 0.831 | 0.843 | 0.851 | 0.829 | 0.850 | 0.865 |
| Entertainment | 0.870 | 0.861 | 0.851 | 0.878 | 0.865 | 0.874 | 0.869 | 0.860 | 0.898 |
| Media | 0.845 | 0.854 | 0.865 | 0.863 | 0.857 | 0.845 | 0.866 | 0.858 | 0.880 |
| AVG | 0.856 | 0.850 | 0.857 | 0.856 | 0.855 | 0.856 | 0.853 | 0.858 | 0.880 |
| STD | 0.0108 | 0.00991 | 0.00876 | 0.0171 | 0.00805 | 0.0108 | 0.0160 | 0.00540 | 0.0118 |

Table 7. Results of Levene’s test and the Wilcoxon signed-rank test comparing the proposed method with each of the other methods.

| Comparison | Levene’s Test: p-Value | Levene’s Test: Evaluation | Wilcoxon Signed-Rank Test: p-Value | Wilcoxon Signed-Rank Test: Evaluation |
|---|---|---|---|---|
| Proposed vs. CNN | 0.93 | homogeneity of variance | 1 | significant difference |
| Proposed vs. LSTM | 0.781 | homogeneity of variance | 0.998 | significant difference |
| Proposed vs. Bi-LSTM | 0.977 | homogeneity of variance | 0.999 | significant difference |
| Proposed vs. LSTM_Att | 0.717 | homogeneity of variance | 1 | significant difference |
| Proposed vs. MTL-CNN | 0.952 | homogeneity of variance | 0.997 | significant difference |
| Proposed vs. MTL-GRU | 0.723 | homogeneity of variance | 0.994 | significant difference |
| Proposed vs. MTL-ASP | 0.858 | homogeneity of variance | 0.991 | significant difference |

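
To illustrate what the Wilcoxon signed-rank test in Table 7 measures, the sketch below computes its rank sums for one pair of methods. It assumes the paired samples are the 16 per-task accuracies of the Proposed and CNN columns of Table 5; this section does not state which samples the authors actually used. The helper `signed_rank_sums` is a hypothetical name, and the sketch stops at the test statistic without deriving a p-value.

```python
def signed_rank_sums(a, b, ndigits=3):
    """Return (W_plus, W_minus), the Wilcoxon signed-rank sums for paired samples."""
    # Round the differences so that equal table entries give exact ties.
    diffs = [round(x - y, ndigits) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences are discarded
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Per-task accuracies on dataset I, Books through MR (copied from Table 5).
proposed = [0.915, 0.885, 0.875, 0.865, 0.895, 0.888, 0.865, 0.845,
            0.875, 0.910, 0.865, 0.900, 0.910, 0.908, 0.925, 0.790]
cnn = [0.870, 0.825, 0.800, 0.848, 0.875, 0.855, 0.845, 0.825,
       0.845, 0.872, 0.885, 0.852, 0.890, 0.858, 0.840, 0.730]

w_plus, w_minus = signed_rank_sums(proposed, cnn)
print(w_plus, w_minus)  # 132.0 4.0
```

Under these assumptions the proposed model is ahead on 15 of the 16 tasks, so nearly all of the rank mass (132 of 136) falls on the positive side. Converting the statistic into a p-value requires the exact null distribution or a normal approximation, which a statistics library such as SciPy (`scipy.stats.wilcoxon`) provides.
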
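Table 8. The total time and memory used by different methods. CNN, LSTM, Bi-LSTM, and LSTM_Att are single-task models; MTL-CNN, MTL-GRU, MTL-ASP, and Proposed are multi-task models.

| Metric | CNN | LSTM | Bi-LSTM | LSTM_Att | MTL-CNN | MTL-GRU | MTL-ASP | Proposed |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 30.93 | 34.32 | 57.29 | 58.83 | 30.80 | 58.75 | 188.28 | 3282 |
| Memory (MB) | 577 | 647 | 647 | 647 | 577 | 647 | 649 | 1145 |
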
Table 9. Performance improvement using BERT and task recognition training on multi-task dataset I.

| Task | Without BERT: Acc | Without BERT: F1 | Without Task Recognition Mechanism: Acc | Without Task Recognition Mechanism: F1 | With BERT: Acc | With BERT: F1 |
|---|---|---|---|---|---|---|
| Books | 0.915 | 0.915 | 0.91 | 0.91 | 0.915 | 0.912 |
| Electronics | 0.832 | 0.819 | 0.855 | 0.844 | 0.885 | 0.876 |
| DVD | 0.84 | 0.845 | 0.855 | 0.853 | 0.875 | 0.876 |
| Kitchen | 0.85 | 0.84 | 0.852 | 0.844 | 0.865 | 0.856 |
| Apparel | 0.865 | 0.87 | 0.892 | 0.9 | 0.895 | 0.902 |
| Camera | 0.878 | 0.873 | 0.88 | 0.881 | 0.888 | 0.888 |
| Health | 0.875 | 0.859 | 0.875 | 0.857 | 0.865 | 0.846 |
| Music | 0.832 | 0.835 | 0.862 | 0.859 | 0.845 | 0.845 |
| Toys | 0.875 | 0.886 | 0.87 | 0.883 | 0.875 | 0.886 |
| Video | 0.888 | 0.894 | 0.905 | 0.911 | 0.91 | 0.914 |
| Baby | 0.9 | 0.895 | 0.855 | 0.854 | 0.865 | 0.862 |
| Magazines | 0.878 | 0.881 | 0.878 | 0.887 | 0.9 | 0.91 |
| Software | 0.915 | 0.922 | 0.915 | 0.922 | 0.91 | 0.916 |
| Sports | 0.872 | 0.862 | 0.885 | 0.875 | 0.908 | 0.901 |
| IMDB | 0.91 | 0.91 | 0.91 | 0.913 | 0.925 | 0.925 |
| MR | 0.755 | 0.749 | 0.81 | 0.812 | 0.79 | 0.788 |
| AVG | 0.867 | 0.866 | 0.876 | 0.875 | 0.882 | 0.881 |
| STD | 0.0390 | 0.0417 | 0.0268 | 0.0300 | 0.0321 | 0.0346 |

Table 10. Performance improvement using BERT and task recognition training on multi-task dataset II.

| Task | Without BERT: Acc | Without BERT: F1 | Without Task Recognition Network: Acc | Without Task Recognition Network: F1 | With BERT: Acc | With BERT: F1 |
|---|---|---|---|---|---|---|
| Daily Necessities | 0.85 | 0.84 | 0.852 | 0.841 | 0.878 | 0.869 |
| Literature | 0.864 | 0.867 | 0.869 | 0.875 | 0.865 | 0.871 |
| Entertainment | 0.854 | 0.854 | 0.884 | 0.884 | 0.898 | 0.899 |
| Media | 0.852 | 0.855 | 0.882 | 0.884 | 0.88 | 0.882 |
| AVG | 0.855 | 0.854 | 0.872 | 0.871 | 0.880 | 0.880 |
| STD | 0.00539 | 0.00957 | 0.0128 | 0.0177 | 0.0118 | 0.0119 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Zhang, J.; Yan, K.; Mo, Y. Multi-Task Learning for Sentiment Analysis with Hard-Sharing and Task Recognition Mechanisms. Information 2021, 12, 207. https://doi.org/10.3390/info12050207

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.