Application of Generative Adversarial Networks and Shapley Algorithm Based on Easy Data Augmentation for Imbalanced Text Data
Abstract
1. Introduction
2. Related Work
2.1. Data Augmentation
2.2. GAN-Based Text Generation
2.3. Shapley Algorithm
3. Methodology
3.1. Stage I: Initial Training Sentences Generated Using the EDA Method
- Synonym replacement (SR): During this process, $n$ randomly chosen words are replaced with one of their synonyms as obtained from the WordNet library; this process is denoted as $\mathrm{SR}$.
- Random insertion (RI): During this process, synonyms of $n$ randomly chosen words, obtained from WordNet, are inserted at random positions; this process is denoted as $\mathrm{RI}$.
- Random swapping (RS): During this process, $n$ pairs of words are randomly swapped while retaining the same parts of speech; this process is denoted as $\mathrm{RS}$.
- Random deletion (RD): $n$ words are randomly deleted; this process is denoted as $\mathrm{RD}$.
- Random mixing (RM): This process is expressed as $(x', y')$, where the initial training sentences are generated by randomly mixing the previous four processes, $x'$ is the generated word sequence, and $y'$ is the category that $x'$ falls under, which is identical to that of the original data. In the EDA formulation, $n$ scales with the sentence length $l$ through the parameter $\alpha$ (i.e., $n = \alpha l$); a minimal sketch of the four operations follows this list.
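The sketch below illustrates the four basic operations, assuming NLTK's WordNet interface; the function names, the handling of words without synonyms, and other details are our illustration, not the authors' implementation.

```python
# Sketch of the four EDA operations, assuming NLTK's WordNet corpus
# (run nltk.download("wordnet") once). Details are illustrative assumptions.
import random
from nltk.corpus import wordnet

def synonyms_of(word):
    """All WordNet synonyms of a word, excluding the word itself."""
    return sorted({l.name().replace("_", " ")
                   for s in wordnet.synsets(word) for l in s.lemmas()} - {word})

def synonym_replacement(words, n):
    """SR: replace n randomly chosen words with one of their synonyms."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if synonyms_of(w)]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(synonyms_of(out[i]))
    return out

def random_insertion(words, n):
    """RI: insert a synonym of a random word at a random position, n times."""
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if synonyms_of(w)]
        if not candidates:
            break
        out.insert(random.randrange(len(out) + 1),
                   random.choice(synonyms_of(random.choice(candidates))))
    return out

def random_swap(words, n):
    """RS: swap two randomly chosen word positions, n times."""
    out = list(words)
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, n):
    """RD: delete n randomly chosen words, always keeping at least one."""
    out = list(words)
    for _ in range(min(n, len(out) - 1)):
        out.pop(random.randrange(len(out)))
    return out
```

For example, with $\alpha = 0.1$ and a 20-word review, each operation changes $n = 2$ words.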
3.2. Stage II: GAN-Based Model Training Using the SentiGAN Model
3.2.1. Generator
3.2.2. Discriminator
3.2.3. Training Procedure
- Step one: The generators $G_i$ ($i = 1, \dots, C$) and the discriminator $D$ are randomly initialized. The training data, together with a random noise vector $z$ drawn from a normal distribution, are used to pretrain the generators; this pretraining is based on maximum likelihood estimation.
- Step two: Each generator $G_i$ generates fake data, yielding $C$ classes of fake data, and the fake data and training data are used to pretrain the discriminator $D$.
- Step three: In every g-step, the generators produce sentences for the $C$ classes. First, a noise vector $z$ is drawn from a normal distribution and passed into the generator $G_i$. Second, the loss is obtained from the discriminator $D$, and $G_i$ is updated to minimize its total loss $\mathcal{L}(G_i)$.
- Step four: In every d-step, the discriminator's model parameters are updated using the sentences generated by $G_i$ together with the training data, which include the original training sentences and the initial training sentences obtained through EDA.
- Step five: Steps two to four are repeated until model convergence is achieved, and the final generated training sentences are output. A schematic sketch of this alternating procedure follows the list.
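The following PyTorch sketch shows the shape of the alternating loop. The architectures, sizes, and the REINFORCE form of the penalty-based generator objective are simplifying assumptions for illustration, not the SentiGAN reference implementation.

```python
# Schematic sketch of the alternating g-step/d-step loop described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, B = 2, 16                                 # classes and batch size (assumed)
VOCAB, EMB, HID, Z_DIM, T = 5000, 64, 128, 100, 20

class Generator(nn.Module):
    """One LSTM generator G_i per class; the noise z sets the initial state."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.init_h = nn.Linear(Z_DIM, HID)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def step(self, tok, state):
        o, state = self.rnn(self.emb(tok), state)
        return self.out(o[:, -1]), state

class Discriminator(nn.Module):
    """C + 1 outputs: the C real classes plus one 'generated' class."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.cls = nn.Linear(HID, C + 1)

    def forward(self, x):
        _, (h, _) = self.rnn(self.emb(x))
        return self.cls(h[-1])

gens = [Generator() for _ in range(C)]
disc = Discriminator()
g_opts = [torch.optim.Adam(g.parameters(), lr=1e-3) for g in gens]
d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

def rollout(g):
    """Sample sentences from generator g, keeping per-token log-probs."""
    z = torch.randn(B, Z_DIM)
    h = torch.tanh(g.init_h(z)).unsqueeze(0)
    state = (h, torch.zeros_like(h))
    tok = torch.zeros(B, 1, dtype=torch.long)   # assumed <bos> token id 0
    toks, logps = [], []
    for _ in range(T):
        logits, state = g.step(tok, state)
        dist = torch.distributions.Categorical(logits=logits)
        nxt = dist.sample()
        toks.append(nxt)
        logps.append(dist.log_prob(nxt))
        tok = nxt.unsqueeze(1)
    return torch.stack(toks, 1), torch.stack(logps, 1).sum(1)

def g_step(i):
    """Step three: REINFORCE update minimizing the penalty 1 - P_D(class i)."""
    seq, logp = rollout(gens[i])
    with torch.no_grad():
        penalty = 1.0 - F.softmax(disc(seq), dim=-1)[:, i]
    loss = (penalty * logp).mean()
    g_opts[i].zero_grad(); loss.backward(); g_opts[i].step()

def d_step(real_x, real_y):
    """Step four: real sentences keep labels 0..C-1; generated ones get C."""
    with torch.no_grad():
        fake = torch.cat([rollout(g)[0] for g in gens])
    x = torch.cat([real_x, fake])
    y = torch.cat([real_y, torch.full((len(fake),), C)])
    loss = F.cross_entropy(disc(x), y)
    d_opt.zero_grad(); loss.backward(); d_opt.step()
```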
3.3. Stage III: GAN-Based Model Evaluation
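The generated sentences are evaluated with Macro BLEU-2 (n-gram overlap with real sentences, measuring quality) and Macro Self-BLEU-2 (overlap among the generated sentences themselves, where a higher value indicates lower diversity), as reported in Section 4. A minimal sketch using NLTK's BLEU implementation is given below; the smoothing choice is an assumption, and "Macro" scores average these values over the $C$ classes.

```python
# Sketch of BLEU-2 / Self-BLEU-2 with NLTK (smoothing choice is an assumption).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

SMOOTH = SmoothingFunction().method1
W2 = (0.5, 0.5)  # uniform 1-/2-gram weights, i.e., BLEU-2

def bleu2(real_sentences, generated_sentences):
    """Quality: average BLEU-2 of generated sentences against real ones."""
    refs = [r.split() for r in real_sentences]
    return sum(sentence_bleu(refs, g.split(), weights=W2,
                             smoothing_function=SMOOTH)
               for g in generated_sentences) / len(generated_sentences)

def self_bleu2(generated_sentences):
    """Diversity: BLEU-2 of each generated sentence against all the others;
    higher values mean the samples are more similar to one another."""
    toks = [g.split() for g in generated_sentences]
    scores = [sentence_bleu(toks[:i] + toks[i + 1:], toks[i], weights=W2,
                            smoothing_function=SMOOTH)
              for i in range(len(toks))]
    return sum(scores) / len(scores)
```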
3.4. Stage IV: Final Training Sentence Generation by Optimal SentiGAN and Shapley Algorithm
- Step one: The generated data are duplicated once, and the labels of the duplicates are flipped. For example, if a generated data point is labeled one (the negative class), the Shapley algorithm flips its duplicate to the positive class.
- Step two: The validation and generated data are encoded with BERT, and the feature vector of the special token ([CLS]) is selected for computing Euclidean distances.
- Step three: The Euclidean distance between each piece of generated data and each piece of validation data is calculated.
- Step four: For each piece of generated data, the Euclidean distances to all validation data are summed, so that each piece of generated data is represented by a single number.
- Step five: The summed distances are sorted from minimum to maximum.
- Step six: The Shapley value is first calculated for the generated sentence with the greatest summed Euclidean distance. If this generated data point and the validation data share the same label, its label-match indicator is one; otherwise, it is zero. Because the number of generated data points $N$ must be taken into account, this indicator is divided by $N$ to give the Shapley value of the farthest point.
- Step seven: For the point with the second-greatest distance (and, recursively, for each closer point at sorted position $j$), the Shapley value builds on that of the previous, farther point at position $j + 1$: the difference between the two label-match indicators is divided by $Q$, multiplied by the smaller of $Q$ and $j$, and divided by $j$, i.e., $s_j = s_{j+1} + \frac{\mathbb{1}[y_j = y_{\mathrm{val}}] - \mathbb{1}[y_{j+1} = y_{\mathrm{val}}]}{Q} \cdot \frac{\min(Q, j)}{j}$, following the KNN-based Shapley formulation of Jia et al. This paper sets $Q = 10$.
- Step eight: Generated data points with negative Shapley values are removed, and the remaining points whose labels match the minority class are selected; this selection enables the data to be used for augmentation. Finally, the selected generated data are combined with the original training data to form the final training sentences for the proposed framework for classifying imbalanced data. A sketch of Steps one to eight appears after this list.
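The sketch below walks through the eight steps on a toy batch. The `bert-base-uncased` checkpoint, the label coding (0 = positive, 1 = negative), and the use of a single validation label after distance summing are assumptions for illustration; the recursion itself follows Jia et al. (2019).

```python
# Sketch of Stage IV under stated assumptions.
import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def cls_features(sentences):
    """Step two: encode sentences with BERT and keep the [CLS] vector."""
    enc = tokenizer(sentences, padding=True, truncation=True,
                    return_tensors="pt")
    with torch.no_grad():
        return bert(**enc).last_hidden_state[:, 0].numpy()

def knn_shapley(dist_sum, gen_labels, val_label, Q=10):
    """Steps five to seven: KNN-based Shapley recursion (Jia et al., 2019)."""
    order = np.argsort(dist_sum)              # step five: nearest -> farthest
    N = len(order)
    match = (gen_labels[order] == val_label).astype(float)
    s = np.zeros(N)
    s[order[-1]] = match[-1] / N              # step six: farthest point
    for j in range(N - 2, -1, -1):            # step seven: move inward
        s[order[j]] = (s[order[j + 1]]
                       + (match[j] - match[j + 1]) / Q
                       * min(Q, j + 1) / (j + 1))
    return s

# Step one: duplicate the generated data and flip the duplicates' labels.
gen_sents  = ["a wonderful film", "boring plot", "great movie", "awful acting"]
gen_labels = np.array([0, 1, 0, 1])           # assumed 0 = positive, 1 = negative
gen_sents, gen_labels = gen_sents * 2, np.concatenate([gen_labels,
                                                       1 - gen_labels])
val_sents, minority = ["dreadful and dull", "truly enjoyable"], 1

# Steps three and four: pairwise Euclidean distances, summed per generated point.
g, v = cls_features(gen_sents), cls_features(val_sents)
dist_sum = np.linalg.norm(g[:, None] - v[None], axis=-1).sum(axis=1)

# Step eight: drop negative Shapley values and keep the minority class.
shap = knn_shapley(dist_sum, gen_labels, val_label=minority)
keep = [sent for sent, sv, y in zip(gen_sents, shap, gen_labels)
        if sv > 0 and y == minority]
```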
3.5. Stage V: Classification Task
4. Experiment
4.1. Data Set
4.2. Experimental Design
4.3. Augmentation Models
- EDA: This model applies EDA to generate final training sentences.
- CatGAN: This model preserves the quality and diversity of generated samples by using the hierarchical evolutionary algorithm for training and by designing evaluation metrics to filter generators.
- SentiGAN: This model incorporates multiclass generators and conducts training through reinforcement learning.
- EDA + CatGAN: This model applies EDA to generate initial training sentences and uses the CatGAN model to generate final training sentences.
- EDA + SentiGAN: This model applies the EDA method to generate initial training sentences and uses the SentiGAN model to generate final training sentences.
- EDA + CatGAN + Shapley: This model is identical to the proposed model (described next) except that it uses a CatGAN instead of a SentiGAN.
- EDA + SentiGAN + Shapley: This model is the proposed model of the present study.
4.4. Results on Sensitivity Analysis of the EDA Method
4.5. Results of the GAN-Based Models
4.6. Results Obtained through the Shapley Algorithm
4.7. Results Obtained through EDA Methods
4.8. Results for Classification Performance of Augmentation Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Abdalla, H.I.; Amer, A.A. On the Integration of Similarity Measures with Machine Learning Models to Enhance Text Classification Performance. Inf. Sci. 2022, in press.
- Li, K.; Yan, D.; Liu, Y.; Zhu, Q. A Network-Based Feature Extraction Model for Imbalanced Text Data. Expert Syst. Appl. 2022, 195, 116600.
- Lu, X.; Chen, M.; Wu, J.; Chang, P. A Novel Ensemble Decision Tree Based on Under-Sampling and Clonal Selection for Web Spam Detection. Pattern Anal. Appl. 2018, 21, 741–754.
- Liu, S.; Zhang, K. Under-Sampling and Feature Selection Algorithms for S2SMLP. IEEE Access 2020, 8, 191803–191814.
- Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 6382–6388.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2 (NIPS'14), Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA; pp. 2672–2680.
- Wang, K.; Wan, X. SentiGAN: Generating Sentimental Texts via Mixture Adversarial Networks. In Proceedings of the IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 4446–4452.
- Liu, Z.; Wang, J.; Liang, Z. CatGAN: Category-Aware Generative Adversarial Networks with Hierarchical Evolutionary Learning for Category Text Generation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8425–8432.
- Liang, W.; Liang, K.H.; Yu, Z. HERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Bangkok, Thailand, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA; Volume 1, pp. 3652–3665.
- Ghorbani, A.; Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2242–2251.
- Wu, J.; Chung, W. Sentiment-Based Masked Language Modeling for Improving Sentence-Level Valence–Arousal Prediction. Appl. Intell. 2022, in press.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186.
- Jin, D.; Jin, Z.; Zhou, J.T.; Szolovits, P. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8018–8025.
- Garg, S.; Ramakrishnan, G. BAE: BERT-Based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 8–12 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 6174–6181.
- Zhao, M.; Zhang, L.; Xu, Y.; Ding, J.; Guan, J.; Zhou, S. EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification. arXiv 2022, arXiv:2204.11205.
- Karimi, A.; Rossi, L.; Prati, A. AEDA: An Easier Data Augmentation Technique for Text Classification. In Findings of the Association for Computational Linguistics: EMNLP 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2748–2754.
- Ren, S.; Zhang, J.; Li, L.; Sun, X.; Zhou, J. Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 9029–9043.
- Kobayashi, S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: Stroudsburg, PA, USA; Volume 2, pp. 452–457.
- Wu, X.; Lv, S.; Zang, L.; Han, J.; Hu, S. Conditional BERT Contextual Augmentation. In Proceedings of the International Conference on Computational Science, Faro, Portugal, 12–14 June 2019; Springer: Cham, Switzerland; pp. 84–95.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9.
- Anaby-Tavor, A.; Carmeli, B.; Goldbraich, E.; Kantor, A.; Kour, G.; Shlomov, S.; Tepper, N.; Zwerdling, N. Do Not Have Enough Data? Deep Learning to the Rescue! In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7383–7390.
- Wu, X.; Gao, C.; Lin, M.; Zang, L.; Wang, Z.; Hu, S. Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Association for Computational Linguistics: Stroudsburg, PA, USA; Volume 2, pp. 871–875.
- Jo, B.C.; Heo, T.S.; Park, Y.; Yoo, Y.; Cho, W.I.; Kim, K. DAGAM: Data Augmentation with Generation and Modification. arXiv 2022, arXiv:2204.02633.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
- Liu, X.; Chen, S.; Song, L.; Woźniak, M.; Liu, S. Self-Attention Negative Feedback Network for Real-Time Image Super-Resolution. J. King Saud Univ. Comput. Inf. Sci. 2021, 34, 6179–6186.
- Liu, S.; He, T.; Li, J.; Li, Y.; Kumar, A. An Effective Learning Evaluation Method Based on Text Data with Real-Time Attribution—A Case Study for Mathematical Class with Students of Junior Middle School in China. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 10, 3474367.
- Yu, L.; Zhang, W.; Wang, J.; Yu, Y. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Guo, J.; Lu, S.; Cai, H.; Zhang, W.; Yu, Y.; Wang, J. Long Text Generation via Adversarial Training with Leaked Information. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. Available online: https://ojs.aaai.org/index.php/AAAI/article/view/11957 (accessed on 1 August 2022).
- Li, Y.; Pan, Q.; Wang, S.; Yang, T.; Cambria, E. A Generative Model for Category Text Generation. Inf. Sci. 2018, 450, 301–315.
- Nie, W.; Narodytska, N.; Patel, A. RelGAN: Relational Generative Adversarial Networks for Text Generation. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
- Zheng, M.; Li, T.; Zhu, R.; Tang, Y.; Tang, M.; Lin, L.; Ma, Z. Conditional Wasserstein Generative Adversarial Network-Gradient Penalty-Based Approach to Alleviating Imbalanced Data Classification. Inf. Sci. 2020, 512, 1009–1023.
- Kumar, I.E.; Venkatasubramanian, S.; Scheidegger, C.; Friedler, S. Problems with Shapley-Value-Based Explanations as Feature Importance Measures. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 5491–5500.
- Jia, R.; Dao, D.; Wang, B.; Hubis, F.A.; Hynes, N.; Gürel, N.M.; Spanos, C.J. Towards Efficient Data Valuation Based on the Shapley Value. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, Okinawa, Japan, 16–18 April 2019; pp. 1167–1176.
- Ancona, M.; Oztireli, C.; Gross, M. Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Value Approximation. In Proceedings of the 2019 International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 272–281.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
- Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 142–150.
| | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Positive | 1600 | 400 | 500 |
| Negative | 800 | 200 | 250 |
| Total | 2400 | 600 | 750 |
| Model | α | Positive Macro BLEU-2 | Positive Macro Self-BLEU-2 | Negative Macro BLEU-2 | Negative Macro Self-BLEU-2 |
|---|---|---|---|---|---|
| EDA + SentiGAN | 0.1 | 0.7398 | 0.9532 | 0.6340 | 0.9420 |
| EDA + SentiGAN | 0.3 | 0.7310 | 0.9664 | 0.6728 | 0.9622 |
| EDA + SentiGAN | 0.5 | 0.6580 | 0.9554 | 0.6492 | 0.9504 |
| Model | Positive Macro BLEU-2 | Positive Macro Self-BLEU-2 | Negative Macro BLEU-2 | Negative Macro Self-BLEU-2 |
|---|---|---|---|---|
| CatGAN | 0.8270 | 0.9806 | 0.7950 | 0.9822 |
| SentiGAN | 0.6640 | 0.9156 | 0.6000 | 0.9374 |
| EDA + CatGAN | 0.8418 | 0.9802 | 0.6303 | 0.8122 |
| EDA + SentiGAN | 0.7398 | 0.9532 | 0.6340 | 0.9420 |
| Method | Generated Data (GD) | Shapley Value < 0 | Shapley Value > 0 | Positive | Negative |
|---|---|---|---|---|---|
| EDA + CatGAN + Shapley | 2399.80 | 1009.80 | 3789.80 | 2398.80 | 1391.00 |
| EDA + SentiGAN + Shapley | 2399.80 | 1179.40 | 3620.20 | 2393.20 | 1227.00 |
| Model | Validation Macro F1 | Validation SD of F1 | Test Macro F1 | Test SD of F1 |
|---|---|---|---|---|
| EDA(RM) | 0.834 | 0.0114 | 0.828 | 0.0084 |
| EDA(SR) | 0.830 | 0.0100 | 0.828 | 0.0110 |
| EDA(RI) | 0.834 | 0.0084 | 0.826 | 0.0055 |
| EDA(RS) | 0.830 | 0.0071 | 0.822 | 0.0110 |
| EDA(RD) | 0.832 | 0.0110 | 0.828 | 0.0084 |
| Augmentation Model | Validation Macro F1 | Validation SD of F1 | Test Macro F1 | Test SD of F1 |
|---|---|---|---|---|
| Without (only using imbalanced data) | 0.841 | 0.0134 | 0.832 | 0.0083 |
| EDA | 0.834 | 0.0114 | 0.828 | 0.0084 |
| CatGAN | 0.828 | 0.0130 | 0.826 | 0.0055 |
| SentiGAN | 0.826 | 0.0090 | 0.824 | 0.0090 |
| EDA + CatGAN | 0.848 | 0.0123 | 0.830 | 0.0071 |
| EDA + SentiGAN | 0.834 | 0.0114 | 0.832 | 0.0084 |
| EDA + CatGAN + Shapley | 0.860 | 0.0071 | 0.854 | 0.0090 |
| EDA + SentiGAN + Shapley | 0.860 | 0.0071 | 0.854 | 0.0055 |