A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam
Abstract
1. Introduction
- A fast filter-mode classifier determines whether each input tweet is spam.
- Every filtered spam tweet is paraphrased to generate a new spam sentence with different wording but the same meaning.
- The outputs of an ensemble of deep learning methods are combined with statistical features to decide the output of the classifier.
2. Literature Review
3. Problem Statement
4. The Proposed Model
4.1. Learning from Detected Spam Tweets
4.2. Generate New Tweets
4.3. Ensemble Method
4.3.1. Convolutional Neural Network
4.3.2. Recurrent Neural Networks
4.3.3. Feature-Based Model
4.3.4. Proposed Ensemble Approach
- First, a CNN with four convolution layers is trained with Twitter GloVe embeddings [37].
- Second, a CNN with four convolution layers extracts features, which are then classified with the SVM algorithm. This CNN is trained with Twitter GloVe embeddings in all dimensions.
- Third, an LSTM network is trained with the Hspam14 dataset, which contains 14 million tweets [18], and with Twitter GloVe embeddings.
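One way to read the combination step is as stacking: each base model emits a spam probability, and those probabilities are concatenated with the statistical features and passed to a meta-classifier. The sketch below is a minimal illustration under that assumption. The function names, the linear scoring rule, the weights, and the 0.5 threshold are all hypothetical stand-ins, not the trained meta-classifier used in the paper.

```python
import math
from typing import List, Sequence


def build_meta_features(base_probs: Sequence[float],
                        stat_features: Sequence[float]) -> List[float]:
    """Concatenate base-model spam probabilities with statistical features."""
    return list(base_probs) + list(stat_features)


def meta_predict(meta_features: Sequence[float],
                 weights: Sequence[float],
                 bias: float = 0.0,
                 threshold: float = 0.5) -> bool:
    """Hypothetical linear meta-classifier: logistic score over the meta-features.

    Returns True when the tweet is classified as spam.
    """
    if len(meta_features) != len(weights):
        raise ValueError("meta_features and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, meta_features))
    score = 1.0 / (1.0 + math.exp(-z))
    return score >= threshold


# Illustrative values only: three base-model probabilities (e.g. CNN, CNN + SVM,
# LSTM) plus one statistical feature that this toy weighting ignores.
features = build_meta_features([0.9, 0.8, 0.7], [1.0])
print(meta_predict(features, weights=[1.0, 1.0, 1.0, 0.0], bias=-1.5))
```

In practice the weights would be learned from held-out predictions of the base models rather than set by hand.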
5. Experiments and Results
5.1. Dataset
5.2. Evaluation Metrics
5.3. Experimental Settings
6. Results and Discussion
6.1. Primary Twitter Filter
6.2. User-Based Features
6.3. Ensemble Method
6.4. Meta-Classifier
6.5. Performance of Learned Model
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chu, Z.; Widjaja, I.; Wang, H. Detecting social spam campaigns on twitter. In Proceedings of the International Conference on Applied Cryptography and Network Security, Singapore, 26–29 June 2012; pp. 455–472.
- Ghosh, S.; Viswanath, B.; Kooti, F.; Sharma, N.K.; Korlam, G.; Benevenuto, F.; Ganguly, N.; Gummadi, K.P. Understanding and combating link farming in the twitter social network. In Proceedings of the 21st International Conference on World Wide Web, Lyon, France, 16–20 April 2012; pp. 61–70.
- Adewole, K.S.; Anuar, N.B.; Kamsin, A.; Varathan, K.D.; Razak, S.A. Malicious accounts: Dark of the social networks. J. Netw. Comput. Appl. 2017, 79, 41–67.
- Zhu, Y.; Wang, X.; Zhong, E.; Liu, N.N.; Li, H.; Yang, Q. Discovering spammers in social networks. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, ON, Canada, 22–26 July 2012.
- Lee, S.; Kim, J. Warningbird: A near real-time detection system for suspicious urls in twitter stream. IEEE Trans. Dependable Secur. Comput. 2013, 10, 183–195.
- Grier, C.; Thomas, K.; Paxson, V.; Zhang, M. @ spam: The underground on 140 characters or less. In Proceedings of the 17th ACM Conference on Computer and Communications Security, Chicago, IL, USA, 4–8 October 2010; pp. 27–37.
- Thomas, K.; Grier, C.; Ma, J.; Paxson, V.; Song, D. Design and evaluation of a real-time url spam filtering service. In Proceedings of the 2011 IEEE Symposium on Security and Privacy, Oakland, CA, USA, 22–25 May 2011; pp. 447–462.
- Wu, T.; Wen, S.; Xiang, Y.; Zhou, W. Twitter spam detection: Survey of new approaches and comparative study. Comput. Secur. 2018, 76, 265–284.
- Ma, J.; Saul, L.K.; Savage, S.; Voelker, G.M. Learning to detect malicious urls. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–24.
- Yardi, S.; Romero, D.; Schoenebeck, G. Detecting spam in a twitter network. First Monday 2010, 15.
- Lee, K.; Caverlee, J.; Webb, S. Uncovering social spammers: Social honeypots+ machine learning. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Geneva, Switzerland, 19–23 July 2010; pp. 435–442.
- Benevenuto, F.; Magno, G.; Rodrigues, T.; Almeida, V. Detecting spammers on twitter. In Proceedings of the Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), Redmond, WA, USA, 13–14 July 2010; Volume 6, p. 12.
- Stringhini, G.; Kruegel, C.; Vigna, G. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, Austin, TX, USA, 6–10 December 2010; pp. 1–9.
- Wang, A.H. Don’t follow me: Spam detection in twitter. In Proceedings of the 2010 International Conference on Security and Cryptography (SECRYPT), Athens, Greece, 26–28 July 2010; pp. 1–10.
- Song, J.; Lee, S.; Kim, J. Spam filtering in twitter using sender-receiver relationship. In Proceedings of the International Workshop on Recent Advances in Intrusion Detection, Menlo Park, CA, USA, 20–21 September 2011; pp. 301–317.
- Yang, C.; Harkreader, R.; Gu, G. Empirical evaluation and new design for fighting evolving twitter spammers. IEEE Trans. Inf. Forensics Secur. 2013, 8, 1280–1293.
- Mostafa, M.; Abdelwahab, A.; Sayed, H.M. Detecting spam campaign in twitter with semantic similarity. J. Phys. Conf. Ser. 2020, 1447, 12044.
- Sedhai, S.; Sun, A. Hspam14: A collection of 14 million tweets for hashtag-oriented spam research. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 223–232.
- Sedhai, S.; Sun, A. Semi-supervised spam detection in Twitter stream. IEEE Trans. Comput. Soc. Syst. 2017, 5, 169–175.
- Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1188–1196.
- Hosseinalipour, A.; Ghanbarzadeh, R. A novel approach for spam detection using horse herd optimization algorithm. In Neural Computing & Applications; Springer: New York, NY, USA, 2022.
- Abayomi-Alli, O.; Misra, S.; Abayomi-Alli, A. A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset. In Concurrency and Computation Practice and Experience; Wiley: Hoboken, NJ, USA, 2022.
- Sitaula, C.; Basnet, A.; Mainali, A.; Shahi, T.B. Deep Learning-Based Methods for Sentiment Analysis on Nepali COVID-19-Related Tweets. Comput. Intell. Neurosci. 2021, 2021, 2158184.
- Shahi, T.B.; Sitaula, C.; Paudel, N. A Hybrid Feature Extraction Method for Nepali COVID-19-Related Tweets Classification. Comput. Intell. Neurosci. 2022, 2022, 5681574.
- Aizawa, A. An information-theoretic perspective of TF–IDF measures. Inf. Process. Manag. 2003, 39, 45–65.
- Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162.
- Sun, F.; Liu, J.; Wu, J.; Pei, C.; Lin, X.; Ou, W.; Jiang, P. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, Beijing, China, 3–7 November 2019; pp. 1441–1450.
- Egele, M.; Stringhini, G.; Kruegel, C.; Vigna, G. Toward detecting compromised accounts on social networks. IEEE Trans. Dependable Secur. Comput. 2017, 14, 447–460.
- Chen, C.; Wang, Y.; Zhang, J.; Xiang, Y.; Zhou, W.; Min, G. Statistical features-based real-time detection of drifted twitter spam. IEEE Trans. Inf. Forensics Secur. 2016, 12, 914–925.
- Whole Product Dynamic Real-World Protection Test. 2016. Available online: https://www.av-comparatives.org/testmethod/real-world-protection-tests/ (accessed on 12 August 2020).
- Dasu, T.; Krishnan, S.; Venkatasubramanian, S.; Yi, K. An information-theoretic approach to detecting changes in multi-dimensional data streams. In Proceedings of the Symposium on the Interface of Statistics, Computing Science, and Applications, Pasadena, CA, USA, 24–27 May 2006.
- Gama, J.; Žliobaitė, I.; Bifet, A.; Pechenizkiy, M.; Bouchachia, A. A survey on concept drift adaptation. ACM Comput. Surv. 2014, 46, 1–37.
- Csiszar, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Cambridge University Press: Cambridge, UK, 2011.
- Chen, C.; Zhang, J.; Xiang, Y.; Zhou, W.; Oliver, J. Spammers are becoming “Smarter” on Twitter. IT Prof. 2016, 18, 66–70.
- Ma, S.; Sun, X.; Li, W.; Li, S.; Li, W.; Ren, X. Query and output: Generating words by querying distributed word representations for paraphrase generation. arXiv 2018, arXiv:1803.01465.
- Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Verma, M.; Sofat, S. Techniques to detect spammers in twitter: A survey. Int. J. Comput. Appl. 2014, 85.
- Zhang, J.; Chen, C.; Chen, X.; Xiang, Y.; Zhou, W. 6 million spam tweets: A large ground truth for timely Twitter spam detection. In Proceedings of the IEEE International Conference on Communications, London, UK, 8–12 June 2015; pp. 7065–7070.
- Wang, B.; Zubiaga, A.; Liakata, M.; Procter, R. Making the most of tweet-inherent features for social spam detection on Twitter. arXiv 2015, arXiv:1503.07405.
- Madisetty, S.; Desarkar, M.S. A neural network-based ensemble approach for spam detection in Twitter. IEEE Trans. Comput. Soc. Syst. 2018, 5, 973–984.
- Agarap, A.F. An architecture combining convolutional neural network (CNN) and support vector machine (SVM) for image classification. arXiv 2017, arXiv:1712.03541.
Feature | D-1 vs. D-2 | | D-2 vs. D-3 | | D-3 vs. D-4 | | D-4 vs. D-5 | | D-5 vs. D-6 | | D-6 vs. D-7 | | D-7 vs. D-8 | | D-8 vs. D-9 | | D-9 vs. D-10 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
F-1 | 0.37 | 0.05 | 0.35 | 0.04 | 0.45 | 0.05 | 0.25 | 0.04 | 0.27 | 0.04 | 0.28 | 0.04 | 0.30 | 0.06 | 0.27 | 0.04 | 0.35 | 0.05 |
F-2 | 0.25 | 0.11 | 0.23 | 0.11 | 0.27 | 0.11 | 0.20 | 0.11 | 0.22 | 0.11 | 0.22 | 0.11 | 0.18 | 0.11 | 0.39 | 0.11 | 0.36 | 0.11 |
F-3 | 0.29 | 0.08 | 0.23 | 0.08 | 0.33 | 0.08 | 0.16 | 0.08 | 0.23 | 0.08 | 0.21 | 0.08 | 0.21 | 0.09 | 0.27 | 0.09 | 0.24 | 0.09 |
F-4 | 0.17 | 0.08 | 0.14 | 0.08 | 0.15 | 0.09 | 0.15 | 0.08 | 0.18 | 0.08 | 0.20 | 0.08 | 0.14 | 0.08 | 0.28 | 0.09 | 0.20 | 0.09 |
F-5 | 0.03 | 0.02 | 0.03 | 0.02 | 0.04 | 0.02 | 0.03 | 0.02 | 0.02 | 0.02 | 0.03 | 0.02 | 0.02 | 0.02 | 0.06 | 0.02 | 0.06 | 0.02 |
F-6 | 0.99 | 0.36 | 0.53 | 0.36 | 0.64 | 0.36 | 0.37 | 0.36 | 0.46 | 0.35 | 0.41 | 0.35 | 0.46 | 0.36 | 0.51 | 0.36 | 0.53 | 0.37 |
F-7 | 0.11 | 0.05 | 0.09 | 0.04 | 0.05 | 0.05 | 0.05 | 0.05 | 0.06 | 0.04 | 0.08 | 0.05 | 0.07 | 0.05 | 0.11 | 0.05 | 0.09 | 0.05 |
F-8 | 0.20 | 0 | 0 | 0 | 0.05 | 0 | 0.04 | 0 | 0.03 | 0 | 0.04 | 0 | 0.02 | 0 | 0.05 | 0 | 0.03 | 0 |
F-9 | 0.10 | 0 | 0.04 | 0 | 0.02 | 0 | 0.03 | 0 | 0.02 | 0 | 0.02 | 0 | 0 | 0 | 0.05 | 0 | 0.02 | 0 |
F-10 | 0 | 0 | 0.04 | 0 | 0.04 | 0 | 0.02 | 0 | 0.11 | 0 | 0 | 0 | 0.02 | 0 | 0.33 | 0 | 0.28 | 0 |
F-11 | 0.27 | 0.02 | 0.07 | 0.02 | 0.07 | 0.02 | 0.12 | 0.02 | 0.11 | 0 | 0.1 | 0 | 0.27 | 0.02 | 0.29 | 0.03 | 0.21 | 0.03 |
F-12 | 0.05 | 0 | 0 | 0 | 0.03 | 0 | 0.04 | 0.02 | 0.04 | 0 | 0.05 | 0 | 0.05 | 0 | 0.47 | 0 | 0.47 | 0 |
Feature No. | Title | Description |
---|---|---|
F1 | Age of account | The number of days from the account's creation date until its last posted tweet |
F2 | Number of followers | The number of followers of this Twitter account |
F3 | Number of followings | The number of friends (accounts followed) of this Twitter account |
F4 | Number of user favorites | The number of favorites this Twitter account has added |
F5 | Number of lists | The number of lists this Twitter account added |
F6 | Number of tweets | The number of tweets this Twitter account has posted |
F7 | Number of retweets | The number of retweets of this tweet |
F8 | Number of hashtags | The number of hashtags included in this tweet |
F9 | Number of URLs | The number of URLs included in this tweet |
F10 | Number of chars | The number of characters in this tweet |
F11 | Number of digits | The number of digits in this tweet |
F12 | Number of user mentions | The number of user mentions included in this tweet |
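The tweet-level counts in this table (hashtags, URLs, characters, digits, user mentions) can be derived from the tweet text alone, whereas the account-level features come from profile metadata. The following is a minimal sketch of the text-based counts. The regular expressions and the function name `tweet_content_features` are illustrative assumptions, not the extraction rules used by the authors.

```python
import re
from typing import Dict

# Simplified patterns (assumptions): real tweets may need stricter tokenization.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
URL_RE = re.compile(r"https?://\S+")


def tweet_content_features(text: str) -> Dict[str, int]:
    """Count the tweet-level statistical features from raw tweet text."""
    return {
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_urls": len(URL_RE.findall(text)),
        "num_chars": len(text),
        "num_digits": sum(ch.isdigit() for ch in text),
        "num_mentions": len(MENTION_RE.findall(text)),
    }


sample = "Win a FREE phone now!! http://t.co/abc123 #free #win @user1"
print(tweet_content_features(sample))
```

Note that digits inside URLs and user names are counted as well; whether they should be is a design choice the table does not settle.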
Dataset No. | Type | Spam:Not-Spam |
---|---|---|
1 | random | 200 k:200 k |
2 | continuous | 200 k:200 k |
3 | random | 50 k:1000 k |
4 | continuous | 50 k:1000 k |
 | Predicted spam | Predicted not-spam |
---|---|---|
Actual spam | TP | FN |
Actual not-spam | FP | TN |
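Precision, recall, and F-measure follow from the confusion-matrix counts. The sketch below uses the standard definitions with spam taken as the positive class, so a false negative is a spam tweet predicted as not-spam and a false positive is a legitimate tweet predicted as spam. The function name and the example counts are illustrative and are not taken from the experiments.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F-measure from confusion-matrix counts.

    tp: spam predicted as spam
    fp: not-spam predicted as spam
    fn: spam predicted as not-spam
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up counts for illustration only.
p, r, f = precision_recall_f1(tp=96, fp=4, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))
```

True negatives do not enter any of the three metrics, which is why they remain informative on the imbalanced spam to not-spam splits.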
 | Method | Precision | Recall | F-Measure |
---|---|---|---|---|
First module | MaxEntropy | 0.96 | 0.95 | 0.95 |
 | RandomForest | 0.96 | 0.95 | 0.95 |
 | SVM | 0.97 | 0.96 | 0.96 |
 | LSTM | 0.95 | 0.96 | 0.95 |
 | CNN | 0.92 | 0.95 | 0.93 |
 | CNN + SVM | 0.95 | 0.95 | 0.95 |
 | Random Forest (user-based feature) | 0.96 | 0.90 | 0.93 |
 | SVM (user-based feature) | 0.94 | 0.84 | 0.89 |
 | Chen et al. [41] | 0.85 | 0.64 | 0.73 |
 | Wang et al. [40] | 0.94 | 0.80 | 0.86 |
 | Madisetty et al. [42] | 0.94 | 0.95 | 0.94 |
 | Proposed method | 0.96 | 0.96 | 0.96 |
 | Method | Precision | Recall | F-Measure |
---|---|---|---|---|
First module | MaxEntropy | 0.96 | 0.95 | 0.95 |
 | RandomForest | 0.96 | 0.95 | 0.95 |
 | SVM | 0.96 | 0.96 | 0.96 |
 | LSTM | 0.98 | 0.93 | 0.95 |
 | CNN | 0.95 | 0.89 | 0.93 |
 | CNN + SVM | 0.97 | 0.89 | 0.93 |
 | Random Forest (user-based feature) | 0.60 | 0.70 | 0.65 |
 | Chen et al. [41] | 0.58 | 0.67 | 0.62 |
 | Wang et al. [40] | 0.79 | 0.76 | 0.77 |
 | Madisetty et al. [42] | 0.92 | 0.94 | 0.93 |
 | Proposed method | 0.97 | 0.95 | 0.96 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abdelwahab, A.; Mostafa, M. A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam. Appl. Sci. 2022, 12, 6407. https://doi.org/10.3390/app12136407