Cleaning Big Data Streams: A Systematic Literature Review
Abstract
1. Introduction
2. Systematic Literature Review
2.1. Systematic Literature Review Questions
- RQ1: Why is it important to clean data streams?
- RQ2: Which data cleaning issue is most commonly discussed during the data cleaning process?
- RQ3: What sort of techniques are commonly used to clean data?
- RQ4: What methods have been used to evaluate the proposed approaches?
- RQ5: What are the future directions for data stream cleaning?
2.2. Systematic Literature Review Search Strategy
- “Big data” AND (Clean* OR Stream* OR quality);
- “Big data stream*” AND (Clean* OR Outlier* OR anomal* OR abnormal* OR Duplicat* OR redund* OR Irrelevant);
- “Big data stream*” AND (ML OR DM OR AI);
- “Big data stream*” AND (Missing Value* OR Missing data);
- “Big data stream*” AND Noise.
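
As a rough illustration of how these search strings behave, the sketch below evaluates the second query against a piece of text such as a title or abstract. It assumes that a trailing `*` matches any word ending and that matching is case-insensitive. Each database applies its own search syntax, so this is only a local approximation, and the function names `term_to_regex` and `matches_query` are hypothetical.

```python
import re


def term_to_regex(term: str) -> re.Pattern:
    """Compile a search term; a trailing * on a word matches any word ending."""
    parts = []
    for word in term.split():
        if word.endswith("*"):
            parts.append(re.escape(word[:-1]) + r"\w*")
        else:
            parts.append(re.escape(word))
    # Words of a multi-word phrase must appear in order, separated by whitespace.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b", re.IGNORECASE)


def matches_query(text: str, required: str, any_of: list[str]) -> bool:
    """True if `required` matches AND at least one term in `any_of` matches."""
    if not term_to_regex(required).search(text):
        return False
    return any(term_to_regex(term).search(text) for term in any_of)


# Second search string from the list above.
QUERY_2_REQUIRED = "Big data stream*"
QUERY_2_ANY = ["Clean*", "Outlier*", "anomal*", "abnormal*",
               "Duplicat*", "redund*", "Irrelevant"]

if __name__ == "__main__":
    title = "Outlier detection in big data streams"
    print(matches_query(title, QUERY_2_REQUIRED, QUERY_2_ANY))
```

A filter like this could be used to re-check exported search results before manual screening, although it does not replace the databases' own query engines.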
3. Literature Review
3.1. Artificial Intelligence
3.2. Machine Learning
3.3. Deep Learning
3.4. Statistical Techniques
3.5. Combined Techniques
3.6. Unclassified Techniques
4. Discussion
4.1. RQ1: Why Is It Important to Clean Data Streams?
4.2. RQ2: Which Data Cleaning Issue Is Most Commonly Discussed during the Data Cleaning Process?
4.3. RQ3: What Sort of Techniques Are Commonly Used to Clean Data?
4.4. RQ4: What Methods Have Been Used to Evaluate the Proposed Approaches?
4.5. RQ5: What Are the Future Directions for Data Cleaning?
4.5.1. Nature of the Data
4.5.2. Outliers
4.5.3. Duplicated Data
4.5.4. Missing Values
4.5.5. Windowing
4.5.6. Framework
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Aspect | Stream Processing | Batch Processing |
---|---|---|
Data size | Unknown | Known |
Performance | Time-limited: results are expected within seconds or milliseconds | No strict time limit: processing can take hours or days |
Dataset type | Unbounded | Bounded |
Processing | Each record is processed only once | Data can be processed many times |
Example | E-commerce transactions | Payroll system |
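
The contrast in the table above can be sketched in a few lines of code. The example below is only an illustration under simple assumptions: the batch function receives a stored, bounded list and scans it several times, while the stream function sees each value once and keeps a constant-size state. Both function names are hypothetical.

```python
from typing import Iterable, Iterator


def batch_mean_and_range(values: list[float]) -> tuple[float, float]:
    """Batch style: the bounded dataset is stored, so it can be scanned repeatedly."""
    mean = sum(values) / len(values)    # one pass over the data
    spread = max(values) - min(values)  # two more passes over the same data
    return mean, spread


def stream_running_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream style: each value is seen once; only a count and a mean are kept."""
    count = 0
    mean = 0.0
    for x in values:
        count += 1
        mean += (x - mean) / count  # incremental update, no stored history
        yield mean


if __name__ == "__main__":
    readings = [2.0, 4.0, 6.0]
    print(batch_mean_and_range(readings))
    print(list(stream_running_mean(iter(readings))))
```

The streaming version never needs to know the size of the data in advance, which is why it still works when the input is unbounded.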
Query | IEEE Xplore | ACM Library | Scopus | ScienceDirect |
---|---|---|---|---|
1 | 12,609 | 17,275 | 233,243 | 57,495 |
2 | 420 | 68 | 1096 | 174 |
3 | 160 | 21 | 645 | 116 |
4 | 63 | 76 | 101 | 102 |
5 | 115 | 19 | 304 | 125 |
Total | 324,227 |
Query | IEEE Xplore | ACM Library | Scopus | ScienceDirect |
---|---|---|---|---|
1 | 157 | 18 | 496 | 1505 |
2 | 271 | 34 | 920 | 137 |
3 | 142 | 13 | 568 | 99 |
4 | 38 | 37 | 90 | 77 |
5 | 68 | 12 | 241 | 90 |
Total | 5013 |
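
As a quick arithmetic check, the totals reported in the two search-result tables above can be reproduced from the per-query counts. The variable and function names below are illustrative only.

```python
# Per-query counts copied from the two tables above, in column order:
# IEEE Xplore, ACM Library, Scopus, ScienceDirect.
first_table = {
    1: [12609, 17275, 233243, 57495],
    2: [420, 68, 1096, 174],
    3: [160, 21, 645, 116],
    4: [63, 76, 101, 102],
    5: [115, 19, 304, 125],
}
second_table = {
    1: [157, 18, 496, 1505],
    2: [271, 34, 920, 137],
    3: [142, 13, 568, 99],
    4: [38, 37, 90, 77],
    5: [68, 12, 241, 90],
}


def grand_total(table: dict[int, list[int]]) -> int:
    """Sum every cell of a query-by-database table."""
    return sum(sum(row) for row in table.values())


if __name__ == "__main__":
    print(grand_total(first_table))   # compare with the reported total 324,227
    print(grand_total(second_table))  # compare with the reported total 5013
```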
Criteria | Eligible | Ineligible |
---|---|---|
Written language | English | Other languages |
Study | Complete study | Incomplete study |
References | Reliable references | Unreliable references |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alotaibi, O.; Pardede, E.; Tomy, S. Cleaning Big Data Streams: A Systematic Literature Review. Technologies 2023, 11, 101. https://doi.org/10.3390/technologies11040101