Research on Data Cleaning Algorithm Based on Multi Type Construction Waste
Abstract
1. Introduction
2. Materials and Methods
2.1. Algorithm Flow
2.2. Multi-Algorithm Constrained Model
2.3. Natural Language Data Cleaning Model
2.4. Time Series Data Cleaning Model
3. Results and Evaluation
3.1. Experimental Data
3.2. Experimental Results
3.3. Result Analysis
3.3.1. Construction Waste Natural Language Data
3.3.2. Construction Waste Time Series Data
3.3.3. Comparative Analysis of the Cleaning Effect
4. Discussion
4.1. Cleaning Model Design and Advantages
4.2. Design and Advantages of the Multi-Algorithm Constrained Models
4.3. Limitations of the Algorithm
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
| Indicator Trend Type | Indicator Feature | Example |
|---|---|---|
| Very large indicator | The bigger, the better | Grades, GDP, etc. |
| Very small indicator | The smaller, the better | Expenses, defect rate, etc. |
| Intermediate indicator | The closer to a target value, the better | pH value in water quality assessment |
| Interval indicator | Falling within a given range is better | Body temperature, etc. |
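Before TOPSIS-style scoring, indicators of the latter three types are usually forward-converted into "very large" (the bigger, the better) indicators. A minimal Python sketch of common conversion formulas is given below; the function names and exact formulas are illustrative assumptions, not the paper's own implementation.

```python
import numpy as np

def to_benefit_very_small(x):
    # Very small indicator: smaller is better, so invert by max - x
    return np.max(x) - x

def to_benefit_intermediate(x, best):
    # Intermediate indicator: the closer to `best`, the better
    d = np.abs(x - best)
    m = np.max(d)
    return 1 - d / m if m > 0 else np.ones_like(x, dtype=float)

def to_benefit_interval(x, lo, hi):
    # Interval indicator: values inside [lo, hi] are best
    d = np.where(x < lo, lo - x, np.where(x > hi, x - hi, 0.0))
    m = np.max(d)
    return 1 - d / m if m > 0 else np.ones_like(x, dtype=float)

# Example: convert a pH column (intermediate type, ideal value 7) into a benefit indicator
ph = np.array([6.2, 6.9, 7.0, 7.8, 8.5])
print(to_benefit_intermediate(ph, best=7.0))
```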
| Metric | Inventory Check | Meeting Minutes | Leadership Team | Implementation | Landfill | Regulatory | Special Planning | Resource Facility | Overall Progress | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Amount of dirty data in the natural language data | 85 | 70 | 15 | 112 | 213 | 102 | 255 | 159 | 51 | 1062 |
| Number of errors in the natural language data | 9 | 7 | 2 | 13 | 23 | 13 | 29 | 22 | 4 | 122 |
| Number of correct records in the natural language data | 76 | 63 | 13 | 99 | 190 | 89 | 226 | 137 | 47 | 940 |
| Number of missed records in the natural language data | 10 | 8 | 2 | 12 | 13 | 5 | 12 | 11 | 5 | 78 |
| Natural language data precision | 89.41% | 90.00% | 86.66% | 88.39% | 89.20% | 87.25% | 88.62% | 86.16% | 92.16% | 88.65% |
| Natural language data recall | 88.97% | 95.97% | 88.89% | 89.68% | 93.97% | 94.88% | 94.98% | 92.86% | 90.77% | 92.33% |
| Amount of dirty data in the time series data model | 82 | 86 | 32 | 187 | 278 | 115 | 411 | 96 | 48 | 1335 |
| Number of errors in the time series data model | 8 | 8 | 2 | 23 | 33 | 10 | 52 | 11 | 4 | 151 |
| Number of correct records in the time series data model | 74 | 78 | 30 | 164 | 245 | 105 | 359 | 85 | 44 | 1184 |
| Number of missed records in the time series data model | 6 | 6 | 3 | 12 | 19 | 8 | 27 | 6 | 4 | 91 |
| Time series data model precision | 89.11% | 89.51% | 89.96% | 87.22% | 87.63% | 89.75% | 85.22% | 88.21% | 91.51% | 88.68% |
| Time series data model recall | 92.96% | 93.11% | 92.92% | 93.52% | 90.54% | 94.23% | 92.48% | 93.54% | 92.46% | 92.86% |
| Amount of dirty data detected by the algorithm in this study | 154 | 102 | 45 | 227 | 344 | 174 | 456 | 283 | 61 | 1846 |
| Number of errors for the algorithm in this study | 11 | 6 | 2 | 19 | 19 | 8 | 23 | 20 | 5 | 113 |
| Number of correct records for the algorithm in this study | 143 | 96 | 43 | 208 | 325 | 166 | 433 | 263 | 56 | 1733 |
| Number of missed records for the algorithm in this study | 2 | 3 | 1 | 3 | 8 | 6 | 8 | 4 | 2 | 37 |
| Precision of the algorithm in this study | 92.99% | 94.52% | 96.12% | 91.48% | 94.32% | 95.23% | 94.84% | 93.25% | 92.18% | 93.88% |
| Recall of the algorithm in this study | 98.81% | 97.63% | 98.71% | 98.81% | 97.18% | 96.77% | 97.96% | 98.65% | 96.93% | 97.90% |
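In the natural language rows above, precision equals the number of correctly cleaned records divided by the amount of dirty data detected (for example, 76/85 ≈ 89.41% for the inventory check category), while recall additionally accounts for missed records. A minimal sketch of these conventional definitions follows; the function names are illustrative, and the paper's recall may be computed slightly differently.

```python
def precision(correct: int, flagged_dirty: int) -> float:
    """Fraction of records flagged as dirty that were cleaned correctly."""
    return correct / flagged_dirty

def recall(correct: int, missed: int) -> float:
    """Conventional recall: correct detections over all records that needed cleaning.
    Assumed definition; the paper may use a slightly different formula."""
    return correct / (correct + missed)

# Worked check against the inventory check column of the natural language rows
print(f"{precision(76, 85):.2%}")  # ~89.41%, matching the reported precision
```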
| Cleaning Technology | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
|---|---|---|---|---|---|---|---|---|
| Levenshtein Distance | 0.667 | 0.692 | 0.716 | 0.73 | 0.747 | 0.754 | 0.768 | 0.78 |
| S-W | 0.7 | 0.739 | 0.757 | 0.775 | 0.784 | 0.788 | 0.796 | 0.807 |
| Cleaning algorithm proposed in this paper | 0.875 | 0.889 | 0.894 | 0.919 | 0.929 | 0.945 | 0.957 | 0.891 |
| Cleaning Technology | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 |
|---|---|---|---|---|---|---|---|---|
| Levenshtein Distance | 0.687 | 0.711 | 0.715 | 0.727 | 0.735 | 0.742 | 0.759 | 0.764 |
| S-W | 0.727 | 0.731 | 0.738 | 0.745 | 0.757 | 0.762 | 0.775 | 0.782 |
| Cleaning algorithm proposed in this paper | 0.842 | 0.857 | 0.864 | 0.873 | 0.883 | 0.895 | 0.911 | 0.911 |
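The baselines compared above are string-similarity cleaners: Levenshtein edit distance and S-W (presumably Smith-Waterman local alignment). The sketch below shows how a Levenshtein-based baseline can match a dirty field against reference terms; the dictionary, threshold, and helper names are illustrative assumptions, not the implementations evaluated in the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute, cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def best_match(dirty: str, reference_terms: list[str], threshold: float = 0.8):
    """Return the closest reference term if its normalized similarity exceeds the threshold."""
    scored = [(t, 1 - levenshtein(dirty, t) / max(len(dirty), len(t))) for t in reference_terms]
    term, sim = max(scored, key=lambda p: p[1])
    return term if sim >= threshold else None

# Illustrative (hypothetical) dictionary of construction-waste terms
print(best_match("enginering muck", ["engineering muck", "engineering mud", "demolition garbage"]))
```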
| Evaluation Parameter | Total Waste Stock (This Paper) | Total Waste Stock (LSTM) | Engineering Muck (This Paper) | Engineering Muck (LSTM) | Engineering Mud (This Paper) | Engineering Mud (LSTM) | Engineering Waste (This Paper) | Engineering Waste (LSTM) | Demolition Garbage (This Paper) | Demolition Garbage (LSTM) |
|---|---|---|---|---|---|---|---|---|---|---|
| RMSE | 11.59 | 20.85 | 13.24 | 25.85 | 12.59 | 26.84 | 11.51 | 20.86 | 9.597 | 18.85 |
| MAPE (%) | 8.84 | 16.24 | 9.84 | 18.24 | 10.86 | 19.74 | 8.94 | 16.24 | 6.84 | 14.24 |
| Evaluation Parameter | Algorithm of This Paper (Fill Method) | LSTM (Fill Method) |
|---|---|---|
| Average RMSE | 11.708 | 22.653 |
| Average MAPE | 9.064% | 16.942% |
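The averages above are consistent with the per-category values in the preceding table (the mean of the five RMSE values for the proposed fill method is about 11.71, and the mean of the five MAPE values is 9.064). A short sketch, assuming the standard RMSE and MAPE formulas, is shown below.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Averaging the per-category scores reported above for the proposed fill method
rmse_per_category = [11.59, 13.24, 12.59, 11.51, 9.597]
mape_per_category = [8.84, 9.84, 10.86, 8.94, 6.84]
print(np.mean(rmse_per_category))  # ~11.71, close to the reported average of 11.708
print(np.mean(mape_per_category))  # 9.064, matching the reported average MAPE
```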
| N-gram Algorithm | Natural Language Data Cleaning Model | Multi-Algorithm Constraint Model + Time Series Data Cleaning Model | Natural Language Precision | Natural Language Recall | Time Series Data Precision | Time Series Data Recall |
|---|---|---|---|---|---|---|
| × | √ | × | 81.73% | 84.56% | N/A | N/A |
| √ | √ | × | 88.65% | 92.33% | N/A | N/A |
| √ | √ | √ | 94.87% | 97.90% | 97.13% | 98.67% |
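The first ablation column concerns the N-gram component of the natural language data cleaning model. The sketch below illustrates one common realization of such a component, character bigram similarity via the Dice coefficient between a misspelled field and a reference term; the bigram size, the Dice measure, and the example strings are assumptions for illustration only.

```python
def ngrams(text: str, n: int = 2) -> set[str]:
    """Character n-grams of a string (bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-gram sets, in [0, 1]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Hypothetical example: a misspelled record field vs. a reference term
print(round(dice_similarity("demolision garbage", "demolition garbage"), 3))
```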
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).