Stream-DBSCAN: A Streaming Distributed Clustering Model for Water Quality Monitoring
Abstract
Featured Application
1. Introduction
2. Method Principle
2.1. Distributed Streaming Data Processing
2.2. Density Based Clustering Algorithm
2.3. DBSCAN Algorithm
- (1) Core point: Taking point k as the center and Eps as the radius, let {x1, x2, …, xn} be the points that fall within this neighborhood. If n > MinPts, the density around point k meets the requirement, and point k is recorded as a core point of a cluster. All points within distance Eps of point k are then 'density reachable' from it.
- (2) Boundary point: Taking point k as the center and Eps as the radius, let {x1, x2, …, xn} be the points that fall within this neighborhood. If n < MinPts and point k is directly density reachable from a core point, that is, it lies within the Eps radius of a core point, point k is a boundary point.
- (3) Noise point: Taking point k as the center and Eps as the radius, let {x1, x2, …, xn} be the points that fall within this neighborhood. If n < MinPts and there is no core point within the Eps range, point k is recorded as a noise point.
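The three definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `classify_points` is hypothetical, the neighborhood of a point is assumed to exclude the point itself, and the core test uses the strict `> MinPts` comparison exactly as the text states it.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'boundary', or 'noise'."""
    n = len(points)
    # Eps-neighborhood of each point: indices of the other points within eps.
    # Assumption: a point is not counted as its own neighbor.
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core test follows the text: more than MinPts neighbors.
    is_core = [len(nb) > min_pts for nb in neighbors]

    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            # Not dense enough itself, but within eps of a core point.
            labels.append("boundary")
        else:
            labels.append("noise")
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.25, 0.0), (5.0, 5.0)]
print(classify_points(points, eps=0.2, min_pts=2))
# ['core', 'core', 'core', 'core', 'boundary', 'noise']
```

The pairwise distance scan is quadratic in the number of points, which is acceptable for a sketch; a production implementation would normally use a spatial index.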
2.4. Flink
- (1) Batch and stream processing
- (2) Overview of system architecture
- (1) Layered architecture
- (2) Features of Flink
3. Stream-DBSCAN Water Quality Detection Clustering Model
3.1. Water Quality Data Sampling
3.2. Water Quality Data Preprocessing
3.3. Build the Model
- (1) The Stream-DBSCAN model uses experimental data collected at the Menlou Reservoir from May to August 2019, focusing on three water quality indicators: NH4N, pH, and turbidity. Because the magnitudes of these dimensions differ significantly, the model first standardizes and samples the data, so that every dimension contributes to the result and redundant computation on near-identical records is avoided.
- (2) To preserve the streaming nature of the dataset, the Stream-DBSCAN model uses Kafka to convert sensor-generated data into a continuous data stream before clustering. The stream processor in Kafka continually collects the data stream, applies built-in processing to adjust it, and outputs the stream in real time. Within this framework, Kafka combines the readings from the NH4N, pH, and turbidity sensors on their common timestamp attribute before passing the data stream to the Flink framework. With Kafka, the model benefits from high throughput and low latency: it can process hundreds of thousands of data points per second with a delay of only a few milliseconds.
- (3) During the data partitioning stage, it is important to consider whether the partitioning will yield larger clusters and fewer noise points in the subsequent clustering. Once Kafka passes the data stream into Flink, the DataStream must be converted into a KeyedStream using keyBy for data partitioning. To make the partitioning reasonable, the model applies the K-means algorithm [28] to perform a rough clustering of the dataset [29], placing data with similar values into the same KeyedStream. If the data were fed directly into each node without rough clustering, numerous small clusters would appear in the clustering results, and some points that belong to a cluster would be misjudged as noise points and rejected before cluster merging [30]. Because the dataset is not very large, the model divides it into three partitions. The rough clustering with the K-means algorithm proceeds as follows. Given K cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$, each 3D data point $\eta_i$ is assigned to a class:

  $$c_i = \arg\min_{j} \lVert \eta_i - \mu_j \rVert^2$$

  Here $c_i$ is the class closest to point i among the K classes, with values ranging from 1 to K; $\eta_i$ is the data point to be assigned; and $\mu_j$ denotes each centroid. For the K generated clusters, the centroids $\mu_j$ are updated iteratively:

  $$\mu_j = \frac{\sum_{i} 1\{c_i = j\}\,\eta_i}{\sum_{i} 1\{c_i = j\}}$$

  The Stream-DBSCAN model thus uses the K-means algorithm to coarsely cluster the data, yielding K data streams that are internally similar and mutually distinct.
- (4) During distributed clustering, the choice of the data parallelism K largely determines whether the processing time is acceptable. The K value is therefore adjusted to the size of the data, balancing the computing power of the machines against the time cost of the clustering algorithm. A well-chosen K keeps the data processing time on each node within a manageable budget. For the three data streams generated in step (3), the Stream-DBSCAN model uses three independent nodes to process the streams in parallel. Three tasks are set up in Flink to receive the KeyedStream produced by the partitioning. Each node is assigned one task and runs the DBSCAN algorithm on its share of the dataset. On each node, Eps and MinPts are chosen according to the number of data points and the rate of change of the values. When a node completes its calculation, it returns the result to the master node. At this point, the distributed clustering phase is complete.
- (5) The distributed clustering process produces three sets of clustering sub-results, which are merged in this stage. To do so, the model examines whether the boundaries of the clusters in the three DBSCAN results can be merged. After receiving the three clustering results, the master node merges them as the final step of the Stream-DBSCAN model. The merge calculation over the whole dataset should be simple and efficient; otherwise it slows the entire model and the benefit of distributed computation is lost. To merge the clustering results from the three nodes effectively while minimizing processing time, the Stream-DBSCAN model applies the following overlap-and-merge rule (Figure 3):
- (i)
- (ii) In cluster A and cluster B, respectively, find the edge point closest to the centroid of the other cluster; these are the two points a and b shown in Figure 3c.
- (iii) Calculate the distances $L_{Ab}$ and $L_{Ba}$ from these edge points to the centroid of the other cluster (Figure 3c).
- (iv) Compare $L_{Ab}$ and $L_{Ba}$ with Eps; if both are smaller than Eps, merge clusters A and B.
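The coarse K-means partitioning of step (3) and the merge rule of step (5) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the initial centroids are supplied by the caller, computing each cluster's centroid is assumed to be the first step of the merge rule, and every cluster member is treated as a candidate edge point, since the member nearest to the other centroid lies on the facing edge.

```python
from math import dist


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def assign(points, centroids):
    """Assignment step: put each point into the partition of its nearest centroid."""
    parts = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
        parts[j].append(p)
    return parts


def kmeans_partition(points, centroids, iterations=10):
    """Coarse K-means: alternate assignment and centroid update, return K partitions."""
    centroids = list(centroids)
    for _ in range(iterations):
        parts = assign(points, centroids)
        # Update step; an empty partition keeps its previous centroid.
        centroids = [
            centroid(part) if part else centroids[j] for j, part in enumerate(parts)
        ]
    return assign(points, centroids)


def should_merge(cluster_a, cluster_b, eps):
    """Merge rule: both edge-to-centroid distances must be smaller than eps."""
    center_a, center_b = centroid(cluster_a), centroid(cluster_b)
    a = min(cluster_a, key=lambda p: dist(p, center_b))  # edge point a in cluster A
    b = min(cluster_b, key=lambda p: dist(p, center_a))  # edge point b in cluster B
    l_ab = dist(center_a, b)  # centroid of A to edge point b
    l_ba = dist(center_b, a)  # centroid of B to edge point a
    return l_ab < eps and l_ba < eps
```

For example, `should_merge([(0.0, 0.0), (1.0, 0.0)], [(1.2, 0.0), (2.0, 0.0)], eps=1.0)` returns `True`, because the two distances (0.7 and 0.6) are both below Eps. Only two distances are computed per cluster pair once the edge points are known, which keeps the merge on the master node cheap.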
4. Experiment and Analysis
4.1. Data Analysis
4.2. Clustering Results
4.3. Data Relationship Analysis
4.4. Distributed Time Consumption Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Falkenmark, M.; Lundqvist, J. Comprehensive assessment of the freshwater resources of the world. In World Freshwater Problems: Call for a New Realism; Stockholm Environment Institute: Stockholm, Sweden, 1997.
2. Fengchun, L. Improving the design level of water conservancy planning by using the concept of water resources sustainable development. Heilongjiang Sci. 2017, 8, 170–171.
3. Zhao, Z.N.; Tian, Y.; Zhang, Y.; Yuan, Y.; Luo, P.; Huang, H.J.; Wang, J. Analysis of Connotation and Current Situation of Water Resources Risks in China. Yellow River 2019, 41, 46–50.
4. Dellana, S.A.; West, D. Predictive modeling for wastewater applications: Linear and nonlinear approaches. Environ. Model. Softw. 2009, 24, 96–106.
5. Deng, R.; Wei, S.N. Sewage Quality Prediction Based on LSTM Neural Network and DBSCAN Algorithm. Comput. Telecommun. 2021, 4, 66–73.
6. Liu, J.; Zhu, R.J.; Jiang, D.X.; Wang, D.W.; Xu, C.P.; Nan, J.; Wang, P. Real-time water quality prediction model based on IGA-BPNN method. South-North Water Transf. Water Sci. Technol. 2020, 18, 93–100.
7. Guo, X.J.; Song, J.G.; Han, Y.-M. Water Environmental Capacity of a Reservoir in Yantai. Environ. Sci. Technol. 2006, 29, 43–45.
8. Jiang, Z.; Huang, J. Research on Information System Construction of Menlou Reservoir in Yantai. China Water Power Electrif. 2019, 6, 3–8.
9. Wang, Y.; Jiang, X. Water Pollution Investigation and Water Quality Model Establishment for Menlou Reservoir in Yantai. Environ. Sci. Manag. 2015, 40, 173–176.
10. Ma, J.; Jiang, W. The Concept, Characteristics and Application of Big Data. Natl. Def. Sci. Technol. 2013, 34, 10–17.
11. Zhang, X. Research on Effective Technology to Improve the Accuracy and Stability of Water Quality Testing Results. Shaanxi Water Resour. 2021, 39, 108–111.
12. Zhou, W. Analysis of Water Quality Influencing Factors and Water Quality Prediction in the Three Gorges Reservoir Area. Master’s Thesis, Chongqing Jiaotong University, Chongqing, China, 2020.
13. Zhao, J.; Wei, S.; Wen, X.; Qiu, X. Analysis and prediction of big stream data in real-time water quality monitoring system. J. Ambient. Intell. Smart Environ. 2020, 12, 393–406.
14. Di, Z.; Chang, M.; Guo, P.; Li, Y.; Chang, Y. Using real-time data and unsupervised machine learning techniques to study large-scale spatio-temporal characteristics of wastewater discharges and their influence on surface water quality in the Yangtze River Basin. Water 2019, 11, 1268.
15. Mandel, P.; Wang, Y.; Parre, A.; Féliers, C.; Heim, V. Quality zones automatically identified in water distribution networks by applying data clustering methods to conductivity measurements. Water Res. 2021, 207, 117716.
16. Vries, D.; van den Akker, B.; Vonk, E.; de Jong, W.; van Summeren, J. Application of machine learning techniques to predict anomalies in water supply networks. Water Sci. Technol. Water Supply 2016, 16, 1528–1535.
17. Wu, X.; Zhu, X.; Wu, G.-Q.; Ding, W. Data mining with big data. IEEE Trans. Knowl. Data Eng. 2014, 26, 97–107.
18. Storey, V.; Song, I. Big data technologies and management: What conceptual modeling can do. Data Knowl. Eng. 2017, 108, 50–62.
19. Arora, P.; Deepali, D.; Varshney, S. Analysis of K-Means and K-Medoids algorithm for big data. Procedia Comput. Sci. 2016, 78, 507–512.
20. Huang, W.; Meng, L.; Zhang, D.; Zhang, W. In-memory parallel processing of massive remotely sensed data using an Apache Spark on Hadoop YARN model. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 3–19.
21. Shirkhorshidi, A.S.; Aghabozorgi, S.; Wah, T.Y.; Herawan, T. Big Data Clustering: A Review; ICCSA 2014; Springer: Cham, Switzerland, 2014.
22. Cheng, D.; Zhu, Q.; Huang, J.; Wu, Q.; Yang, L. Clustering with Local Density Peaks-Based Minimum Spanning Tree. IEEE Trans. Knowl. Data Eng. 2021, 33, 374–387.
23. Du, M.; Zhao, J.; Sun, J.; Dong, Y. M3W: Multistep three-way clustering. IEEE Trans. Neural Netw. Learn. Syst. 2022, 1–14.
24. Li, S.S. An Improved DBSCAN Algorithm Based on the Neighbor Similarity and Fast Nearest Neighbor Query. IEEE Access 2020, 8, 47468–47476.
25. Shi, A.; Yin, J.; Fan, P. Spark Parallelization Improved SDKB-DBSCAN Clustering Algorithm. Mod. Comput. 2021, 14, 14–20+37.
26. Pule, M.; Yahya, A.; Chuma, J. Wireless sensor networks: A survey on monitoring water quality. J. Appl. Res. Technol. 2017, 15, 568–570.
27. Adu-Manu, K.S.; Tapparello, C.; Heinzelman, W.; Katsriku, F.A.; Abdulai, J.-D. Water Quality Monitoring Using Wireless Sensor Networks: Current Trends and Future Research Directions. ACM Trans. Sens. Netw. 2017, 13, 1–41.
28. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. J. R. Stat. Soc. 1979, 28, 100–108.
29. Gholizadeh, N.; Saadatfar, H.; Hanafi, N. K-DBSCAN: An improved DBSCAN algorithm for big data. J. Supercomput. 2021, 77, 6214–6235.
30. Mo, Y. Design and Implementation of a Water Quality Monitoring System Server Side. Master’s Thesis, Huazhong University of Science and Technology, Wuhan, China, 2015.
Feature | Storm | Spark Streaming | Flink |
---|---|---|---|
Streaming model | Native | Mini-batch | Native |
Consistency guarantee | At-least-once / at-most-once | Exactly-once | Exactly-once |
Latency | Low (milliseconds) | High (seconds) | Low (milliseconds) |
Throughput | Low | High | High |
Fault tolerance | ACK | RDD-based checkpoint | Checkpoint (Chandy-Lamport) |
Stateful | No | Yes (DStream) | Yes (Operator) |
SQL support | No | Yes | Yes |
Indicator | Total | S.D. | Avg | Min | Max |
---|---|---|---|---|---|
pH | 200,000 | 0.365684062 | 8.4248135 | 0 | 10.68473 |
NH4N | 200,000 | 44.60782894 | 3.2476836 | 0 | 1000.000 |
Turbidity | 200,000 | 92.26919696 | 39.706503 | 0 | 1000.000 |
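Section 3.3 standardizes the three indicators before clustering because their magnitudes differ, as the S.D. and Max columns above make clear. The paper does not say which standardization it uses; the sketch below assumes the common z-score form, (value − Avg) / S.D., with the Avg and S.D. figures copied from the table. The names `STATS` and `standardize` are hypothetical.

```python
# (Avg, S.D.) per indicator, copied from the table above.
STATS = {
    "pH": (8.4248135, 0.365684062),
    "NH4N": (3.2476836, 44.60782894),
    "Turbidity": (39.706503, 92.26919696),
}


def standardize(sample):
    """Z-score each indicator of one sample (assumed method, not confirmed by the paper)."""
    return {
        name: (value - STATS[name][0]) / STATS[name][1]
        for name, value in sample.items()
    }


# A reading equal to the mean maps to 0 on every dimension.
print(standardize({"pH": 8.4248135, "NH4N": 3.2476836, "Turbidity": 39.706503}))
# {'pH': 0.0, 'NH4N': 0.0, 'Turbidity': 0.0}
```

After this transformation a single Eps value is meaningful across all three dimensions, since each is expressed in units of its own standard deviation.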
Cluster | Number |
---|---|
I | 147 |
II | 1899 |
III | 970 |
IV | 1198 |
V | 163,624 |
Noise | 32,159 |
Indicator | Total | S.D. | Avg | Min | Max |
---|---|---|---|---|---|
pH | 167,840 | 0.261720083 | 8.4619434 | 6.093347 | 9.175494 |
NH4N | 167,840 | 29.57231904 | 0.72258126 | 0 | 4.078737 |
Turbidity | 167,840 | 8.499657004 | 20.709381 | 0.52096874 | 61.88089 |
Cluster | Time | Quality |
---|---|---|
I | 8:00 p.m.–10:00 p.m. | 4 |
II | 6:00 p.m.–8:00 p.m. | 3 |
III | 12:00 a.m.–3:00 p.m. | 2 |
IV | 6:00 a.m.–8:00 a.m. | 1 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mu, C.; Hou, Y.; Zhao, J.; Wei, S.; Wu, Y. Stream-DBSCAN: A Streaming Distributed Clustering Model for Water Quality Monitoring. Appl. Sci. 2023, 13, 5408. https://doi.org/10.3390/app13095408