1. Summary
Environmental health can be assessed by considering five key factors: soil, water, climate, natural vegetation, and landscapes. Of these, water plays the most critical role in supporting human life and the survival of ecosystems [1]. It is needed for drinking, household use, food production, and recreation, which makes safe and clean water an essential requirement for public health [2]. Maintaining proper water quality is therefore vital for preventing significant harm to human well-being and for preserving the ecological balance on which other species depend [3]. Water pollution is a significant global problem that requires ongoing evaluation and international effort to manage water resources effectively, from the broad regional scale down to individual wells. Numerous studies have shown that water pollution is a leading cause of death and illness worldwide, resulting in numerous daily fatalities [4]. In many developing nations, untreated or contaminated water is consumed because of limited public and administrative awareness and the absence of a water quality monitoring system, leading to severe health complications [5,6]. Predicting pollution or declining water quality and issuing early warnings allows preventive measures to be implemented promptly, which is especially relevant in Thailand.
In Thailand, the Metropolitan Waterworks Authority (MWA) has the primary responsibility for supplying and distributing water to various regions, mainly using raw water from the Chao Phraya River. Several studies on raw water quality management in the Chao Phraya River have been carried out in cooperation with the MWA. Ref. [7] suggested innovative approaches for managing the saltwater influx at the Samlae Water Pumping Station, the MWA’s primary intake station on the Chao Phraya River, and emphasized the need for enhanced cooperation from the MWA in water resource management. Ref. [8] concluded that urban water supply systems generally met water quality standards, except for color and iron issues caused by problems in the sedimentation process and the presence of iron pipes. Ref. [9] proposed a one-dimensional flow simulation model that proved valuable for optimizing water management, enabling energy savings and efficient emergency water discharge planning in the West Water Canal. Ref. [10] identified the optimal coagulants, their dosages, and their cost efficiencies for treating raw water with low, normal, and high turbidity, enabling the MWA to meet water quality standards with greater clarity and cost-effectiveness. These previous studies focus only on methods for distributing water and monitoring the water quality of the Chao Phraya River. A complete dataset on the river’s water quality conditions, especially in English, is still lacking. There is therefore a recognized need to provide comprehensive water quality datasets for researchers. The availability of such data will not only benefit researchers but also contribute to the sustainable development and protection of this vital natural resource in Thailand. Making the dataset available in real time, accessible anytime and anywhere, depends on selecting the right technology.
This paper uses Internet of Things (IoT) technology to collect data, monitoring water quality through sensors immersed in the water. Using diverse sensors, the system captures various essential water parameters. The rapid advancement of Wireless Sensor Network (WSN) technology has revolutionized real-time data acquisition, transmission, and processing, allowing users to access sustainable water quality information remotely. IoT has found applications in various fields, including smart cities, smart power grids, smart supply chains, and smart wearables [11]. Although IoT has yet to reach its full potential in environmental applications, it offers immense opportunities. It can be used for detecting forest fires and early earthquakes, reducing air pollution, monitoring greenhouses, preventing landslides, and, most importantly, for water quality monitoring and control systems [12,13,14,15,16,17,18,19,20]. In the twenty-first century, researchers have placed considerable emphasis on monitoring water quality, leading to numerous ongoing projects that explore various aspects of this field. The main objective of these research projects is to create a water quality monitoring system that is both efficient and cost-effective while also providing real-time data. Such a system would integrate wireless sensor networks and the IoT, enabling comprehensive monitoring of water quality parameters [21]. In addition to monitoring systems, another crucial focus is ensuring the availability of datasets for researchers. These datasets are essential for developing artificial intelligence models aimed at predicting and preventing disasters related to water quality.
Therefore, this study presents a dataset of daily river water quality measurements collected from six stations along the Chao Phraya River in Thailand. The dataset was obtained using IoT technology, specifically the Eureka Water Probe Manta +35 sensors deployed at each station, enabling accurate real-time monitoring of river water conditions. The Eureka Manta water quality multiprobe underwent rigorous testing at the U.S. Geological Survey (USGS) Hydrologic Instrumentation Facility to assess its accuracy and compliance with standards, including ISO 7027 [22] for measuring turbidity and Standard Methods 2510 B for correcting specific conductance. The results demonstrated that the Manta met the criteria outlined in the USGS National Field Manual for continuous water quality monitors, covering parameters such as dissolved oxygen and turbidity.
The sensor measurements cover Turbidity (TURB_NTU), Optical Dissolved Oxygen (HDO), Dissolved Oxygen Saturation (DO_SAT), Specific Conductivity (SPCOND), Acidity/Basicity (pH), Total Dissolved Solids (TDS), Salinity (SALINITY), Temperature (TEMP), Chlorophyll (CHL), and Depth (DEPTH). Water conditions are recorded and stored in a MySQL database at 10-min intervals. The dataset can serve a wide range of applications, including trend analysis, heat flux calculations, calibration and validation of process-based water temperature models, establishing baseline conditions for future climate projects, analyzing climate drivers, assessing impacts on ecosystem health, and evaluating water quality.
Previous studies have collected river water quality data with varying recording parameters, timings, and locations, as summarized in Table 1, which highlights the importance of this updated study. Hence, the specific objectives of this study are as follows: (1) to collect a novel dataset on water quality in Thailand, specifically at six points along the Chao Phraya River, and (2) to employ the Long Short-Term Memory (LSTM) model to evaluate the quality of the proposed dataset.
2. Dataset Description
The collected data contain various kinds of information, such as error logs, wipe schedules, and sensor logs, so the data are first filtered to separate the sensor logs from the rest. The data obtained from the database are MySQL (.sql) files, which are filtered and converted into Comma Separated Values (.csv) files, one per station. The files are named “XX Logs.csv”, where XX is the station ID, to make them easy to categorize by station. There are six CSV dataset files: s1 Logs.csv, s2 Logs.csv, s3 Logs.csv, s4 Logs.csv, s5 Logs.csv, and s15 Logs.csv. In the dataset, the comma (,) separates columns, while the dot (.) is the decimal separator. The first row of each CSV file contains the titles of the data columns, which are shown in Figure 1. The distribution of the dataset for station 1 is illustrated in Figure 2a–j. Lower turbidity (NTU) means clearer water, while a higher optical dissolved oxygen (HDO) level is desirable. Lower values of Specific Conductivity (SPCOND) indicate less saltiness, and the pH of water should ideally be between 6.5 and 8.5. Total dissolved solids (TDS) below 1000 are preferable. Salinity represents the dissolved salt content of a body of water, and the temperature typically falls within the range of 43 to 68 degrees Fahrenheit.
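As a minimal sketch of how a per-station file with the layout described above (header row, comma-separated columns, dot decimals) might be read, the following Python snippet uses only the standard library. The helper name `read_station_log`, the column names `id`, `station_id`, `turb_ntu`, and `ph`, and the two sample rows are illustrative assumptions; the actual column titles are those shown in Figure 1.

```python
import csv
import io


def read_station_log(handle, numeric_columns):
    """Read one station CSV (header row, comma-separated, dot decimals).

    Empty cells in numeric columns become None so that gaps stay visible.
    """
    rows = []
    for record in csv.DictReader(handle):
        for name in numeric_columns:
            text = (record.get(name) or "").strip()
            record[name] = float(text) if text else None
        rows.append(record)
    return rows


# Hypothetical two-row sample standing in for a file such as "s1 Logs.csv".
sample = io.StringIO(
    "id,station_id,turb_ntu,ph\n"
    "1,s1,35.2,7.4\n"
    "2,s1,,7.5\n"
)
logs = read_station_log(sample, numeric_columns=["turb_ntu", "ph"])
```

For a real file, the same function could be called on a handle opened with `open("s1 Logs.csv", newline="")`. Keeping missing readings as `None` rather than zero avoids confusing a lost value with a genuine measurement.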
The dataset consists of 16 columns, with ID as the primary identifier and station_id as the marker for each station. Table 2 describes the water quality parameters collected from the sensors at each station. The dataset still contains noise in the form of values lost when the internet connection dropped, so data cleansing is carried out using formula (1), where $x_i$ is the noise data, and $x_{i-1}$ and $x_{i+1}$ correspond to the previous and next valid measurements relative to the missing data point $i$.
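Formula (1) is not reproduced in this text. One common rule consistent with the description is to replace a lost value with the mean of the previous and next valid measurements. The sketch below assumes that rule; the function name `fill_gaps` and the sample series are hypothetical, and the actual cleansing formula may differ.

```python
def fill_gaps(values):
    """Replace None entries with the mean of the nearest valid neighbours.

    Assumption: a gap is filled as (previous valid + next valid) / 2.
    Gaps at the start or end, which lack one neighbour, are left as None.
    """
    filled = list(values)
    for i, value in enumerate(values):
        if value is not None:
            continue
        # Search the ORIGINAL series backwards and forwards for valid points.
        prev_valid = next((v for v in reversed(values[:i]) if v is not None), None)
        next_valid = next((v for v in values[i + 1:] if v is not None), None)
        if prev_valid is not None and next_valid is not None:
            filled[i] = (prev_valid + next_valid) / 2
    return filled


# Made-up turbidity readings with lost values marked as None.
turbidity = [30.0, None, 34.0, None, None, 40.0, None]
cleaned = fill_gaps(turbidity)
```

Searching the original series instead of the partially filled one means that consecutive gaps all receive the same value, derived only from real measurements.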
Data preprocessing is an important part of the data analysis and machine learning pipeline. It involves identifying and correcting errors, inconsistencies, and inaccuracies in a dataset to improve its quality and reliability. Here, the datasets were collected from six distinct water stations, which introduced inconsistencies in the data formats. To address the issue of missing data, the standard data range for each parameter is outlined in Table 2. This step ensures that the dataset remains consistent and reliable for further analysis.
Table 3 shows the sum, mean, standard deviation, minimum value, and maximum value for Station 1 after the data preprocessing step. The correlation between the collected sensor parameters is presented in Figure 3. One important relationship is between hdo_sat and hdo, which are closely correlated because dissolved oxygen saturation is the dissolved oxygen value in mg/L converted to a percentage (%). Additionally, spcond is closely correlated with both tds and salinity, indicating their interdependence. Interestingly, tds, salinity, and spcond are negatively correlated with both hdo and hdo_sat, suggesting that an increase in these variables may be accompanied by a decrease in water quality. On the other hand, turb_ntu, pH, chl, and temp exhibit low to medium correlations with the other variables, implying that their impact on water quality may be more nuanced and influenced by additional factors. Understanding these interconnections helps in comprehending the complex dynamics of water quality assessment and management. For more detail, Figure 4 presents a graph showing the correlation between the Specific Conductivity, TDS, and Salinity parameters, as these three variables influence each other. Figure 5 displays the correlation between HDO and HDO Saturation values, where HDO influences HDO Saturation.
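The pairwise correlations discussed above can be illustrated with the Pearson coefficient. The text does not state which coefficient underlies Figure 3, so Pearson is an assumption here, and the readings below are made up only to show a strongly positive and a negative relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical readings, not values from the dataset.
spcond = [200.0, 250.0, 300.0, 350.0]
tds = [130.0, 162.5, 195.0, 227.5]   # a fixed multiple of spcond
hdo = [6.0, 5.5, 5.1, 4.4]           # falls as spcond rises

r_spcond_tds = pearson(spcond, tds)  # close to +1
r_spcond_hdo = pearson(spcond, hdo)  # negative
```

A coefficient near +1 or -1 indicates a nearly linear relationship, which is what one would expect between quantities derived from the same underlying measurement.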
4. Dataset Experiments and Evaluation
In this study, LSTM is used to evaluate the water quality dataset, which consists of time series data. LSTM is an algorithm developed from the Recurrent Neural Network (RNN), designed to address the exploding and vanishing gradient problems that traditional RNNs encounter when information must be retained over long sequences [30]. The significant difference between the standard RNN and the LSTM lies in the structure of the repeating module. A standard RNN module has a simple structure, for example a single tanh layer, whereas the LSTM module has several layers that interact in a specific way [30].
Figure 12 shows the three main parts of the LSTM architecture, namely the Forget, Input, and Output Gates (FG, IG, OG). In calculations (10)–(15), it can be seen that $h_{t-1}$ (the previous output) and $x_t$ (the current input) are the inputs to the FG, IG, Cell Update, and OG at time $t$.
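To make the role of the gates concrete, the following sketch implements one time step of an LSTM cell in its standard textbook form, reduced to a single unit with scalar input and state. Equations (10)–(15) are not reproduced in this text and may use different notation; the function `lstm_step`, the parameter layout, and all weight values are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell (scalar input and state).

    p maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w, u, b = p[name]
        return w * x_t + u * h_prev + b

    f_t = sigmoid(pre("forget"))        # forget gate: share of c_prev to keep
    i_t = sigmoid(pre("input"))         # input gate: share of new content to admit
    g_t = math.tanh(pre("candidate"))   # candidate cell content
    o_t = sigmoid(pre("output"))        # output gate
    c_t = f_t * c_prev + i_t * g_t      # cell update
    h_t = o_t * math.tanh(c_t)          # new output (hidden state)
    return h_t, c_t


# Hypothetical weights, for illustration only.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.4, 0.2, 0.0),
    "candidate": (0.3, 0.3, 0.0),
    "output": (0.6, 0.1, 0.0),
}
h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.1]:
    h, c = lstm_step(x, h, c, params)
```

Every gate reads the same pair of values, the current input and the previous output, which is the point made in the text. In practice each gate has weight matrices instead of scalars, and a deep learning library handles the computation.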
LSTM’s ability to capture long-term dependencies and handle sequential data makes it suitable for analyzing and predicting water quality parameters over time. The dataset used has gone through data preparation and preprocessing. For this experiment and evaluation stage, data obtained at the S1-Sam Lae station were used. The dataset was divided into training and testing sets, as shown in Figure 13. The training dataset spanned from July 2022 to December 2022 (75%), while the testing dataset covered the period from January 2023 to February 2023 (25%).
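The split described above is chronological: earlier observations train the model and later ones test it, so no future information leaks into training. The sketch below shows such a split by date. The function `chronological_split` and the one-record-per-month placeholder data are assumptions for illustration, and the real split is applied to the station records themselves.

```python
from datetime import date


def chronological_split(records, first_test_day):
    """Split (day, value) records by time: earlier days train, later days test."""
    train = [r for r in records if r[0] < first_test_day]
    test = [r for r in records if r[0] >= first_test_day]
    return train, test


# One placeholder record per month in the study period (values are made up).
records = [(date(2022, m, 1), float(m)) for m in range(7, 13)]
records += [(date(2023, m, 1), float(m)) for m in (1, 2)]

train, test = chronological_split(records, first_test_day=date(2023, 1, 1))
train_share = len(train) / len(records)   # 6 of 8 months
```

Six of the eight months fall before January 2023, which is consistent with the 75%/25% proportions quoted in the text.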
The parameter settings used in this study are listed in Table 6, where Adam is used as the optimizer. We assessed the LSTM model’s predictive capability over a 45-day horizon and found that it predicted a 10-day span accurately, which corresponds to an accuracy of 22.2% (10 of 45 days). The error rate is therefore very high and the accuracy correspondingly low, because the model has not yet been fine-tuned. The LSTM model is evaluated on the test data by calculating performance metrics, namely the mean squared error (MSE) and root mean squared error (RMSE), to assess its accuracy in predicting water quality parameters. Statistical results for the turbidity predictions can be seen in Figure 14 and Table 7. MSE and RMSE are in common use and are especially suitable when the underlying data follow a Gaussian distribution, which is assumed in this research. While this choice is reasonable under Gaussian assumptions, real-world datasets may deviate from this ideal distribution. Ref. [31] provides illustrative examples of situations where the data deviate from normality and highlights the importance of considering non-Gaussian behavior in practical applications, particularly in the context of water analysis and risk assessment. This is a possible topic for further research.
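For reference, MSE is the mean of the squared differences between observed and predicted values, and RMSE is its square root, expressed in the same units as the data. The sketch below computes both on made-up turbidity values, which are not results from the study, and also reproduces the 22.2% figure from the 10-of-45-day count.

```python
import math


def mse(actual, predicted):
    """Mean squared error between two equally long series."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the data."""
    return math.sqrt(mse(actual, predicted))


# Made-up turbidity values (NTU), for illustration only.
actual = [30.0, 32.0, 35.0, 31.0]
predicted = [28.0, 33.0, 39.0, 31.0]

error_mse = mse(actual, predicted)     # (4 + 1 + 16 + 0) / 4 = 5.25
error_rmse = rmse(actual, predicted)

# The accuracy figure quoted in the text: 10 accurate days out of 45.
accuracy_pct = 10 / 45 * 100
```

Because the errors are squared, a single large miss (here the 4 NTU error) dominates both metrics, which is one reason they are sensitive to departures from Gaussian behavior.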
5. Conclusions
In this study, a novel dataset has been gathered, comprising observations from six strategically positioned stations along the Chao Phraya River in Thailand. To obtain a comprehensive understanding of the river’s water quality conditions, ten key parameters were recorded using sensors: Turbidity, Optical Dissolved Oxygen, Dissolved Oxygen Saturation, Specific Conductivity, pH, Total Dissolved Solids, Salinity, Temperature, Chlorophyll, and Depth. Lower turbidity (NTU) means clearer water, and higher optical dissolved oxygen (HDO) levels are preferable. Lower conductivity (SPCOND) values indicate reduced saltiness, and a pH between 6.5 and 8.5 is ideal. Total dissolved solids (TDS) below 1000 are preferable, while salinity reflects the dissolved salt content of the water. Additionally, the temperature typically falls within the range of 43 to 68 degrees Fahrenheit. Based on the distribution of the collected dataset, the Chao Phraya River was found to meet water quality standards for some parameters, including HDO, pH, and TDS.
There are also correlations among the Specific Conductivity, TDS, and Salinity parameters, which influence each other. Similarly, the relationship between the HDO and HDO Saturation values indicates that these parameters also influence each other. Further exploration of these correlations is recommended. After the data collection phase, data preprocessing and evaluation were performed on the dataset. Based on the parameters observed in the proposed dataset, the quality of water along the Chao Phraya River is good between August and November, likely due to the rainy season in Thailand. During the evaluation, a deep learning LSTM model was employed, which exhibited suboptimal accuracy in predicting water quality. Nevertheless, this dataset holds considerable potential as a resource for future research on monitoring water quality and establishing early warning systems for pollution-related disasters in Thailand. The insights from this study provide a foundation for advancing the understanding and management of water quality in the region.