Characteristic Wavelength Selection and Surrogate Monitoring for UV–Vis Absorption Spectroscopy-Based Water Quality Sensing

Chen, Chenyu; Luo, Meiyu; Wang, Wenyu; Ping, Yang; Li, Hongming; Chen, Siyuan; Liang, Qian

doi:10.3390/w17030343

by Chenyu Chen ¹, Meiyu Luo ²,³,*, Wenyu Wang ², Yang Ping ¹, Hongming Li ⁴, Siyuan Chen ⁴ and Qian Liang ⁴

¹ Power China Eco-Environmental Group Co., Ltd., Shenzhen 518101, China

² School of Environmental Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China

³ Shenzhen Zhishu Environmental Technology Co., Ltd., Shenzhen 518055, China

⁴ PowerChina Water Environmental Technology Co., Ltd., Shenzhen 518101, China

* Author to whom correspondence should be addressed.

Water 2025, 17(3), 343; https://doi.org/10.3390/w17030343

Submission received: 14 October 2024 / Revised: 23 December 2024 / Accepted: 16 January 2025 / Published: 25 January 2025

(This article belongs to the Special Issue Advancing the Monitoring and Modelling of Freshwater Systems with New Remote Sensing Technologies)

Abstract:

Ultraviolet–visible (UV–Vis) absorption spectroscopy for in situ water quality sensing has garnered increasing attention. However, the selection of characteristic wavelengths for water quality indicators has been underexplored in existing studies, resulting in surrogate monitoring models with low accuracy and high complexity. This research used field data from the Maozhou River in Shenzhen. The surrogate model based on the wavelength selection method was 134.8%, 52.5%, and 13.5% more accurate than the single wavelength method, the PCA method, and the full spectrum method, respectively. We investigated seven characteristic wavelength optimisation algorithms and five machine learning models for surrogate monitoring of five water quality indicators: TOC, BOD₅, COD, TN, and NO₃-N. The results indicate that the competitive adaptive reweighted sampling (CARS) method for wavelength selection, combined with ridge regression as a surrogate monitoring model, achieved the best performance in this study. The determination coefficients (R²) of the five water quality indicators were 0.80, 0.64, 0.82, 0.97, and 0.96, respectively. The study shows that for watersheds with relatively stable water chemical components, there is no need to use overly complex nonlinear models, and a regression model with characteristic wavelength selection can achieve good prediction results. This study provides detailed technical information on spectral surrogate monitoring of river water quality, offering an important practical reference.

Keywords:

competitive adaptive reweight sampling method; machine learning model; Maozhou River; spectrometry water quality monitoring; wavelength selection

1. Introduction

Water quality monitoring is essential for the protection and management of aquatic ecosystems. Traditional water quality monitoring methods, including chemical methods, electrode methods, and chemical spectrometry, offer high measurement accuracy but suffer from drawbacks, such as long testing times, reagent waste and pollution [1]. With the development of sensor technology such as spectroscopy, hyperspectral methods, biological methods, and fluorescence methods, new water quality detection equipment is emerging. Among them, UV–Vis spectroscopy detection technology has the advantages of rapid detection, multifunctionality, low operational costs, and the ability to perform real-time online monitoring compared to other detection methods [2,3].

Research on UV–Vis spectroscopy for the quantitative analysis of water quality indicators dates back to the mid-20th century. In particular, Bastian (1957) demonstrated that the concentration and absorbance of nitrate measured in dilute perchloric acid solutions showed a strong linear relationship at wavelengths of 203 nm, 210 nm, and 220 nm [4]. At the same time, it was found that nitrate has almost no absorption at 275 nm. The nitrate content can be determined more accurately by deducting the absorption at 275 nm [5,6]. Ogura (1968) found a strong correlation between the absorbance at 220 nm (E220) and Chemical Oxygen Demand (COD) in samples of water collected near the coast [7]. Five-day Biochemical Oxygen Demand (BOD₅) is difficult to determine from online monitoring, but it can be estimated from COD and the UV–Vis spectra at 260 nm and 550 nm [8].

The detection principle of UV–Vis spectroscopy is that the substances behind different water quality indicators absorb light at specific wavelengths of the UV–Vis spectrum. The Beer–Lambert law is used to relate the concentration of a water quality indicator to the absorbance and thereby determine the concentration. For a pure solution, the absorbance is directly proportional to the concentration of the sample, so the concentration can be obtained quantitatively from the absorbance at the corresponding absorption wavelength. For natural water bodies, interference from other factors, such as turbidity and pH, means that machine learning surrogate monitoring models are needed to predict the concentration. However, it is not easy to establish the relationship between absorbance and concentration because of the accuracy limits of spectrometers and the interference of various complex substances in natural water [9]. In recent years, advances in algorithms and computer technology, together with improvements in the accuracy of spectral detection instruments, have promoted the research and construction of spectral analytical models for COD, total organic carbon (TOC), BOD₅, and nitrate.
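For reference, the Beer–Lambert relationship mentioned above can be written as follows. The symbols are notation introduced here for illustration and are not taken from the original text: A is the absorbance at wavelength λ, I₀ and I are the incident and transmitted intensities, ε is the molar absorption coefficient, l is the optical path length, and c is the concentration.

\begin{matrix} A(\lambda) = \log_{10} \frac{I_{0}(\lambda)}{I(\lambda)} = \varepsilon(\lambda) \, l \, c \end{matrix}

For a mixture of non-interacting absorbers, the contributions are usually assumed to add, which is one way to see why a single wavelength is rarely sufficient in natural water:

\begin{matrix} A(\lambda) = l \sum_{i} \varepsilon_{i}(\lambda) \, c_{i} \end{matrix}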

The main algorithmic models for the relationship between UV–Vis spectra and water quality concentrations are linear regression (LR), partial least squares (PLS), artificial neural network (ANN), and support vector machine (SVM) models, or their improved versions. For example, Verma used an ANN to predict COD and BOD₅ by combining spectral information with the water quality indicators pH, DO, and TSS measured in situ [10]. Lepot compared LR, PLS, SVM, and the evolutionary algorithm method (EVO). For establishing a link between spectral data and water quality indicator concentrations, PLS was the best, SVM was the next best, and EVO and LR were the worst; for predicting concentrations, however, EVO and LR were the best [11].

Most existing studies have focused on COD, BOD₅, nitrate nitrogen (NO₃-N), etc., and a large number of relationships between UV–Vis spectral data and COD, BOD₅, and NO₃-N concentrations have been constructed, while little research has been conducted on other water quality indicators.

Most of the algorithm models are based on PLS, SVM, ANN, and their improved versions. Because water bodies differ, it is difficult to find a generic optimal model for most of them. Previous studies mainly focused on predictive models. Although single-wavelength selection for water quality monitoring has been studied extensively, multi-wavelength selection by advanced algorithms has hardly been investigated [11]. Principal Component Analysis (PCA) dimensionality reduction is often used instead of selecting effective characteristic wavelengths for specific water quality indicators, which reduces the interpretability of the model input [12]. However, not all wavelengths in the spectral data are related to the water quality indicators. Inputting the entire spectrum may result in inaccuracies because the absorption characteristics of different substances lead to distinct characteristic wavelengths for different water quality indicators. Inputting too much information also increases the data dimension, which causes the curse of dimensionality: the data distribution becomes very sparse and the model accuracy decreases [13,14].

Therefore, characteristic wavelengths should be selected. This removes some irrelevant variables and improves the accuracy and interpretability of the surrogate monitoring model [15]. For example, in the prediction of NO₃-N concentrations, the wavelengths associated with N-O and N=O were treated as the main informative variables, while other variables were considered uninformative or confounding.

In most studies, information about the selected wavelengths (both their number and locations) is missing, as the published papers do not report which part of the spectrum was used [11]. There is also a preference for more complex models, which adds to the complexity of surrogate monitoring [8,12,16]. However, for rivers that are chemically stable and have simple compositions, such models are unnecessary. The selection of appropriate models for such rivers should be explored.

The purpose of this paper is to select characteristic wavelengths for five water quality monitoring indicators: TOC, BOD₅, COD, total nitrogen (TN), and NO₃-N. By comparing the advantages and disadvantages of the seven characteristic wavelength selection methods, the optimal selection method is identified, and the selected characteristic wavelengths are explained. Using the preferred wavelengths for model training, a comparison of five statistical learning algorithms is conducted. Furthermore, based on the optimal wavelength selection method and prediction model, we compare the difficulty of predicting the five water quality indicators and evaluate the differences in model performance at various points along the river. The study focuses on the Shenzhen Maozhou River, a typical urban river fed by rainfall, with its ecological base flow in the dry season primarily supplied by the discharge from a sewage treatment plant, resulting in relatively stable water chemistry. This river serves as a typical example among many urban rivers. Finally, based on the findings, the paper discusses the limitations in practical applications and outlines expectations for future improvements.

2. Materials and Methods

2.1. Materials

2.1.1. Spectrum Detection Platform Construction

A platform for UV–Vis spectrum water quality surrogate monitoring was assembled (Figure 1a). This platform is primarily divided into the following parts: light source (PXH-5W xenon lamp light, made in Shenzhen, China from Shenzhen Longmeida Company); spectrometer (Ocean Optics USB2000+ microfibre spectrometer, made in New York, NY, USA from Ocean Optics Company); probe (TP300, an immersion probe that measures absorbance from 200 to 750 nm with a step size of 0.4 nm, made in Shenzhen, China from Shenzhen Longmeida Company); and processor (Asus laptop, made in Shenzhen, China from Asus Company). The structural design for industrial applications was also performed (Figure 1b).

The spectrometer should be calibrated before measurement. The first step in the calibration is to obtain the dark spectrum, then to determine the reference spectrum, using deionised water in the presence of a light source, and finally to carry out the measurement of the actual water sample.
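The three calibration steps above map onto the usual conversion from raw intensity counts to absorbance. The following is a minimal sketch of that conversion, assuming the standard dark-corrected transmittance formula; the function name `absorbance`, the clipping threshold `eps`, and the example counts are hypothetical and do not come from the original platform software.

```python
import numpy as np

def absorbance(sample, reference, dark, eps=1e-12):
    """Convert raw intensity counts to absorbance, A = -log10(T)."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    dark = np.asarray(dark, dtype=float)
    # Transmittance: dark-corrected sample over dark-corrected reference.
    transmittance = (sample - dark) / np.maximum(reference - dark, eps)
    # Clip so that saturated or noisy pixels never reach log10(<= 0).
    return -np.log10(np.clip(transmittance, eps, None))

# Hypothetical counts at three wavelengths (assumed values, not measured data).
dark = np.array([100.0, 100.0, 100.0])          # light source blocked
reference = np.array([1100.0, 2100.0, 3100.0])  # deionised water
sample = np.array([200.0, 1100.0, 3100.0])      # river water
A = absorbance(sample, reference, dark)
print(A)
```

In this example the transmittances are 0.1, 0.5, and 1.0, so the absorbances are 1, about 0.301, and 0.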

2.1.2. Pure Solution Preparation

Pure solutions were prepared to obtain the corresponding spectral data, which were then linearly fitted against concentration. The results of the linear fitting are used as a reference for the results predicted by the model. According to the “National Environmental Quality Standard for Surface Water” (GB3838-2002) and the “Technical Specification for Automatic Surface Water Monitoring” (HJ 915-2017) by the Ministry of Ecology and Environment of the People’s Republic of China, five indicators (TOC, BOD₅, COD, TN, and NO₃-N) were determined [17,18]. According to the “National Environmental Quality Standard for Surface Water”, the standard substance for BOD₅ is a mixed solution of glucose and glutamic acid. The standard substance for COD and TOC is potassium hydrogen phthalate (KHP), and the standard substance for NO₃-N and TN is potassium nitrate. The preparation method is described in Supplementary Information S1. The linear fit between the spectral data and the concentration can be performed in segments, with good results. The pure solution results are shown in Supplementary Information S2.

2.1.3. River Water Samples

The river water samples were collected from the surface water in the Guangming District section of the Maozhou River in Shenzhen (Figure 2). A total of 29 water samples were collected at 27 points, with three samples taken in parallel at point S6. A portion of the samples was determined on the same day after collection using the UV–Vis monitoring platform. Water samples used for the spectral measurement are unfiltered and measured directly. A portion of the samples was sent to the testing institutions, where concentrations were determined using standard methods.

2.2. Methods

2.2.1. Denoising of Spectral Data

The UV–Vis spectral data obtained include not only characteristic information for measurement but also irrelevant information, manifested as spikes and fluctuations in certain bands and commonly referred to as noise. This noise must be removed before modelling. Common methods for dealing with noise include smoothing denoising [19] and wavelet denoising [20,21]. This paper uses the moving average method to reduce noise and uses the signal-to-noise ratio (SNR) and peak signal-to-noise ratio (PSNR) to evaluate the denoising effect. The comparison of the absorption spectrum before and after noise reduction is shown in Figure 3. Following noise reduction, the data are normalised to reduce order-of-magnitude differences in the data and improve the convergence speed of the model [22].
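The preprocessing chain described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the window length of 5, the SNR and PSNR conventions (the removed component `raw - denoised` is treated as noise), and min-max scaling as the normalisation are all assumptions, since the text does not specify them. The spectrum is synthetic.

```python
import numpy as np

def moving_average(spectrum, window=5):
    """Smooth a 1-D spectrum with a centred moving average (odd window)."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the filter is centred")
    spectrum = np.asarray(spectrum, dtype=float)
    half = window // 2
    # Repeat the edge values so the output keeps the input length.
    padded = np.pad(spectrum, half, mode="edge")
    return np.convolve(padded, np.ones(window) / window, mode="valid")

def snr_db(raw, denoised):
    """SNR in dB, treating (raw - denoised) as the removed noise."""
    raw, denoised = np.asarray(raw, float), np.asarray(denoised, float)
    return 10.0 * np.log10(np.sum(raw ** 2) / np.sum((raw - denoised) ** 2))

def psnr_db(raw, denoised):
    """PSNR in dB, using the peak of the raw spectrum as the reference."""
    raw, denoised = np.asarray(raw, float), np.asarray(denoised, float)
    mse = np.mean((raw - denoised) ** 2)
    return 10.0 * np.log10(np.max(raw) ** 2 / mse)

def min_max_normalise(X):
    """Scale each wavelength (column) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (X - lo) / span

# Synthetic example: one broad absorption band plus Gaussian noise.
rng = np.random.default_rng(0)
wavelengths = np.linspace(200.0, 750.0, 500)
clean = np.exp(-((wavelengths - 220.0) ** 2) / (2 * 20.0 ** 2))
raw = clean + rng.normal(0.0, 0.02, wavelengths.size)
smooth = moving_average(raw, window=5)
print(f"SNR = {snr_db(raw, smooth):.1f} dB, PSNR = {psnr_db(raw, smooth):.1f} dB")
```

A wider window removes more noise but also flattens narrow absorption features, so the window length is a trade-off that would need to be checked against the real spectra.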

2.2.2. Candidate Algorithms of Characteristic Wavelength Selection

The wavelength selection analysis is performed after denoising the data. Based on the literature reports, seven wavelength selection methods are selected in this paper.

(1) Extremum method

Extreme points contain important information. In the spectrum, an extreme point indicates that the water sample absorbs strongly at that wavelength, and the absorption tends to increase with increasing concentration. Therefore, after the extreme points of each sample were calculated, mode statistics were carried out, and the ten most frequent wavelengths were selected as the final characteristic wavelengths.

(2) Correlation coefficient method

This method calculates the correlation coefficient between absorbance and the water quality indicator at each wavelength. The larger the correlation coefficient, the stronger the correlation between the wavelength and the water quality indicator. The wavelengths corresponding to the ten largest correlation coefficients were selected as the final characteristic wavelengths for each water quality indicator.
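A minimal sketch of this ranking is given below. It assumes Pearson correlation and ranks by absolute value, so that strongly negative correlations also count; the text does not state either choice. The function name `top_k_by_correlation` and the synthetic data are hypothetical.

```python
import numpy as np

def top_k_by_correlation(X, y, wavelengths, k=10):
    """Return the k wavelengths whose absorbance correlates most with y."""
    X = np.asarray(X, dtype=float)   # shape (n_samples, n_wavelengths)
    y = np.asarray(y, dtype=float)   # shape (n_samples,)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Columns with zero variance get a correlation of zero.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))[:k]
    return np.asarray(wavelengths)[order], r[order]

# Synthetic data: 29 samples, 50 wavelengths, one informative band.
rng = np.random.default_rng(1)
wavelengths = np.linspace(200.0, 750.0, 50)
y = rng.uniform(0.5, 5.0, size=29)                   # hypothetical concentrations
X = rng.normal(0.0, 1.0, size=(29, 50))              # uninformative absorbances
X[:, 7] = 2.0 * y + rng.normal(0.0, 0.01, size=29)   # informative band
selected, r = top_k_by_correlation(X, y, wavelengths, k=10)
print(selected[0], r[0])
```

Because neighbouring wavelengths in a real spectrum are highly correlated with each other, the ten top-ranked wavelengths tend to cluster in one band, which is one limitation of this method.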

(3) Linear regression coefficient method

This method establishes an ordinary linear regression equation between the full-spectrum wavelength data and the water quality indicator concentration, fitted by the least squares method. The linear regression coefficient between the absorbance at each wavelength and the water quality indicator concentration was obtained, and the ten wavelengths with the largest coefficients were selected as the characteristic wavelengths.

(4) Successive projections algorithm

The successive projections algorithm (SPA) is a forward variable selection method. The SPA uses projective analysis of vectors: each candidate vector is projected onto the others, and the magnitudes of the projections are compared. The vector with the largest projection is selected as the candidate vector, and the final variables are then selected according to the calibration model. The SPA can eliminate the influence of collinearity among variables and select the variable combination with the least redundant information [23], and it is widely used in the selection of characteristic wavelengths from spectral data. In this study, the projection magnitudes of the vectors are analysed to screen the characteristic wavelengths, and the final wavelengths are determined by calculating the RMSE. The wavelengths corresponding to the minimum RMSE are the selected characteristic wavelengths.
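The projection phase of the SPA can be sketched as below. This covers only the construction of one candidate chain from a given starting wavelength; the subsequent step of comparing chains by RMSE is omitted. The function name `spa_chain` and the synthetic data are hypothetical, and the sketch follows the commonly described form of the algorithm rather than the authors' implementation.

```python
import numpy as np

def spa_chain(X, start, n_select):
    """Build one SPA chain of column indices starting from column `start`."""
    Xp = np.asarray(X, dtype=float)
    Xp = Xp - Xp.mean(axis=0)            # work on a centred copy
    selected = [start]
    for _ in range(n_select - 1):
        v = Xp[:, selected[-1]].copy()
        # Project every column onto the subspace orthogonal to v.
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0           # never pick a column twice
        # The column with the largest remaining norm is least collinear.
        selected.append(int(np.argmax(norms)))
    return selected

# Synthetic data in which column 1 is perfectly collinear with column 0.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 5))
X[:, 1] = 2.0 * X[:, 0]
chain = spa_chain(X, start=0, n_select=3)
print(chain)
```

After column 0 is selected, the projection reduces its collinear copy (column 1) to numerically zero, so it is not chosen; this is how the SPA limits redundant information.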

(5) Competitive adaptive reweighted sampling

Competitive adaptive reweighted sampling (CARS) is a variable selection method that retains the variables with large absolute regression coefficients in multivariate linear regression models such as PLS [24,25]. It has also been applied to the wavelength selection of spectra. First, N = 50 is selected as the number of Monte Carlo sampling runs. In each sampling run, a partial least squares model is established on randomly selected samples. In this study, 4-fold cross-validation is adopted. Then, the regression coefficients of each wavelength in the partial least squares model are calculated, and the wavelengths whose coefficients have smaller absolute values are removed using an exponentially decreasing function. Based on the remaining wavelengths, a new partial least squares model is established after adaptive reweighted sampling.
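The CARS loop described above can be sketched in plain NumPy as follows. N = 50 runs and 4-fold cross-validation follow the text; everything else is an assumption made for illustration: the hand-written PLS1 (NIPALS) routine, two latent components, an 80% Monte Carlo sampling ratio, the usual form of the exponentially decreasing function, and the synthetic data. It is not the authors' implementation.

```python
import numpy as np

def pls1_coef(X, y, n_components):
    """PLS1 regression coefficients (NIPALS) for centred X and y."""
    Xk = X - X.mean(axis=0)
    yk = y - y.mean()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left to explain
            break
        w = w / norm
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        qk = yk @ t / tt
        Xk = Xk - np.outer(t, p)      # deflate X and y
        yk = yk - qk * t
        W.append(w)
        P.append(p)
        q.append(qk)
    if not W:
        return np.zeros(X.shape[1])
    W, P = np.array(W).T, np.array(P).T
    return W @ np.linalg.solve(P.T @ W, np.array(q))

def cv_rmse(X, y, n_components, n_folds, rng):
    """Root mean square error of k-fold cross-validation for PLS1."""
    idx = rng.permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        b = pls1_coef(X[train], y[train], n_components)
        pred = (X[fold] - X[train].mean(axis=0)) @ b + y[train].mean()
        sq_err += np.sum((y[fold] - pred) ** 2)
    return float(np.sqrt(sq_err / len(y)))

def cars(X, y, n_runs=50, n_components=2, n_folds=4, sample_ratio=0.8, seed=0):
    """Return the wavelength subset with the lowest cross-validated RMSE."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Exponentially decreasing function: ratio 1 at run 1, 2/p at run N.
    a = (p / 2) ** (1 / (n_runs - 1))
    k = np.log(p / 2) / (n_runs - 1)
    retained = np.arange(p)
    best_rmse, best_subset = np.inf, retained
    for i in range(1, n_runs + 1):
        # Monte Carlo sampling of calibration samples.
        rows = rng.choice(n, size=int(sample_ratio * n), replace=False)
        ncomp = min(n_components, len(retained))
        weights = np.abs(pls1_coef(X[rows][:, retained], y[rows], ncomp))
        if weights.sum() == 0:
            break
        # Forced reduction: keep the wavelengths with the largest |coefficient|.
        n_keep = min(len(retained), max(1, int(round(a * np.exp(-k * i) * p))))
        order = np.argsort(-weights)[:n_keep]
        w = weights[order]
        # Adaptive reweighted sampling: draw with probability ~ |coefficient|.
        drawn = rng.choice(n_keep, size=n_keep, replace=True, p=w / w.sum())
        retained = np.sort(retained[order][np.unique(drawn)])
        rmse = cv_rmse(X[:, retained], y, min(n_components, len(retained)), n_folds, rng)
        if rmse < best_rmse:
            best_rmse, best_subset = rmse, retained
    return best_subset, best_rmse

# Synthetic data: 29 samples, 60 wavelengths, two informative columns.
rng = np.random.default_rng(3)
X = rng.normal(size=(29, 60))
y = 3.0 * X[:, 10] + 3.0 * X[:, 11] + rng.normal(0.0, 0.1, size=29)
subset, rmse = cars(X, y)
print(len(subset), rmse)
```

Because both the Monte Carlo sampling and the reweighted sampling are random, different seeds can return different subsets, so in practice the selection is often repeated and the stability of the chosen wavelengths is checked.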

(6) Interval partial least squares method

Each sample spectrum contains 1516 wavelength points, and the wavelengths are divided into a number of intervals of approximately equal length. A partial least squares model [26,27] is established in each interval to calculate the RMSE of each interval, and the minimum RMSE value and corresponding interval are recorded. The interval size is then changed and the above operation is repeated; the minimum RMSE values for the different interval sizes are compared, and the smallest is recorded. The corresponding interval is the selected interval, and the wavelengths in that interval are selected as the characteristic wavelengths.

(7) Mixing method

Any single wavelength selection method has certain limitations, especially the first three, each of which relies on a single criterion. Therefore, the extremum method, correlation coefficient method, and linear regression coefficient method are mixed, and all the wavelengths selected by the three methods are taken as the characteristic wavelengths.

2.2.3. Spectral Surrogate Monitoring Statistical Learning Methods

The characteristic wavelengths selected by CARS were used as model inputs. Linear regression (LR), ridge regression (RR), polynomial regression (PR), partial least squares regression (PLSR), support vector regression (SVR), and artificial neural network (ANN) models were compared. Twenty repetitions of 4-fold cross-validation were used to divide the training and testing sets, and the coefficient of determination (R²) was used as the evaluation metric for model comparison. A brief introduction to the six models is given below.

(1) Linear regression

LR is a statistical analysis method that uses regression analysis in mathematical statistics to determine the quantitative interdependent relationships between two or more variables [28], with the following basic expressions:

\begin{matrix} \hat{y}(w, x) = w_{0} + w_{1} x_{1} + \dots + w_{n} x_{n} \end{matrix}

(1)

PR is a kind of linear regression. The main difference between PR and ordinary linear regression is that there are terms of independent variables with degrees greater than 1. The number of terms has a great impact on the accuracy of the model [29].

RR is a biased estimation regression method dedicated to linear data analysis. Compared with the traditional least squares method, RR adds a regularisation term weighted by α. When the number of training samples is smaller than the number of independent variables, overfitting is relatively serious, and RR can effectively alleviate this problem [30,31].

\begin{matrix} \min_{w} \| X w - y \|_{2}^{2} + \alpha \| w \|_{2}^{2} \end{matrix}

(2)

where α is the penalty parameter (≥0).
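The objective in Equation (2) has a closed-form minimiser, w = (XᵀX + αI)⁻¹Xᵀy, which the sketch below implements. Centring X and y to obtain an unpenalised intercept (the w₀ of Equation (1)) is an assumption added here, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Minimise ||Xw - y||^2 + alpha * ||w||^2 with an unpenalised intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Closed-form solution of the penalised normal equations.
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def ridge_predict(X, w, intercept):
    """Predict concentrations from absorbances."""
    return np.asarray(X, dtype=float) @ w + intercept

# Synthetic case with fewer samples (10) than variables (30).
rng = np.random.default_rng(4)
X = rng.normal(size=(10, 30))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.05, size=10)
w_small, b_small = ridge_fit(X, y, alpha=0.1)
w_large, b_large = ridge_fit(X, y, alpha=10.0)
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

With fewer samples than variables, ordinary least squares has no unique solution, whereas the α term makes the system solvable; a larger α shrinks the coefficients further.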

(2) Partial least squares regression

PLSR is a generalisation of partial least squares in solving regression problems. PLSR summarises the variables in the X matrix as a small set of orthogonal linear latent variables by maximising the covariance between the X matrix and the response variable y. The complexity of the model is controlled by optimising the number of latent variables, and thus overfitting can be minimised [32].

(3) Support vector regression

SVR was first published in 1995 and is a generalisation of support vector machines in solving regression problems [33,34]. The algorithm has many advantages in solving small sample, nonlinear and high-dimensional pattern recognition and can be generalised to other machine learning problems such as function fitting.

(4) Artificial neural network

The BP (back propagation) neural network, proposed by a group of scientists led by Rumelhart and McClelland in 1986, is a multilayer feedforward neural network trained with the error backpropagation algorithm. In terms of structure, the BP network has an input layer, a hidden layer, and an output layer. In essence, the BP algorithm takes the squared network error as the objective function and uses gradient descent to find its minimum [35]. The multilayer perceptron regression model is used in this study. It is built using MLPRegressor in Sklearn, a machine learning library for Python 3.8 [36].

2.2.4. Evaluation Methodology

To compare the advantages and disadvantages of the seven characteristic wavelength selection methods, model predictions are carried out using the PLS model. The selected wavelengths are used as inputs for predicting the concentration of water quality indicators, with the coefficient of determination (R²) and root mean square error (RMSE) serving as evaluation metrics [11]. Under the optimal wavelength selection method, each characteristic wavelength band is obtained as an input parameter for the statistical model of spectral surrogate monitoring. The comparison results of seven wavelength selection methods and five model prediction methods are presented in the form of box plots. Under the optimal prediction model method, the differences in the prediction results of the five water quality indicators are compared, with the determination coefficient and RMSE used as evaluation metrics. The distribution of the mean value of the determination coefficient is shown in a box plot. Spatially, the model is used to conduct variance analysis at different points, and the results are compared and discussed. “The Technical Specification for Acceptance of Water Pollution Source Online Monitoring System (COD_Cr, NH₃-N, etc.)” (HJ 354-2019) was used for further practical assessment [37].
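The two evaluation metrics and the repeated 4-fold splitting used for model comparison can be sketched as follows. The definitions of R² and RMSE are the standard ones; the helper names, the ordinary least squares placeholder model, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def repeated_kfold(n_samples, n_splits=4, n_repeats=20, seed=0):
    """Yield (train, test) index arrays for repeated k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        idx = rng.permutation(n_samples)
        for test_idx in np.array_split(idx, n_splits):
            yield np.setdiff1d(idx, test_idx), test_idx

# Synthetic data with 29 samples; least squares stands in for the real model.
rng = np.random.default_rng(5)
X = rng.normal(size=(29, 3))
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(0.0, 0.1, size=29)
scores = []
for train, test in repeated_kfold(len(y)):
    A = np.column_stack([np.ones(len(train)), X[train]])
    coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    pred = np.column_stack([np.ones(len(test)), X[test]]) @ coef
    scores.append(r2_score(y[test], pred))
print(f"mean R2 over {len(scores)} folds: {np.mean(scores):.3f}")
```

Twenty repetitions of four folds give 80 test-set scores per model, which is the kind of distribution a box plot can summarise.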

3. Results and Discussion

3.1. Characteristic Wavelength Selection

To demonstrate the advantages of wavelength selection algorithms over other methods, we compared several wavelength processing methods for surrogate monitoring of NO₃-N: the single wavelength at 220 nm, the PCA method with five components, the CARS method (which selected 45 wavelengths), and the full spectrum method. We then used the RR model as the surrogate model to predict water quality indicator concentrations. The accuracy of the surrogate model based on the wavelength selection method is 134.8%, 52.5%, and 13.5% higher than that of the single wavelength method, the PCA method, and the full spectrum method, respectively, as shown in Figure 4.

It is clear that wavelength selection methods play a crucial role in improving accuracy. Figure 5 shows the comparison results of the characteristic wavelength optimisation algorithms for the five water quality indicators: TOC, BOD₅, COD, TN, and NO₃-N. The competitive adaptive reweighted sampling (CARS) method demonstrates a significant advantage and is selected as the final characteristic wavelength optimisation algorithm. The numbers of wavelengths selected for the five water quality indicators were 88, 59, 59, 59, and 45, respectively. Figure 6 shows the distribution of the characteristic wavelengths selected by CARS for water quality indicator prediction. These wavelengths are used as inputs for the prediction model, which forms the basis for constructing a machine learning model for UV–Vis spectral surrogate monitoring of water quality. The results of the optimal parameter selection for the CARS-based water quality indicator prediction model used in the study are provided in Supplementary Information S3.

The selected wavelengths for different water quality indicators are shown in Figure 7, which demonstrates that most of the selected wavelengths are concentrated in the 200–320 nm and 600–750 nm ranges. TN, for example, shows a strong preference for wavelengths near 220 nm, confirming that 220 nm is the characteristic wavelength of nitrate nitrogen. Wavelengths near 320 nm are selected to reduce the interference of organic matter, while wavelengths near 500 nm are selected to remove colour interference. The 600–750 nm range is chosen to mitigate turbidity interference. Based on the number of selected wavelengths, turbidity interference is considered slightly more significant than colour interference. In addition, TOC, BOD₅, and COD exhibit a preference for characteristic wavelengths around 240 nm, which indicates that the organic matter contained in the water sample may have more conjugated double bonds. When wavelengths around 320 nm are more strongly preferred, it indicates a higher presence of conjugated carbonyl or carbonyl groups in the organic matter.

Broeke et al. reported that NO₃/NO₂ shows the largest absorbance between 200 and 250 nm, that COD exhibits the largest absorbance between 250 and 380 nm, that some wavelengths in the 380–450 nm range can be selected to reduce colour interference, and that the 400–750 nm range is selected to mitigate interference from TSS and turbidity [38]. These results are consistent with the characteristic wavelength selection in this study, which provides more detailed insights into the selection of characteristic wavelengths. The results further validate our characteristic wavelength selection algorithm, demonstrating that reducing the dimensionality of wavelength data with the characteristic wavelength optimisation method enhances our understanding of the characteristic wavelengths corresponding to each substance, something that is difficult to achieve with PCA or regularisation methods [39].

3.2. Comparison of Surrogate Monitoring Algorithms and Screening of Optimal Algorithms

LR, RR, PR, PLSR, SVR, and ANN were used to predict the concentrations of the water quality indicators. The final prediction results of each model for each water quality indicator are shown in Figure 8. The results indicate that for TOC, the performance of each method is relatively similar, with RR yielding the best results. The R² ranges from 0.74 to 0.86, with a maximum value of 0.92, demonstrating strong prediction accuracy. For BOD₅, the RR and SVR methods performed better, with RR providing the best prediction. The R² values for BOD₅ range from 0.59 to 0.72 (average of 0.64), and the highest value reaches 0.82, indicating moderate prediction accuracy.

The prediction results for COD show that the three methods—RR, SVR, and ANN—yield similar performance in predicting water quality indicators, although RR is more stable. The R² of COD prediction by the RR method ranges from 0.76 to 0.85, with an average of 0.82 and a maximum of 0.89, indicating good prediction accuracy. All methods perform well in predicting TN, with SVR providing the best results. The R² for TN prediction using SVR ranges from 0.96 to 0.98 (average of 0.97), reaching up to 0.99, demonstrating very high accuracy. RR, PLSR, and SVR perform well across all water quality indicators. For NO₃-N, RR achieved an R² range of approximately 0.94 to 0.97, with an average of 0.96 and a maximum of 0.98, indicating excellent prediction performance.

In the spectral surrogate monitoring modelling of water quality indicators such as TOC, BOD₅, COD, TN, and NO₃-N, it is found that the RR method provides the best prediction performance for most of the water quality indicators, while SVR performs best for TN, yielding similar accuracy to the RR method.

3.3. Performance Differences Among Water Quality Indicators in Spectral Surrogate Monitoring

The established CARS-based RR model is used to predict the concentrations of the five water quality indicators. Table 1 presents the evaluation of the training and testing sets using MSE, RMSE, and R². Figure 9 shows the difference between the actual values and the predicted results. The lines indicate the predicted and actual concentrations of the water quality indicators, and the bar graph indicates the relative error between the two. Among them, TOC, COD, and TN comply with the accuracy requirements of “The Technical Specification for Acceptance of Water Pollution Source Online Monitoring System (COD_Cr, NH₃-N, etc.)” (HJ 354-2019), while no specific provisions exist for BOD₅ and NO₃-N.

The prediction results for different water quality indicators are shown in Figure 10, with the prediction accuracy ranked as TN > NO₃-N > TOC > COD > BOD₅. From the figure, it can be seen that the prediction results for COD exhibit more outliers, indicating poorer stability compared to the other water quality indicators. The prediction accuracy for BOD₅, COD, and TOC is lower than that for TN and NO₃-N.

Since BOD₅ requires 5 days of biochemical incubation to obtain its concentration value, the biochemical process itself has a certain uncertainty, and BOD₅ contains all substances that can be biodegraded and undergo a degradation process utilising oxygen. Thus, the prediction performance of the spectroscopic method for BOD₅ is relatively poor.

COD contains all organic substances and reduced inorganic substances that can be oxidised in water. Organic substances have spectral absorption in the UV–Vis band. Most inorganic substances have absorption in the infrared band, with fewer showing absorption in the UV–Vis band. Therefore, if the concentration of reduced inorganic substances varies significantly between samples, it will have a greater impact on the spectral prediction of COD. In contrast, TOC, which is primarily related to organic substances, does not face the same interference from reduced inorganic substances as COD, leading to better prediction results.

NO₃-N is known to strongly absorb at 220 nm in pure solution. Although actual water samples are influenced by other substances (e.g., some organic matter containing conjugated double bonds at 220 nm), it is better predicted because of its single component compared to BOD₅, COD, TOC, etc.

TN shows better prediction performance than NO₃-N, with the mean value of R² reaching 0.97, which is not consistent with the expected results. TN includes inorganic nitrogen, such as NO₂-N, NO₃-N, and NH₃-N, as well as nitrogen in organic compounds. Ammonia nitrogen does not exhibit a spectral response in the UV–Vis band, so TN is expected to have worse prediction performance than NO₃-N. One possible explanation is the small sample size in this experiment, which introduces some variability in the results. Another reason may be that the concentrations of the water quality indicators in the samples are low and affected by noise in the spectral data, although noise reduction has been applied.

3.4. Performance of Surrogate Monitoring Models Across Different River Sections

The relative error of the water quality indicator concentrations predicted at each point is shown in Figure 11. The points with poor prediction performance are S10 and S11. For S10, only NO₃-N has poor prediction performance, while the other water quality indicators perform well. The actual NO₃-N concentration at S10 is 0.25 mg/L, which is lower than that of other samples, so the low concentration may be the reason for its poor prediction performance. For S11, the prediction results for all water quality indicators are suboptimal. From the map, it is identified as a third-class tributary of the Maozhou River. It is inferred that the water quality components of this point are quite different from those of other points, leading to poor prediction performance. S1, S2, S10, and S11 are located at the end of tributaries, where water quality components are unstable, which may explain the overall lower prediction results at these points.

The Maozhou River is a typical rain-fed river, primarily replenished by a sewage treatment plant. As a result, the chemical composition of the river is relatively stable, and there is a strong linear relationship between the spectral data and the concentrations of the water quality indicators. Consequently, the RR model performs best in this study, outperforming models that are better suited to nonlinear fitting. By comparison, Faucheux et al. used a submersible UV–Vis spectrometer in the headwater catchment of Kervidy–Naizin, Western France. The spectrometer's default configuration, a global PLS regression calibration, was used for nitrate, dissolved organic carbon, and suspended solids during base flow and flood periods, and the laboratory data correlated well with the spectrometer data [40]. Etheridge et al. tested a commercial UV–Vis spectrometer under tidal marsh conditions and found that PLS regression (PLSR) was notably robust and flexible for all the constituents tested, even under rapidly changing concentrations and salinity [41]. While most existing spectrometers rely on the PLS model, the findings of this study suggest that a simpler model could achieve similar or even better results.
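The point that a regularised linear model is sufficient when the spectral response is linear can be illustrated with a minimal sketch of ridge regression in closed form. This is not the authors' implementation: the data are synthetic, the two "wavelength" features and the penalty values are arbitrary, and the helper names `solve` and `ridge_fit` are assumptions introduced here.

```python
from statistics import mean


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ridge_fit(X, y, alpha):
    """Minimise ||y - Xw - b||^2 + alpha * ||w||^2 with an unpenalised intercept b."""
    p = len(X[0])
    x_mean = [mean(row[j] for row in X) for j in range(p)]
    y_mean = mean(y)
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations of the centred problem: (Xc^T Xc + alpha I) w = Xc^T yc
    A = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(p)]
    w = solve(A, b)
    intercept = y_mean - sum(wj * mj for wj, mj in zip(w, x_mean))
    return w, intercept


# Synthetic absorbances at two hypothetical selected wavelengths (not study data).
X = [[0.1, 0.5], [0.2, 0.3], [0.4, 0.9], [0.6, 0.2], [0.8, 0.7]]
# Concentrations generated from an exactly linear rule: 1 + 2*A1 + 3*A2.
y = [1.0 + 2.0 * a1 + 3.0 * a2 for a1, a2 in X]

w_ols, b_ols = ridge_fit(X, y, alpha=0.0)   # no penalty: recovers the generating rule
w_reg, b_reg = ridge_fit(X, y, alpha=1.0)   # penalty shrinks the coefficients
print(w_ols, b_ols)
print(w_reg, b_reg)
```

With real spectra, the rows of `X` would hold absorbances at the selected characteristic wavelengths, and the penalty `alpha` would normally be tuned by cross-validation rather than fixed by hand.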

3.5. Innovations, Limitations, and Future Work

This article focuses on spectral surrogate monitoring; its main innovations are as follows: (1) The set of indicators studied was extended from the commonly researched COD and NO₃-N to five indicators (TOC, BOD₅, COD, TN, and NO₃-N), with good prediction results. (2) In the wavelength processing step, traditional approaches such as PCA and single-wavelength selection were replaced by characteristic wavelength selection, which both improved the model’s accuracy and simplified the surrogate model. Furthermore, this enhanced our understanding of the chemical mechanisms underlying spectral surrogate monitoring of various water quality indicators.

This study has several limitations: (1) Model performance in the rainy season was not examined and merits separate study. (2) Long-term, in situ, high-frequency monitoring was not conducted at the field site, so the reliability of the equipment and of the model algorithm in real-world applications remains to be established. (3) The underlying mechanisms behind the selection of preferred wavelengths for different water quality indicators were not examined. Other scholars have recently raised the same concern [42].

In the future, research on UV–Vis spectral water quality surrogate monitoring will develop towards more water quality indicators, wider adaptability, higher accuracy and more stable prediction results. The application of UV–Vis spectroscopy is expected to increase significantly. Future work will focus on the following areas:

(1) The UV–Vis spectral water quality surrogate monitoring model will be used to obtain high-frequency, multi-parameter water quality concentration data and to identify anomalies in online water quality data, improving both the speed and the sensitivity of anomaly detection.

(2) UV–Vis monitoring methods will be used to investigate issues related to pipe network overflows. Long-term, online, in situ UV–Vis monitoring yields a large volume of data, which will be used to reduce the influence of random errors on the analysis results and to extract key information from the spectral data that is difficult to detect in the measured concentration data.

(3) Existing water quality models will be coupled with UV–Vis spectral data: spectral data and some other water quality indicator data will serve as inputs, so that the water quality model can directly output spectral data, from which the specific water quality indicators are then derived.

4. Conclusions

This paper addresses water quality sensing by UV–Vis spectroscopy, an active research topic, and investigates the selection of characteristic wavelengths and surrogate models for five typical water quality indicators. The study uses water quality data from a rain-fed river in the dry season. The following conclusions are drawn:

(1) A noise reduction method for the raw spectral data was established. A sliding (moving) average was applied to the raw data, and the window size was chosen by maximising the signal-to-noise ratio (SNR), which yielded the most suitable noise reduction window.

(2) An algorithm for the selection of characteristic wavelengths was established. The selected characteristic wavelengths are used as the input of the surrogate monitoring model. In the comparison, CARS showed the clearest advantages and was selected as the wavelength selection method for each water quality indicator.

(3) Among the machine learning models tested for spectral surrogate monitoring, RR performed best. Applied to the TOC, BOD₅, COD, TN, and NO₃-N water quality indicators, it gave median R² values of 0.80, 0.64, 0.82, 0.97, and 0.96, respectively. TOC, COD, and TN meet the accuracy requirements of “The Technical Specification for Acceptance of Water Pollution Source Online Monitoring System (COD_Cr, NH₃-N, etc.)” (HJ 354-2019), while there are no specific provisions for BOD₅ and NO₃-N.

(4) Characteristic wavelength selection significantly reduced the complexity of the prediction model and improved the performance of the regression model. With a simpler model, the instrument configuration also becomes more operationally efficient. The study provides valuable technical details and meaningful reference points for water quality spectral surrogate monitoring.

(5) Further research in the field of spectral surrogate monitoring is needed to explore additional technical details. This field holds considerable application potential in scenarios such as water quality data anomaly identification, water quality model construction, and high-resolution water quality analysis.
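The denoising step summarised in conclusion (1) can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the spectrum is synthetic, the candidate window sizes are arbitrary, and the SNR is computed against a known noise-free reference, which is only available in simulation. The study's own SNR criterion is defined in its Methods section and is not reproduced here.

```python
import math
import random


def moving_average(values, window):
    """Centred moving average; the window is truncated at the ends of the spectrum."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def snr_db(reference, estimate):
    """SNR in dB: reference power divided by the power of (estimate - reference)."""
    signal = sum(r * r for r in reference)
    noise = sum((e - r) ** 2 for e, r in zip(estimate, reference))
    return 10.0 * math.log10(signal / noise)


# Synthetic spectrum: one broad absorption band plus Gaussian noise (illustrative only).
rng = random.Random(0)
clean = [math.exp(-((i - 100) / 30.0) ** 2) for i in range(200)]
raw = [c + rng.gauss(0.0, 0.05) for c in clean]

candidate_windows = [1, 3, 5, 9, 15, 25, 41]  # odd window sizes, hypothetical grid
scores = {w: snr_db(clean, moving_average(raw, w)) for w in candidate_windows}
best_window = max(scores, key=scores.get)
print(best_window, round(scores[best_window], 2))
```

A window of 1 leaves the data unchanged, so it serves as the no-smoothing baseline. Very wide windows suppress more noise but also flatten the absorption band, which is why the SNR passes through a maximum at an intermediate window size.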

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/w17030343/s1, SI S1—Pure solution configuration methods; SI S2—Pure solution results; SI S3—Optimal parameter selection of water quality indicator prediction model based on CARS.

Author Contributions

Conceptualisation, C.C. and M.L.; methodology, C.C. and M.L.; software, M.L. and W.W.; validation, S.C. and W.W.; investigation, C.C., M.L. and S.C.; resources, C.C., Y.P. and H.L.; data curation, M.L. and S.C.; writing—original draft preparation, C.C.; writing—review and editing, H.L. and Q.L.; visualisation, M.L. and W.W.; supervision, M.L. and Q.L.; project administration, C.C., Y.P. and Q.L.; funding acquisition, C.C. and M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science, Technology and Innovation Commission of Shenzhen (Grant KCXFZ20201221173603009) and the Power Construction Corporation of China (ST-ZB-ZC-JY-JS-2023-02).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that this study received funding from Power Construction Corporation of China. Power China Eco-Environmental Group Co., Ltd. is a wholly owned subsidiary of Power Construction Corporation of China. PowerChina Water Environmental Technology Co., Ltd. is a wholly owned subsidiary of Power China Eco-Environmental Group Co., Ltd. Author Meiyu Luo was employed by the company Shenzhen Zhishu Environmental Technology Co., Ltd.

References

1. Sun, Y.; Wang, D.; Li, L.; Ning, R.; Yu, S.; Gao, N. Application of Remote Sensing Technology in Water Quality Monitoring: From Traditional Approaches to Artificial Intelligence. Water Res. 2024, 267, 122546.
2. Shi, Z.; Chow, C.W.K.; Fabris, R.; Liu, J.; Jin, B. Applications of Online UV-Vis Spectrophotometer for Drinking Water Quality Monitoring and Process Control: A Review. Sensors 2022, 22, 2987.
3. Storey, M.V.; van der Gaag, B.; Burns, B.P. Advances in On-Line Drinking Water Quality Monitoring and Early Warning Systems. Water Res. 2011, 45, 741–747.
4. Bastian, R.; Weberling, R.; Palilla, F. Ultraviolet Spectrophotometric Determination of Nitrate... Application to Analysis of Alkaline Carbonates. Anal. Chem. 1957, 29, 1795–1797.
5. Armstrong, F.A.J. Determination of Nitrate in Water Ultraviolet Spectrophotometry. Anal. Chem. 1963, 35, 1292–1294.
6. Hoather, R.C.; Rackham, R.F. Oxidised Nitrogen in Waters and Sewage Effluents Observed by Ultra-Violet Spectrophotometry. Analyst 1959, 84, 548–551.
7. Ogura, N.; Hanya, T. Ultraviolet Absorbance as an Index of the Pollution of Seawater. J. Water Pollut. Control Fed. 1968, 40, 464–467.
8. Chevakidagarn, P. BOD5 Estimation by Using UV Absorption and COD for Rapid Industrial Effluent Monitoring. Environ. Monit. Assess. 2007, 131, 445–450.
9. Chellaiah, C.; Anbalagan, S.; Swaminathan, D.; Chowdhury, S.; Kadhila, T.; Shopati, A.K.; Shangdiar, S.; Sharma, B.; Amesho, K.T.T. Integrating Deep Learning Techniques for Effective River Water Quality Monitoring and Management. J. Environ. Manag. 2024, 370, 122477.
10. Verma, A.K.; Singh, T.N. Prediction of Water Quality from Simple Field Parameters. Environ. Earth Sci. 2013, 69, 821–829.
11. Lepot, M.; Torres, A.; Hofer, T.; Caradot, N.; Gruber, G.; Aubin, J.-B.; Bertrand-Krajewski, J.-L. Calibration of UV/Vis Spectrophotometers: A Review and Comparison of Different Methods to Estimate TSS and Total and Dissolved COD Concentrations in Sewers, WWTPs and Rivers. Water Res. 2016, 101, 519–534.
12. Guo, Y.; Liu, C.; Ye, R.; Duan, Q. Advances on Water Quality Detection by UV-Vis Spectroscopy. Appl. Sci. 2020, 10, 6874.
13. Berisha, V.; Krantsevich, C.; Hahn, P.R.; Hahn, S.; Dasarathy, G.; Turaga, P.; Liss, J. Digital Medicine and the Curse of Dimensionality. npj Digit. Med. 2021, 4, 153.
14. Yun, Y.-H.; Li, H.-D.; Deng, B.-C.; Cao, D.-S. An Overview of Variable Selection Methods in Multivariate Analysis of Near-Infrared Spectra. TrAC Trends Anal. Chem. 2019, 113, 102–115.
15. Feng, S.; Zhao, D.; Guan, Q.; Li, J.; Liu, Z.; Jin, Z.; Li, G.; Xu, T. A Deep Convolutional Neural Network-Based Wavelength Selection Method for Spectral Characteristics of Rice Blast Disease. Comput. Electron. Agric. 2022, 199, 107199.
16. McCrea, R.; King, R.; Graham, L.; Börger, L. Realising the Promise of Large Data and Complex Models. Methods Ecol. Evol. 2023, 14, 4–11.
17. GB 3838-2002; Surface Water Environmental Quality Standards; National Environmental Protection Agency of China: Beijing, China, 2002.
18. HJ 915-2017; Technical Specification for Automatic Monitoring of Surface Water; Ministry of Environmental Protection of the People’s Republic of China: Beijing, China, 2017.
19. Menini, L.; Possieri, C.; Tornambe, A. Observers for Linear Systems by the Time Integrals and Moving Average of the Output. IEEE Trans. Automat. Control 2019, 64, 4859–4874.
20. Schimmack, M.; Mercorelli, P. An On-Line Orthogonal Wavelet Denoising Algorithm for High-Resolution Surface Scans. J. Frankl. Inst. 2018, 355, 9245–9270.
21. Shekar, S.; Chien, C.-C.; Hartel, A.; Ong, P.; Clarke, O.B.; Marks, A.; Drndic, M.; Shepard, K.L. Wavelet Denoising of High-Bandwidth Nanopore and Ion-Channel Signals. Nano Lett. 2019, 19, 1090–1097.
22. Farajzadeh-D, M.-G.; Hosseini Sani, S.K.; Akbarzadeh, A. Performance Enhancement of Model Reference Adaptive Control through Normalized Lyapunov Design. Proc. Inst. Mech. Eng. Part I J. Syst. Control Eng. 2019, 233, 1209–1220.
23. Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The Successive Projections Algorithm for Variable Selection in Spectroscopic Multicomponent Analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73.
24. Hailong, W.; Guoguo, Y.; Yu, Z.; Yidan, B.; Yong, H. Detection of Fungal Disease on Tomato Leaves with Competitive Adaptive Reweighted Sampling and Correlation Analysis Methods. Spectrosc. Spectr. Anal. 2017, 37, 2115–2119.
25. Tang, G.; Huang, Y.; Tian, K.; Song, X.; Yan, H.; Hu, J.; Xiong, Y.; Min, S. A New Spectral Variable Selection Pattern Using Competitive Adaptive Reweighted Sampling Combined with Successive Projections Algorithm. Analyst 2014, 139, 4894.
26. Rato, T.J.; Reis, M.S. Multiresolution Interval Partial Least Squares: A Framework for Waveband Selection and Resolution Optimization. Chemom. Intell. Lab. Syst. 2019, 186, 41–54.
27. He, Y.; Zhao, Y.; Zhang, C.; Li, Y.; Bao, Y.; Liu, F. Discrimination of Grape Seeds Using Laser-Induced Breakdown Spectroscopy in Combination with Region Selection and Supervised Classification Methods. Foods 2020, 9, 199.
28. Ferrer Palomino, A.; Sánchez Espino, P.; Borrego Reyes, C.; Jiménez Rojas, J.A.; Rodríguez y Silva, F. Estimation of Moisture in Live Fuels in the Mediterranean: Linear Regressions and Random Forests. J. Environ. Manag. 2022, 322, 116069.
29. Ma, M.; Gu, L.; Shen, Y.; Guan, Q.; Wang, C.; Deng, H.; Zhong, X.; Xia, M.; Shi, D. Computational Framework for Turbid Water Single-Pixel Imaging by Polynomial Regression and Feature Enhancement. IEEE Trans. Instrum. Meas. 2023, 72, 1–11.
30. Shi, Z.; Han, M. Ridge Regression Learning in ESN for Chaotic Time Series Prediction. Control. Decis. 2007, 22, 258–261.
31. Mohammadi, H.A.; Ghofrani, S.; Nikseresht, A. Using Empirical Wavelet Transform and High-Order Fuzzy Cognitive Maps for Time Series Forecasting. Appl. Soft Comput. 2023, 135, 109990.
32. Talebi, M.; Schuster, G.; Shellie, R.A.; Szucs, R.; Haddad, P.R. Performance Comparison of Partial Least Squares-Related Variable Selection Methods for Quantitative Structure Retention Relationships Modelling of Retention Times in Reversed-Phase Liquid Chromatography. J. Chromatogr. A 2015, 1424, 69–76.
33. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
34. Beniwal, M.; Singh, A.; Kumar, N. Forecasting Long-Term Stock Prices of Global Indices: A Forward-Validating Genetic Algorithm Optimization Approach for Support Vector Regression. Appl. Soft Comput. 2023, 145, 110566.
35. Wang, Z.; Shao, Y.; Ye, T.; Sun, S. Research on Optimization Method for Passive Control Strategy in CLLC-SMES System Based on BP Neural Network. J. Energy Storage 2024, 86, 111175.
36. Chen, Y.; Song, L.; Liu, Y.; Yang, L.; Li, D. A Review of the Artificial Neural Network Models for Water Quality Prediction. Appl. Sci. 2020, 10, 5776.
37. HJ 354-2019; Technical Specification for Acceptance of Online Monitoring Systems for Water Pollution Sources (CODCr, NH3-N, etc.). Ministry of Ecology and Environment of the People’s Republic of China: Beijing, China, 2019.
38. van den Broeke, J.; Langergraber, G.; Weingartner, A. On-Line and in Situ UV/Vis Spectroscopy for Multi-Parameter Measurements: A Brief Review. Spectrosc. Eur. 2006, 18, S3–S4.
39. Azqandi, M.; Nateq, K.; Golrizkhatami, F.; Nasseh, N.; Seyedi, N.; Moghaddam, N.S.M.; Fanaei, F. Innovative RGO-Bridged S-Scheme CuFe₂O₄@Ag₂S Heterojunction for Efficient Sun-Light-Driven Photocatalytic Disintegration of Ciprofloxacin. Carbon 2025, 231, 119725.
40. Faucheux, M.; Fovet, O.; Gruau, G.; Jaffrézic, A.; Petitjean, P.; Gascuel, C.; Ruiz, L. Real Time High Frequency Monitoring of Water Quality in River Streams Using a UV-Visible Spectrometer: Interest, Limits and Consequences for Monitoring Strategies. Geophys. Res. Abstr. 2013, 15, EGU2013.
41. Etheridge, J.R.; Birgand, F.; Osborne, J.A.; Osburn, C.L.; Burchell, M.R.; Irving, J. Using in Situ Ultraviolet-Visual Spectroscopy to Measure Nitrogen, Carbon, Phosphorus, and Suspended Solids Concentrations at a High Frequency in a Brackish Tidal Marsh: In Situ Spectroscopy to Monitor N, C, P, TSS. Limnol. Oceanogr. Methods 2014, 12, 10–22.
42. Jie, C.; Lifu, Z.; Linshan, Z.; Hongming, Z.; Qingxi, T. Research Progress on Online Monitoring Technologies of Water Quality Parameters Based on Ultraviolet-Visible Spectra. Remote Sens. Nat. Resour. 2021, 33, 1–9.

Figure 1. UV–Vis spectrum water quality surrogate monitoring platform. (a) Experimental level and (b) industrial level.

Figure 2. Distribution of sampling points in the Maozhou River.

Figure 3. UV–Vis absorption spectrum of 50 mg/L KHP solution before and after denoising (a) before denoising; (b) after denoising.

Figure 4. Comparison of the accuracy of the surrogate model based on different wavelength processing methods. SingleWave: single wavelength at 220 nm; PCA: PCA with 5 components; WaveSel: 45 wavelengths selected using the CARS method; FullSpec: full spectrum method.

Figure 5. Distribution of characteristic wavelengths selected by CARS for water quality indicator prediction.

Figure 6. Prediction performance of seven wavelength selection methods for water quality indicators based on the PLS model. (Exmum: extremum method, Corr: correlation coefficient method, LinCoef: linear regression coefficient method, SPA: successive projections algorithm, CARS: competitive adaptive reweighted sampling, iPLS: interval partial least squares method, Mix: mixing method).

Figure 7. Preferred wavelengths corresponding to different water quality indicators.

Figure 8. Box plots of the prediction performance of six prediction models for water quality indicator concentrations.

Figure 9. Prediction results of different water quality indicators using the CARS-based RR model (a) TOC; (b) BOD₅; (c) COD; (d) TN; and (e) NO₃-N.

Figure 10. Comparison of the prediction performance of different water quality indicators.

Figure 11. Relative error of the predicted water quality indicator concentrations at each point.

Table 1. Performance of five water quality indicators in spectral surrogate monitoring for training set and testing set.

WQ Indicator	Training MSE	Training RMSE	Training R²	Testing MSE	Testing RMSE	Testing R²
TOC	0.325	0.570	0.816	0.346	0.588	0.779
BOD₅	0.532	0.729	0.725	0.631	0.794	0.682
COD	0.311	0.558	0.823	0.356	0.597	0.791
TN	0.079	0.281	0.989	0.145	0.380	0.945
NO₃-N	0.034	0.183	0.983	0.183	0.428	0.941
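As a quick consistency check on Table 1, RMSE should equal the square root of MSE up to rounding. The short sketch below transcribes the (MSE, RMSE) pairs from the table and reports the largest discrepancy; the dictionary layout and the function name `rmse_gaps` are illustrative assumptions, and the units of the error metrics are not stated in the table.

```python
import math

# (MSE, RMSE) pairs transcribed from Table 1: (training set, testing set).
TABLE_1 = {
    "TOC":   ((0.325, 0.570), (0.346, 0.588)),
    "BOD5":  ((0.532, 0.729), (0.631, 0.794)),
    "COD":   ((0.311, 0.558), (0.356, 0.597)),
    "TN":    ((0.079, 0.281), (0.145, 0.380)),
    "NO3-N": ((0.034, 0.183), (0.183, 0.428)),
}


def rmse_gaps(table):
    """Return |sqrt(MSE) - RMSE| for every indicator and data split."""
    gaps = {}
    for indicator, (train, test) in table.items():
        for split, (mse, rmse) in (("train", train), ("test", test)):
            gaps[(indicator, split)] = abs(math.sqrt(mse) - rmse)
    return gaps


gaps = rmse_gaps(TABLE_1)
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 4))  # prints ('NO3-N', 'train') 0.0014
```

All discrepancies are below 0.002, which is compatible with both columns having been rounded independently to three decimal places.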


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
