1. Introduction
Many chemical agents (CAs) have been developed during the advancement of human civilization. Since many CAs, which are harmful when in contact with the human body, are colorless and odorless, it is difficult to respond to threats of chemical gas terror and chemical gas leak accident quickly. To deal with these threats, non-contact CA detection techniques are essential. As one of the non-contact CA detection techniques, the Raman spectrometer, which is capable of non-destructive analysis of target materials, has been studied [
1,
2,
3]. When a light is irradiated onto a material, some fraction of light is scattered with some frequency change, which is called Raman scattering. Raman scattering depends on the molecular structure and characteristics of the material [
4]. Then, a Raman spectrometer measures this Raman scattering and generates a spectrum. In a non-contact measurement setting, the measured spectrum contains not only the Raman (scattering) spectrum of the CA, but also those of background materials and noise. Accordingly, many CA detection algorithms using Raman spectroscopy have been studied [
5,
6,
7,
8,
9,
10].
However, CA detection algorithms are disturbed by the baseline in the measured spectrum. The baseline is mainly caused by fluorescence that occurs almost simultaneously with the generation of the Raman scattering [
11]. In order to suppress the baseline, several methods of physically blocking the fluorescence signature have been proposed [
12,
13,
14,
15]. These methods are based on the fact that the lifetime of fluorescence is much longer than that of the Raman scattering. By shutting down the gate before fluorescence occurs, the fluorescence signature can be suppressed in the measured spectrum. However, it is hard to control the gate open time very precisely (about a
s scale). Therefore, the baseline correction algorithms, which estimate the baseline from the measured spectrum and remove it, have been proposed [
16,
17,
18,
19,
20].
The baseline is usually a smooth curve in contrast to a Raman spectrum composed of sharp peaks. Filter-based algorithms were proposed [
16,
17]. Because these filter-based algorithms distort the Raman spectrum, penalized least squares(PLS)-based algorithms have been proposed [
18,
19,
20]. These PLS-based algorithms estimate the baseline by smoothing a curve of the measured spectrum while giving a penalty to the signatures at Raman shifts suspected of having the Raman scattering. PLS-based algorithms distinguish the baseline from the Raman spectrum according to the difference in curvature. However, when the signal-to-noise ratio (SNR) of the measured spectrum is low, the curvature of the Raman spectrum is degraded and PLS-based algorithms calculate the incorrect baseline.
In this paper, to estimate the baseline while minimizing the distortion of the Raman spectrum, we propose the algorithm that estimates both the baseline and Raman spectrum simultaneously. Assuming that the baseline is a weighted summation of broad Gaussian vectors, we model the measured spectrum as a linear combination of broad Gaussian vectors, the reference CA spectrum, and bases of background materials. Then, we obtain coefficients of broad Gaussian vectors by the least squares method and calculate the baseline using these coefficients. From the experiment with real CA data measured by a Raman spectrometer (Korea Raman Agent Monitoring System, K-RAMS), we demonstrate that the proposed baseline correction algorithm accurately estimates the baseline while preserving the Raman spectrum of the CA and background. We also show that the proposed algorithm improves the CA detection performance better than other baseline correction algorithms.
There are three contributions in this article. The first contribution is to accurately estimate both the Raman signature and baseline using reference spectra of target CAs, background basis matrix and broad Gaussian vectors. The second contribution is to propose how to design the broad Gaussian vectors. We introduce some conditions for Gaussian vectors to estimate the baselines effectively. Based on these conditions, the mean and variance of each Gaussian vector are determined. The final contribution is to show the novelty of the proposed algorithm via experiments with real CA measurements.
The remainder of this paper is organized as follows. In
Section 2, we introduce the system model for measured spectra and review the conventional baseline correction algorithms. In
Section 3, we explain the proposed baseline correction algorithm that estimates the baseline and Raman signature simultaneously using broad Gaussian vectors, the reference spectrum of CA, and the basis function of background materials. We also discuss design parameters for broad Gaussian vectors. In
Section 4, real-data experiments presenting the superiority of the proposed baseline correction algorithm are described. The final discussions are drawn in
Section 5.
2. Conventional Baseline Correction Algorithms
Before we review the conventional baseline correction algorithms, we briefly introduce the signal model of the spectrum measured by the Raman spectrometer. Let
denote the measured spectrum, where
is a spectral value at the
ith Raman shift, for
, and
p is the number of channels. Here,
means that the vector
is a vector with
p real values. Then, the measured spectrum
can be represented as a sum of the Raman spectrum, the baseline, and the noise [
18,
19,
20,
21] as
where
is the Raman spectrum,
denotes the baseline, and
is the noise signature. The noise emerged by the spectrometer is modeled as the Gaussian noise with mean
and covariance
, i.e.,
. Here,
is a diagonal matrix of which diagonal components are
,
is the correction factor and
implies that the matrix
has
p real rows and
p real columns.
In (1), the Raman spectrum
is represented as a linear combination of the reference spectra of target CAs and the background basis matrix [
5] as follows:
where
is the reference CA matrix that consists of
C reference CA spectra
, for
,
refers to the background basis matrix composed of
M bases of the background materials
, for
. Then,
is the intensity vector for each CA signature,
denotes the coefficient vector of the background basis matrix,
, and
. Background basis functions
are obtained by measuring many Raman spectra for background materials, correcting Raman spectra, applying the singular value decomposition (SVD) to the corrected Raman spectra, and extracting singular vectors corresponding to large singular values [
22].
Substituting (2) into (1), the measured Raman spectrum is expressed as
Let
denote the baseline corrected spectrum, i.e.,
. Then, the baseline corrected spectrum
follows the linear subspace model (LSM) under two hypotheses as
where
and
represent hypotheses for the absence and presence of the CA, respectively. Because the
is a deterministic vector and
is a Gaussian random vector, the
is also the Gaussian random vector. As we can see in (4), if the baseline remains in the spectrum
, the spectrum
does not follow the LSM. It results in reducing the accuracy of the detection algorithms. To address this problem, many baseline correction algorithms have been proposed. We introduce these baseline correction algorithms.
2.1. Iterative Median Filter (IMF)
In general, the baseline is in the form of a smooth curve unlike the Raman signature composed of several peaks of certain shapes. The median filter constructs a window with n spectral values nearby a spectral value in the spectrum, finds the median value in the window, and replace with . The median filter repeats this process for all spectral values and obtains the smoothing curve of the spectrum. The iterative median filter finds the baseline by applying the median filter iteratively.
2.2. Rolling Circle Filter (RCF)
The rolling circle filter (RCF) is an algorithm that estimates the baseline by rolling a circle of an appropriate size in contact with the measured spectrum. In the process of rolling the circle, the curvature radius of the circle is smaller than that of the baseline, but larger than that of the Raman spectrum so that the circle is tangent to the baseline, but not to the Raman spectrum. Finally, the baseline can be calculated by connecting arcs of the circles tangent to the baseline.
2.3. Asymmetric Least Squares (ALS)
Since these filter-based algorithms consider low frequency component signatures of the Raman spectrum as the baseline, they cause the distortion of the Raman spectrum during the baseline correction. To deal with this problem, algorithms based on the penalized least squares (PLS) have been proposed. These algorithms estimate the baseline using the least squares fitting while giving a penalty to the Raman shifts suspected of having Raman signatures. Let us define
as the cost function according to the baseline
as follows:
where
is a penalty matrix that is a diagonal matrix composed of a penalty
at the
ith Raman shift, for
,
is a regularization coefficient for the smoothing, and
is the secondary order difference matrix.
In (5), the term
implies fitness of the baseline
to the spectrum
and
represents smoothness of the baseline. The optimal baseline
is obtained by solving
as
In (6), the baseline is determined by the measured spectrum and the penalty matrix . Let and denote the Raman shifts with and without the Raman spectrum, respectively. Then, the penalty is close to 1, otherwise, the penalty becomes almost 0. If we know Raman shifts at which the Raman signature exists, we obtain the penalty matrix exactly. However, it is an unrealistic assumption to know these Raman shifts beforehand.
To estimate the baseline without the information about Raman shifts having the Raman spectrum, Eilers and Boelens proposed an asymmetric least squares (ALS) [
18]. In the ALS, the penalty
is allocated according to the baseline
and measured spectral value
at the
ith Raman shift as follows:
where
represents the asymmetric parameter determining the penalty and is recommended to be assigned from
to
. In the ALS, if the measured spectral value
exceeds the baseline
, it is determined that there is the Raman spectrum at the
ith Raman shift. From (6) and (7), the baseline
is not expressed in a closed-form solution. Therefore, we obtain the penalty matrix
and the baseline
using an iterative method until the penalty matrix does not change.
2.4. Adaptive Iterative Reweighted Penalized Least Squares (AirPLS)
In the ALS, penalties at Raman shifts with the Raman spectrum are all the same. Since the curvature of the Raman spectrum at each Raman shift varies, the penalty need to be changed according to the Raman shift. From this point of view, Zhang proposed the adaptive iterative reweighted penalized least squares (AirPLS) [
19]. In the AirPLS, the penalty is determined by the difference between the measured spectrum and the baseline. At the
jth iteration step, the penalty
is obtained as
where the vector
consists of negative elements of
. Like the ALS, the baseline
and the penalty matrix
are obtained by alternating (6) and (8) iteratively. Here, the condition to terminate the iteration is as follows:
2.5. Asymmetrically Reweighted Penalized Least Squares (ArPLS)
The ALS and AirPLS extract the baseline from the measured spectrum well while preserving the Raman signature. However, these algorithms are vulnerable to random noises. To deal with this problem, Baek proposed an asymmetrically reweighted penalized least squares (ArPLS) based on the partially balanced weighting scheme. The ArPLS acquires the mean and variance of noise signatures at Raman shifts without the Raman spectrum. Using these statistics, penalties at Raman shifts with the Raman spectrum are corrected as
where
and
are the mean and the standard deviation of
. The baseline is acquired by alternating (6) and (10) iteratively.
3. Proposed Baseline Correction Algorithm
Conventional baseline correction algorithms estimate the baseline under the assumption that the curvature of the Raman signature is significantly larger than that of the baseline. However, when the measured Raman spectrum has a low signal-to-noise ratio (SNR), the curvature of the Raman signature is reduced, resulting in distortion of the Raman signature during the baseline correction. In this section, we propose a baseline correction algorithm that is more suitable to detection algorithms that exploits the background basis matrix and the reference spectrum of the target CA.
Since the baseline is generally a curve with less curvature, it is modeled as a linear combination of broad Gaussian vectors [
23] as
where
is a broad Gaussian matrix composed of
L broad Gaussian vectors
, for
, and
is a coefficient vector for broad Gaussian vectors. Since widths of broad Gaussian vectors are wider than those of peaks in the Raman spectrum,
and
are linearly independent. By substituting (11) into (3), the measured spectrum
is represented as the LSM form as
where
and
. To remove the baseline while minimizing the distortion of the Raman spectrum, we need to estimate both the baseline
and the Raman spectrum
simultaneously. It is accomplished by obtaining
via the least squares method as
Finally, using in , we remove the estimated baseline from the measured spectrum .
The accuracy of the estimated baseline depends on the broad Gaussian vector
, for
. The
ith component
of the broad Gaussian vector
is expressed as the Gaussian function with mean
and variance
as
where
is the wavenumber of the
ith Raman shift. The mean
determines the interval between the broad Gaussian vectors. In order to estimate a shape baseline generated at outer Raman shifts, it is recommended that the means of the first and last broad Gaussian vectors be set to wavenumbers of the first and last Raman shift, respectively, i.e.,
and
. If we do not follow this recommendation, coefficients of the first and last Gaussian vectors are so large that coefficient estimation for a shape baseline generated at outer Raman shifts becomes unstable. Therefore, we fix the means of the first and last broad Gaussian vectors. Let broad Gaussian vectors be equally spaced. Then, mean
satisfying this condition is obtained as
Then, the interval between broad Gaussian vectors is determined as the number of broad Gaussian vectors L.
The variance
implies the width of a broad Gaussian vector. It is recommended that the variance be set to
times the interval between adjacent broad Gaussian vectors as
where
denotes the interval between adjacent broad Gaussian vectors. As shown in (15) and (16), the mean
and variance
are determined by the number of Gaussian vectors
L. The larger the number of Gaussian vectors is used, the more accurate the baseline is estimated. However, when the number of Gaussian vectors exceeds a certain level and the width of the Gaussian vectors becomes narrower than that of peaks of the Raman signature, (12) is linearly dependent and both the Raman spectrum and baseline are over-fitted. This overfitting skews the estimation results for both the Raman signature and baseline.The width of each peak in the Raman spectrum is less than 350 cm
in general. Therefore, it is recommended that the variance of each broad Gaussian vector exceed
cm
. In the experiment, we used 11 broad Gaussian vectors of which variances are 265 cm
when the wavenumber range of the Raman shift is from 375 to 3500 cm
. These broad Gaussian function are shown in
Figure 1.
In fact, the baseline can be modeled as a linear combinations of other basis functions, i.e., polynomial functions. Nevertheless, the broad Gaussian vectors have some benefits. The first benefit is easy to design broad Gaussian vectors. Design parameters for the broad Gaussian vectors are only two, i.e., mean and variance, and are determined by the number of broad Gaussian vectors
L. The second benefit is that the broad Gaussian vectors do not cause Gibson errors. To estimate the baseline effectively, it is recommended that all basis functions have the same sign. When designing the basis function, all values below zero are set to 0, which makes the ringing artifacts that are mainly caused by discontinuity of basis functions [
24]. On the other hand, the broad Gaussian vectors always have positive values and are free from ringing artifacts.
4. Experimental Results
In this section, we describe experiments using real CA data to compare baseline correction algorithms. Raman spectrum data used in the experiments were collected by the Korea Raman Agent Monitoring System (K-RAMS, Agency for Defense Development, Korea), which provides data with a resolution of 3.3 cm
from 375 to 3500 cm
with 947 channels. In the K-RAMS, a KrF excimer laser at 248.35 nm was used as the light source to generate Raman scattering of chemicals [
25].
In the experiment, the cyclosarin (GF) was selected as a target chemical agent.
Figure 2a shows the reference Raman spectrum of the GF. In
Figure 2a, there are a main peak at 2700∼3100 cm
bands and several subpeaks at 500∼1700 cm
bands. The reference CA matrix consists of reference spectra of seven target CAs, i.e., the GF, distilled mustard (H), nitrogen mustard (HN), benzyl chloride, DMMP, MES, and phosphorus trichloride. The background basis matrix
is composed of six basis spectra of major background materials, i.e., the oxygen, nitrogen, concrete, asphalt, grass and soil. The molecular structures and reference Raman spectra of seven target CAs are introduced in [
26].
The experiment conditions are as follows. The distance between the spectrometer and each target chemical was set to 1 m. We measured the concrete background 1000 times. Then, we dropped GF 0.5
L on the concrete background and measured the GF 500 times. We denote concrete background and GF spectra as concrete-only spectra and GF-on-concrete spectra, respectively.
Figure 2b shows the GF-on-concrete and concrete-only spectra. In the GF-on-concrete spectrum, the main peak of the GF is confirmed. However, subpeaks of the GF are obscured by noise signatures. Since Raman spectra were taken at a very close range (about under 10 cm) in general contact measurements, the signal-to-noise ratio (SNR) of the chemical agent (CA) was so high that every subpeak is well observed. However, for the non-contact measurements (about more than 0.5 m), some fractions of Raman scattering are measured by the Raman spectrometer. In both spectra, peaks of the oxygen and nitrogen are represented at 1550 and 2300 cm
bands, respectively. We also see the baseline throughout the entire band.
First, we compared baselines estimated by the proposed algorithm according to the number of Gaussian vectors as shown in
Figure 3. We applied the proposed algorithm with 5, 11, and 30 Gaussian vectors, i.e.,
and 30, to the GF-on-concrete and concrete-only spectra. In cases of
and 11, the proposed algorithm well estimates the baseline except for Raman spectrum signatures, such as the peaks of the oxygen, nitrogen, and GF. It is confirmed that the baseline with
is more accurate than that with
. However, the proposed algorithm with
does not approximate the baseline due to an overfitting and causes the distortion of the Raman spectrum.
For more objective competition, we adopt the root mean square modeling error (RMSME), which is a metric evaluating how baseline correction algorithms effectively removes the baseline while preserving the Raman spectrum. The modeling error
is determined from the baseline-corrected spectrum
as
where the estimated coefficient vector
is obtained by the least squares method as
. Then, the RMSME is defined as follows:
where
is the
ith value of the modeling error
.
Table 1 describes RMSME averages of 500 GF-on-concrete spectra and 1000 concrete-only spectra according to the number of broad Gaussian vectors. In
Table 1, ‘Non BC’ indicates the measured spectra without any baseline correction algorithms. The RMSMEs of spectra without the baseline corrections are higher than those with the proposed baseline correction algorithm. In case of
, modeling errors are minimized, which implies the proposed baseline correction algorithm with
accurately estimates the baseline while preserving the Raman spectrum as much as possible.
Next, we compared the proposed baseline correction algorithm with other baseline correction algorithms mentioned in
Section 2, i.e., the iterative median filter (IMF), rolling circle filter (RCF), asymmetric least squares (ALS), adaptive iterative reweighted penalized least squares (AirPLS), and asymmetrically reweighted penalized least squares (ArPLS). We found the optimal design parameters for each algorithm, which minimizes the RMSME numerically. The optimal design parameters for each algorithm are as follows. In the case of the IMF, the window size is
cm
and the number of iterations is 5. For the RCF, the radius of the circle is set to
cm
. The regularization parameters of the ALS, AirPLS, and ArPLS are determined as 1000, 50, and 200, respectively. In the case of the proposed algorithm, the optimum number of Gaussian vectors is 11.
Figure 4a,b depict baselines estimated by several baseline correction algorithms from the GF-on-concrete spectrum and concrete-only spectrum, respectively. In cases of IMF and RCF, a little Raman spectrum of the GF at 2700∼3100 cm
band is regarded as the baseline. The baseline estimated by the ALS is located below the other baselines since the ALS is affected by the negative part of the noise. On the other hand, AirPLS, ArPLS, and the proposed baseline correction algorithm estimate the baseline.
For more objective comparisons, we also obtained RMSME averages of 500 GF-on-concrete spectra and 1000 concrete-only spectra for baseline correction algorithms as shown in
Table 2. It is confirmed that any baseline correction algorithms can suppress the modeling error. Since IMF and RCF distort peaks of GF shown in
Figure 4, RMSMEs for IMF and RCF are less than those for other baseline correction algorithms. The proposed algorithm minimizes the RMSMEs, because the proposed algorithm preserves the Raman spectrum as much as possible by estimating the baseline and Raman spectrum simultaneously.
Finally, we analyze the effect of each baseline removal algorithm on the CA detection performance using the receiver of characteristic (ROC) curve. The ROC curve, which shows the relation between false alarm probabilities and detection probabilities, is widely used for a metric for evaluating the detection performance. In the experiment, we selected the adaptive subspace detector (ASD) as a CA detection algorithm. The ASD, which is known as the optimal detector for the LSM [
22,
27], is obtained by applying the generalized likelihood ratio test (GLRT) to (4).
The test statistic
of the ASD is defined as
where
denotes the test statistic of the ASD for the baseline-corrected spectrum
,
and
are the orthogonal projection matrices for a subspace spanned by
and
, respectively. Here,
and
denote the Raman signature basis matrix and background signature basis matrix, respectively, as described in
Section 2. If
exceeds a detection threshold
, it is determined that the hypothesis
is true. Otherwise,
is true.
Then, we applied the ASD to baseline corrected spectra and obtained ROC curves. To acquire the ROC curves, 500 GF-on-concrete spectra and 1000 concrete-only spectra were used.
Figure 5 presents the ROC curves of the ASD according to the baseline correction algorithms. The closer the ROC curve is to the upper left, the better detection performance is, since it has the higher detection probability under the same false alarm probability. It can be seen that the detection performance is good in order of the proposed algorithm, ArPLS, AirPLS, ALS, IMF, RCF, and non-baseline correction. This result is in agreement with the result pertaining to the RMSME averages in
Table 2.
We conducted another experiment with a phosphorus trichloride (PH) on the asphalt background. First, we graphically compared the baseline correction results according to several baseline correction algorithms.
Figure 6a shows the reference Raman spectrum of the PH. In
Figure 6a, there are a main peak at the 450∼650 cm
band and several subpeaks at the 650∼1800 cm
band. The experiment conditions are almost the same as the GF experiment. We measured the asphalt background 1600 times. Then, we dropped 2
L of the PH on the asphalt background and measured the PH 500 times. We denote asphalt background and PH spectra as asphalt-only spectra and PH-on-asphalt spectra, respectively.
Figure 6b shows the PH-on-asphalt and asphalt-only spectra. In the PH-on-asphalt spectrum, the main peak of the PH is confirmed, however, some subpeaks of the PH are obscured by noise signatures. We also see the baseline throughout the entire band.
Figure 6c,d depict baselines estimated by several baseline correction algorithms from the PH-on-asphalt spectrum and asphalt-only spectrum, respectively. Like
Figure 4a,b, the AirPLS, the ArPLS, and the proposed baseline correction algorithm estimate the baseline.
Next, we also obtain the RMSME averages of 500 PH-on-asphalt spectra and 1600 asphalt-only spectra for baseline correction algorithms as shown in
Table 3. Except that RMSMEs of the IMF are higher than those of the RCF, the overall trend is the same as
Table 2. RMSMEs of the proposed algorithm are lower than other algorithms, which indicates that the proposed algorithm most accurately removes the baseline while preserving the Raman signal.
Finally, we acquired the ROC curves for each baseline correction algorithm with 500 PH-on-asphalt spectra and 1600 asphalt-only spectra, as shown
Figure 7. It can be seen that the proposed algorithm greatly improves the detection performance of the ASD. This result is in agreement with the result pertaining to the RMSME averages in
Table 3. Through these experiments, it is confirmed that the proposed baseline correction algorithm improves the detection performance of the ASD more than the other baseline correction algorithms.