1. Introduction
Laser Doppler Vibrometry (LDV), which is based on laser interferometry, is widely used in the field of precision measurement. In addition to vibration measurement, LDVs are also used for non-contact speech acquisition. Vibro-acoustic sensors based on LDV are a new type of voice acquisition equipment that is widely used in anti-terrorism operations, national security, and other related fields.
An LDV-based laser speech measurement system uses phase Doppler measuring techniques to obtain a speech signal by measuring the phase change of the optical signal caused by sound vibration. It is usually composed of a laser transmitter, laser receiver, amplifier and equalizer, together with some other important components. As shown in Figure 1, the laser transmitter emits an invisible narrow-band laser, which is divided into a reference beam and a measurement beam by a polarized beam splitter (PBS1). The measurement beam then passes through a beam splitter (BS3), focusing lens (L), and quarter-wave plate (P), and is focused on the vibrating object. The reflected beam is directed via acousto-optical modulators (AOM), and is then merged with the reference beam at BS2, where the two beams interfere. The laser receiver (photo-detector, D) receives the reference signal and the coherent signal. Demodulation, filtering and further signal processing steps are then performed to obtain the voice signal [1].
Because laser speech detection systems use lasers for voice measurement, they can detect speech without contact, undertake long-range measurements, and are easy to conceal and operate [2].
Laser speech detection systems are usually set up in hidden places to detect conversations in conference rooms, cars, etc. Figure 2 shows a scene where a laser speech detection system is in use, with objects near the speaker (such as computer screens, a tissue box, a mineral water bottle, clothes, etc.) serving as sound sensors. As the sound signal is captured through the indirect measurement of vibrations, the choice of detected object has a significant impact on the speech acquisition, as does the external environment. First of all, the surface of most detected objects is extremely rough at the laser wavelength, so scattered light is emitted from numerous coherent points. These scattered sources propagate in different directions and can interfere with each other in space, resulting in a random distribution of light interference and generating what is known as speckle noise (in speech, it appears as clicks and small burrs, as shown in the black box in Figure 3). In addition, when the measurement light is disturbed (by people walking, environmental occlusion, violent shaking, atmospheric turbulence, etc.), it not only becomes difficult to focus the measurement light on the surface of the detection beacon, but wavefront phase distortion also arises in the reflected light, resulting in destructive interference. This makes the Doppler phase obtained by the detection system discontinuous between −π and π [3]. These discontinuities lead to further kinds of speckle noise in the speech signal (such as bursts, outliers, crackles, scrapes, etc.), as shown in the red box in Figure 3.
In speech, speckle noise takes the form of impulsive noise, which seriously reduces the quality and intelligibility of the measured speech and significantly limits its usefulness for speech intelligence. Most people who listened to laser-detected signals reported that the speckle noise made them reluctant to listen and quickly fatigued them, resulting in a low speech recognition rate. In addition, because background noise is usually removed on the basis of a noise estimate, the irregular appearance of speckle noise degrades the accuracy of that estimate. Removing the speckle noise first is therefore also conducive to the subsequent background noise removal.
In previous work, we used a decorrelation method based on a linear prediction (LP) model to detect the location of the speckle noise by improving the Noise-to-Signal Ratio (NSR) of the detected signal, and designed an interpolator to replace the speckle noise [4]. However, that method used a direct threshold to judge the noise position. For very weak signals, this approach has limitations and the noise location accuracy is not high.
In this paper, we present a simple yet efficient technique that can restore laser-measured speech signals that are corrupted by speckle noise. The speckle noise detection method combines decorrelation preprocessing with the average short-term energy and kurtosis to extract the signal and locate the noise according to thresholds. It involves relatively little calculation, which increases the computing speed. The decorrelation preprocessing and the double threshold criterion greatly increase the noise positioning accuracy. Replacing contaminated samples with linearly coded samples is also efficient in restoring the signal and reducing the distortion. The results show that the proposed automatic noise detection and removal method outperforms other related methods across a wide range of degraded audio signals.
2. Related Works
Restoring audio signals that are corrupted by targeted speckle noise is a tricky process. For information acquisition, any information loss is fatal; thus, for laser speech detection it is imperative to find the location of the speckle noise and remove it in a targeted fashion instead of denoising the whole speech signal.
The restoration process can be divided into two steps: detection (finding the locations of the degraded samples) and interpolation (replacing the degraded samples with more suitable values). The noise detection step can be regarded as a form of Voice Activity Detection (VAD) [5,6]. VAD technology can be divided into frequency-domain methods and time-domain methods. Frequency-domain methods assume that the energy of the noise is concentrated in the high frequency band, while the energy of the speech is mainly distributed in the low frequency band [7]. However, because laser speech detection focuses on the vibration of an object and there are numerous potential sources of interference, the frequency characteristics of the speckle noise are not completely consistent. Therefore, it is very difficult to distinguish speckle noise from speech using frequency-domain processing.
Time-domain methods include energy-based endpoint detectors [8,9], zero-crossing rate-based methods [10], Autocorrelation Function (ACF) based methods [11] and different feature combination detection methods [12,13,14,15]. Energy-based noise detection methods use differences in energy to distinguish noise from speech. However, although the speckle noise energy in laser-detected speech is relatively high, in some cases the energy of a very short speckle event is close to the energy of the speech signal. Therefore, it is impossible to determine an appropriate threshold. Zero-crossing rate-based methods count the number of times the waveform of a speech frame crosses the horizontal axis. This reflects, in outline, the frequency characteristics of the signal. It is generally thought that a speech segment will have a short-time zero-crossing rate that is lower than a certain threshold, while that of the noise will be higher than the threshold [13]. However, the zero-crossing rate of noise in laser-detected speech can be low or high, because the causes of the noise differ. Therefore, it is not possible to set an appropriate threshold to distinguish between speckle noise and speech using a zero-crossing rate. Because speech is periodic, its ACF is also periodic, with the period being equal to the pitch value. The ACF shows peaks at the pitch and harmonic locations. Consequently, ACF-based algorithms are efficient in distinguishing between low-amplitude background noise and speech. However, they are not so effective for speckle noise.
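To make the zero-crossing rate criterion concrete, the following is a minimal sketch rather than any cited author's implementation. The function name `zero_crossing_rate`, the 16 kHz rate, the 320-sample frame and the two synthetic frames are all assumptions chosen for illustration.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


SAMPLE_RATE = 16000   # Hz (assumed)
FRAME_LEN = 320       # samples, i.e. 20 ms at 16 kHz (assumed)

# Hypothetical frames: a 100 Hz tone and a burst that flips sign every sample.
tone = [math.sin(2 * math.pi * 100 * n / SAMPLE_RATE + 0.3) for n in range(FRAME_LEN)]
burst = [(-1.0) ** n for n in range(FRAME_LEN)]

zcr_tone = zero_crossing_rate(tone)
zcr_burst = zero_crossing_rate(burst)
print(f"tone ZCR = {zcr_tone:.4f}, burst ZCR = {zcr_burst:.4f}")
```

The two frames sit at opposite ends of the range, which is the separation a zero-crossing threshold relies on. Speckle noise does not reliably fall at either end, which is why the text above rules this criterion out.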
Beyond the relatively straightforward time-domain and frequency-domain methods mentioned above, Cristalli [16] and Lv [17] have developed a kurtosis-based approach for the detection of speckle noise in laser-captured signals. As the kurtosis can measure the degree of deviation from a certain distribution [18], it can be applied to identify abnormal speckle samples. Their work introduced a kurtosis ratio (KR)-based method for the detection of speckle noise and the selection of undistorted regions within a signal. Their algorithm is composed of band-pass filtering, signal segmentation and computation of a scalar KR indicator for each signal segment, which can detect outlying samples that are caused by speckle noise. However, this method is not effective for long-duration speckle noise, because the distribution of the impulsive samples within such a segment becomes similar.
In speech, speckle noise appears as impulsive noise. Focusing directly upon impulsive noise, Oudre has proposed an Autoregressive (AR)-based impulsive noise detection [19] and interpolation [20] method. In the detection phase, he used an AR model to transform the original noisy signal into an excitation signal, while keeping the impulsive noise either unchanged or increased, in order to increase the detection accuracy. The impulsive noise is located by a direct threshold scaled by the estimated standard deviation of the excitation, after which the AR model can be used to generate samples to replace the noise samples and obtain an enhanced signal. This approach is very effective and can manage the targeted removal of impulsive noise. However, the impulsive noise detection accuracy is undermined by having to establish a direct threshold, especially if the transformed signal still contains a large amount of background noise.
Recently, some data-driven methods have been proposed to suppress noise [21,22]. However, they all focus on background noise. For instance, Braun [23] has proposed a neural network-based architecture for VAD that works on a typical short audio frame basis. While state-of-the-art neural network-based VADs can achieve very good results, they often exceed computational budgets and cannot meet real-time operating requirements. Goyal [24] has presented a novel method to computationally determine when video data contains a person speaking through the recognition of full-lip facial closures within a given interval. However, the timing of video and sound detection is often not consistent and the problem of noise when processing the voice remains. Therefore, it is not feasible to use mouth movements to detect noise in laser-captured speech.
In the interpolation phase, several methods have been developed for the interpolation of missing samples in music or speech signals, a problem that closely resembles noise removal once the noise has been accurately located. While some interpolation techniques, such as median filtering, are completely blind (no hypothesis regarding the signal is made) [25], they are often too crude to reconstruct gaps that are larger than a few samples. Alternatively, the noise frame can be set to zero. However, this destroys the periodicity of the speech signal, leading to frequency truncation, causing sudden changes in the frequency between the speech frame and the enhanced frame, and resulting in an audible “popping” sound. So, once a noise frame has been located, more appropriate methods need to be selected to obtain the enhanced signal.
In summary, although several studies have been devoted to detecting and removing noise in speech, most approaches focus on eliminating background noise. For methods based on a single criterion, the recognition rate for speckle noise is often low. Some novel data-driven methods have produced promising results, but they are computationally expensive and time consuming, and are therefore not suitable for applications that depend on real-time processing. In contrast to these methods, we fully analyze the characteristics of speckle noise in laser-measured speech and, on this basis, propose a novel automatic speckle noise detection and removal method. This method first foregrounds the noise using decorrelation based on a linear prediction (LP) model, which improves the NSR of the measured signal. This allows the position of the speckle noise to be detected through a combination of the average short-time energy and kurtosis. The method not only precisely locates small clicks (with a duration of just a few samples), but also finds the location of longer bursts and scratches (with a duration of up to a hundred samples). The located samples can then be replaced by more appropriate samples whose coding is based on the LP model. This strategy avoids unnecessary processing and obviates the need to compromise the quality of the relatively large fraction of samples that are unaffected by speckle noise. The proposed method offers high noise positioning accuracy, low distortion, a small amount of computation, fast processing speed and low delay, which suit the usage scenarios of laser speech measurement.
3. Automatic Noise Detection and Removal
In order to avoid unnecessary processing and the distortion that wholesale processing can cause, the proposed speckle noise removal system consists of two subsystems: a detector and an interpolator (cf. Figure 4). The detector locates the position of each noise sample and the interpolator replaces it.
The speckle noise detector plays a crucial role because accurate positioning is essential for effective noise removal. Speckle noise has certain characteristics, such as a large amplitude, random appearance, agitation, unpredictability, and indefinite duration. In view of these characteristics, a focus on energy and distribution can form the basis of distinguishing between speckle noise and speech. However, as previously mentioned, when there are only a few occurrences of speckle noise or the amplitude of the speech is large, the energy of the speckle noise will not stand out. Similarly, if the speckle noise lasts for a long time, its distribution will not seem to be abnormal. Thus, relying on any single parameter will not produce a good result. We have also seen how the character of the speech can itself challenge the detection accuracy. To solve this problem, we assume that the speech signal is correlated while the noise signal is uncorrelated. We can then begin by decorrelating the measured speech signal. Through decorrelation, the degraded audio signal is transformed into an extracted signal in which the influence of the speech signal and stationary background noise is largely eliminated.
For a decorrelated signal, we make full use of the characteristics of the average short-term energy, which is sensitive to long-duration and high-amplitude speckle noise, and the kurtosis, which is sensitive to short-duration abnormal click speckle noise. These two noise detection methods complement each other and can accurately find the location of the speckle noise in a decorrelated signal.
After obtaining the position of the noise frames, we can then use a recursive LP model-based interpolator to replace the samples that are distorted by the speckle noise, one by one. Thus, we finally obtain an enhanced signal. The speckle noise removal process is shown in Figure 4 and the detailed steps involved in the process are given below.
3.1. Decorrelation
An LP model predicts the future value of a signal from a linear combination of its past values. LP models are used for several applications. The correlation structure of a signal can be modelled using a linear predictor by taking the amplitude of the signal at time $m$ ($x(m)$), and then using a linearly-weighted combination of $P$ past samples ($x(m-1)$, $x(m-2)$, …, $x(m-P)$):

$$\hat{x}(m) = \sum_{k=1}^{P} a_k\, x(m-k) \tag{1}$$

where the integer variable $m$ is the discrete time index, $\hat{x}(m)$ is the prediction of $x(m)$ and $a_k$ are predictor coefficients. The linear predictor coefficients can be calculated by using the Levinson-Durbin algorithm [26].

The linear prediction model for a signal with an error estimation, $e(m)$, can be expressed as:

$$x(m) = \sum_{k=1}^{P} a_k\, x(m-k) + e(m) \tag{2}$$

Assume that a clean speech signal $x(m)$ is corrupted by a random additive speckle noise $n(m)$. The detected signal is given by:

$$y(m) = x(m) + n(m) \tag{3}$$

From Equations (2) and (3), we can rewrite the noisy signal model:

$$y(m) = \sum_{k=1}^{P} a_k\, x(m-k) + e(m) + n(m) \tag{4}$$

In practice, as there is no clean speech signal $x(m)$, we use the noisy speech $y(m)$ to calculate an estimate $\hat{\mathbf{a}}$ of the predictor coefficient vector $\mathbf{a}$. This can then be used to invert and transform the noisy signal $y(m)$ to the noisy excitation signal $v(m)$ as:

$$\begin{aligned} v(m) &= y(m) - \sum_{k=1}^{P} \hat{a}_k\, y(m-k) \\ &= x(m) + n(m) - \sum_{k=1}^{P} (a_k - \tilde{a}_k)\,[x(m-k) + n(m-k)] \\ &= e(m) + n(m) + \sum_{k=1}^{P} \tilde{a}_k\, x(m-k) - \sum_{k=1}^{P} \hat{a}_k\, n(m-k) \end{aligned} \tag{5}$$

where $\tilde{a}_k = a_k - \hat{a}_k$ is the error in the predictor coefficient estimate. According to Saeed’s [27] analysis of extracted signals, there are four basic terms that contribute to the noise in an excitation sequence:
(a) the error estimation $e(m)$;
(b) the speckle disturbance $n(m)$, which is usually the dominant term;
(c) the effect of the past noise samples, run over into the present time by the action of the inverse filtering;
(d) the increase in the variance of the excitation signal, caused by the error in the parameter vector estimate, and expressed as: $\sum_{k=1}^{P} \tilde{a}_k\, x(m-k)$.

As $n(m)$ is usually the dominant term in a noisy excitation signal $v(m)$, when a detected speech signal is converted to an excitation signal, the relative energy of the noise in the signal is increased. In other words, the NSR is increased. Before decorrelation, the NSR of a noisy signal is given by:

$$\mathrm{NSR}_{y} = \frac{\mathcal{E}[n^2(m)]}{\mathcal{E}[x^2(m)]} \tag{6}$$

where $\mathcal{E}[\cdot]$ is the expectation operator. After applying the inverse filter, the NSR is expressed as:

$$\mathrm{NSR}_{v} = \frac{\mathcal{E}[n^2(m)]}{\mathcal{E}[e^2(m)]} \tag{7}$$

The overall gain of the NSR can be obtained by:

$$G = \frac{\mathrm{NSR}_{v}}{\mathrm{NSR}_{y}} = \frac{\mathcal{E}[x^2(m)]}{\mathcal{E}[e^2(m)]} \tag{8}$$

This simple analysis demonstrates that an improvement in speckle noise detectability depends on the characteristics of the power amplification of the linear predictor model and the associated resonances.
Figure 5 shows a comparison between a raw measured signal and a decorrelated signal. It can be seen that the speech signal is largely removed and the speckle noise in the signal is highlighted.
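The decorrelation step can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the predictor order of 10, and the synthetic test signal (a resonant second-order autoregressive process standing in for speech, plus one added click) are all hypothetical. The power ratio it prints is only a rough stand-in for the gain in NSR, since it is computed from the noisy signal rather than from the unavailable clean speech.

```python
import random


def autocorrelation(y, max_lag):
    """Biased autocorrelation estimates r[0..max_lag]."""
    n = len(y)
    return [sum(y[i] * y[i + lag] for i in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return [a_1, ..., a_P] such that x_hat(m) = sum_k a_k * x(m - k)."""
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error power
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) input
            break
        refl = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        prev = a[:]
        a[i] = refl
        for k in range(1, i):
            a[k] = prev[k] - refl * prev[i - k]
        err *= 1.0 - refl * refl
    return a[1:]


def decorrelate(y, order):
    """Inverse-filter y with LP coefficients estimated from y itself."""
    a_hat = levinson_durbin(autocorrelation(y, order), order)
    v = []
    for m in range(len(y)):
        pred = sum(a_hat[k - 1] * y[m - k]
                   for k in range(1, order + 1) if m - k >= 0)
        v.append(y[m] - pred)
    return v, a_hat


def synthetic_noisy_signal(n=2000, click_at=500, seed=0):
    """Hypothetical test signal: resonant AR(2) 'speech' plus one click."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n - 2):
        x.append(1.6 * x[-1] - 0.9 * x[-2] + rng.gauss(0.0, 0.1))
    y = x[:]
    y[click_at] += 1.0
    return y


def power(s):
    """Mean squared value of a sequence."""
    return sum(t * t for t in s) / len(s)


y = synthetic_noisy_signal()
v, _ = decorrelate(y, order=10)
gain = power(y) / power(v)           # rough stand-in for the NSR gain
peak = max(range(len(v)), key=lambda m: abs(v[m]))
print(f"power ratio y/v: {gain:.1f}, largest |v| at sample {peak}")
```

In this toy case the click is comparable in size to the surrounding waveform before filtering, but stands well clear of the residual afterwards. The largest residual may fall one or two samples after the click itself, because past noise samples leak through the inverse filter.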
3.2. Speckle Noise Detection
The average short-time energy and kurtosis are used as judgment indexes to distinguish the noise. The average short-time energy reflects the mean of the squared sample values of a frame. The average short-term energy $E(m)$ of a speech signal $s$ at time $m$ is expressed as:

$$E(m) = \frac{1}{N} \sum_{i=m-N+1}^{m} s^2(i) \tag{9}$$

where $N$ is the window length and each window length represents one frame. As the extracted signal enhances the energy ratio of the speckle noise, the average short-term energy can be used to locate the speckle noise with a long duration and large amplitude.
The kurtosis, by contrast, is defined as the fourth central statistical moment normalized by the fourth power of the standard deviation. This describes the degree to which a sample deviates from the distribution:

$$K(s_j) = \frac{\mathcal{E}\left[(s_j - \mu_j)^4\right]}{\sigma_j^4} \tag{10}$$

where $K(s_j)$ is the kurtosis value of signal $s$, $s_j$ contains the samples of the $j$-th frame, $\mu_j$ and $\sigma_j$ are respectively the mean value and standard deviation of $s_j$, and $\mathcal{E}[\cdot]$ stands for the overall average.
Using the kurtosis as a method for locating statistical anomalies provides a strong sensitivity to sudden anomalies with a very short duration and a large numerical value.
From the definitions given above, the average short-time energy is sensitive to high-amplitude noise with a long duration, but it can easily miss speckle noise with a very short duration and a slightly smaller amplitude. The kurtosis, in contrast, is sensitive to sudden anomalies with a very short duration. Thus, the two methods are complementary when applied to the detection of different types of noise in laser-detected speech.
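The complementary behaviour of the two features can be checked numerically with a small sketch. The function names, the rectangular (unweighted) frame, and the three synthetic frames below are assumptions made for illustration; they are not taken from the measured data in this paper.

```python
import random


def split_frames(signal, frame_len, hop):
    """Overlapping frames of frame_len samples taken every hop samples."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean of the squared sample values in one frame."""
    return sum(s * s for s in frame) / len(frame)


def kurtosis(frame):
    """Fourth central moment divided by the squared variance."""
    n = len(frame)
    mean = sum(frame) / n
    var = sum((s - mean) ** 2 for s in frame) / n
    if var == 0.0:
        return 0.0                   # constant frame: no meaningful kurtosis
    m4 = sum((s - mean) ** 4 for s in frame) / n
    return m4 / (var * var)


rng = random.Random(1)
clean = [rng.gauss(0.0, 0.1) for _ in range(320)]   # low-level residual
click = clean[:]
click[160] = 1.0                                    # one-sample click
burst = [rng.gauss(0.0, 1.0) for _ in range(320)]   # frame-long loud burst

for name, frame in (("clean", clean), ("click", click), ("burst", burst)):
    print(f"{name}: energy={short_time_energy(frame):.4f}, "
          f"kurtosis={kurtosis(frame):.1f}")
```

With these inputs the click barely raises the frame energy but raises the kurtosis sharply, while the frame-long burst does the opposite. That is the reason for testing both features rather than either one alone.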
Figure 6a,b show the short-time energy and kurtosis values of a raw detected speech signal and of the signal extracted by decorrelation, respectively. Reading Figure 6a vertically, the average energy of the long-duration speckle noise in the red dashed box is much greater than that of the voice. However, the average short-term energy of the sharp, very short speckle noise in the black box is similar to, or even lower than, that of the voice signal. At the same time, the kurtosis value of the frame in the black box is very large. This confirms our proposition that the average short-term energy and kurtosis are complementary. If we then make a horizontal comparison between Figure 6a,b, it can be clearly seen that, by removing the influence of the correlated speech signal, the average short-term energy and kurtosis values of the speckle noise in the extracted signal are greater than they are in the raw signal, confirming that decorrelation can increase the detectability of the noise.
Traditional methods usually use a fixed threshold, but as the distance, detected objects and speaker volume can all change, the threshold needs to be reinitialized on each occasion, which reduces the detection efficiency. Therefore, we use the ratio of the present value to the past average to detect the speckle noise. If the energy of the current frame is $\lambda_E$ times greater than the average energy $\bar{E}$, or its kurtosis is $\lambda_K$ times greater than the average kurtosis $\bar{K}$, the current frame is judged to contain speckle noise.
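The ratio test against a running average might be sketched as follows. Several details here are assumptions rather than statements of the authors' method: the function and parameter names are hypothetical, the ratio values used in the example are placeholders with no connection to the values used in the experiments, and keeping flagged frames out of the running average is a design choice made for this sketch.

```python
from collections import deque


def detect_noisy_frames(energies, kurtoses, energy_ratio, kurtosis_ratio, history):
    """Flag frames whose energy or kurtosis exceeds a multiple of the past average.

    `history` is the number of past frames kept in the running averages.
    """
    past_e = deque(maxlen=history)
    past_k = deque(maxlen=history)
    flags = []
    for e, k in zip(energies, kurtoses):
        if past_e:
            avg_e = sum(past_e) / len(past_e)
            avg_k = sum(past_k) / len(past_k)
            noisy = e > energy_ratio * avg_e or k > kurtosis_ratio * avg_k
        else:
            noisy = False            # nothing to compare the first frame with
        flags.append(noisy)
        if not noisy:                # assumption: flagged frames do not
            past_e.append(e)         # contaminate the running averages
            past_k.append(k)
    return flags


# Placeholder feature values and ratios, for illustration only.
energies = [1.0, 1.0, 1.0, 10.0, 1.0]
kurtoses = [3.0, 3.0, 30.0, 3.0, 3.0]
flags = detect_noisy_frames(energies, kurtoses,
                            energy_ratio=3.0, kurtosis_ratio=3.0, history=100)
print(flags)
```

Because each frame is compared with the recent past rather than with an absolute level, the same ratios can be kept when the measurement distance, the detected object or the speaker volume changes.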
3.3. Coding Samples to Replace Noisy Samples
So far, we have focused on the method for locating every sample containing speckle noise. We then use a recursive LP model-based interpolator to replace the samples distorted by the noise, one by one. The basic idea of an LP model-based interpolator for noise removal can be expressed as follows: “The present value of a speech sample can be approximated by a weighted linear combination of past values of several speech samples” [27]. Samples irrevocably distorted by speckle noise are discarded and the resulting gap is interpolated from the samples to its left or right. Firstly, the available (clean) samples preceding a noise pulse are used to estimate the linear predictor coefficients for the linear prediction model of the signal. Afterwards, the estimated model parameters and the samples on the left of the gap are used to interpolate the polluted samples.
For quasiperiodic signals, such as voiced speech, there are two types of correlation structures that can be utilized for an interpolation:
(1) the short-term correlation, which is the correlation of each sample with the immediate past samples $x(m-1)$, …, $x(m-P)$;
(2) the long-term correlation, which is the correlation of a sample with similar samples a pitch period away.
Because speckle noise usually contaminates only a relatively small fraction α of all the samples, the run of samples to be interpolated is short, and the purpose is only to remove the influence of the speckle noise. Therefore, we do not use the periodic structure for interpolation, but only the short-term correlation. That is, a sample $x(m)$ is linearly predicted from $P$ past samples. This can be defined as:

$$\hat{x}(m) = \sum_{k=1}^{P} a_k\, x(m-k) \tag{11}$$

where $\hat{x}(m)$ is the encoded sample at the location of the speckle noise, $P$ is the coding order and $a_k$ are predictor coefficients calculated using the Levinson-Durbin algorithm.
Note that the interpolation is recursive: each sample used in the prediction is the most recent value available, including any samples that have already been replaced. The advantage of using an LP model-based interpolator to replace contaminated samples is that it avoids truncation of the signal and keeps the signal consistent. It is not only effective in predicting the content of the signal, but also significantly improves the auditory character of the speech.
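A minimal sketch of the recursive replacement is given below. To keep the example short and checkable, the predictor coefficients are supplied directly instead of being estimated with the Levinson-Durbin algorithm: a pure sinusoid obeys an exact second-order recursion, so the result can be compared with the known clean signal. The function name, the test tone and the corrupted positions are hypothetical.

```python
import math


def lp_interpolate(signal, noisy_idx, coeffs):
    """Replace flagged samples by their linear prediction from past samples.

    coeffs holds [a_1, ..., a_P]. Indices are processed in increasing order,
    so later predictions use samples that have already been replaced.
    """
    out = list(signal)
    for m in sorted(noisy_idx):
        out[m] = sum(a * out[m - k]
                     for k, a in enumerate(coeffs, start=1) if m - k >= 0)
    return out


# Hypothetical test case: sin(w*m) satisfies x(m) = 2*cos(w)*x(m-1) - x(m-2).
w = 2 * math.pi * 200 / 16000
clean = [math.sin(w * m) for m in range(200)]
coeffs = [2 * math.cos(w), -1.0]

corrupted = clean[:]
gap = range(50, 55)
for m in gap:
    corrupted[m] = 5.0               # a five-sample burst of speckle noise

restored = lp_interpolate(corrupted, gap, coeffs)
max_err = max(abs(r - c) for r, c in zip(restored, clean))
print(f"largest restoration error: {max_err:.2e}")
```

Real speech is not an exact low-order recursion, so the restored samples will only approximate the missing ones, and the error grows with the length of the gap.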
3.4. Parameter Setting
In the proposed automatic noise detection and removal method, the signal is divided into overlapping frames of length $N$ with a hop size of $H$ samples. In practice, we chose $H = N/4$, which corresponds to a 75% overlap. If the frame length $N$ is greater than the maximum length of the high-energy noise $L_{\max}$, the average energy value will incorporate background noise or speech signals with a smaller amplitude. This makes the threshold ratio $\lambda_E$ hard to set, so that noise is missed or erroneously detected. Conversely, if the frame length $N$ is too small, the continuity of an impulse is ignored: only the peaks are located, and the values between the peaks are not detected. Furthermore, in the case of the kurtosis, if the frame length $N$ is too short, applying a statistical measure becomes meaningless. In terms of delay, an overly long frame length $N$ leads to an excessive time delay.
In a laser speech measurement system, the duration of speckle noise is variable but generally ranges between 5 ms and 20 ms. Therefore, we set the frame length to the largest potential value of 20 ms (i.e., 320 samples at a 16 kHz sampling rate) in order to fulfill the requirements for the time delay, and to provide a segment length that is sufficient for reliable estimation of the average energy and kurtosis.
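The frame arithmetic can be written out as a short check. The helper name `frame_parameters` is hypothetical. The 10 s averaging span in the last line is the value quoted in this section, and the frame count derived from it assumes one feature value per hop.

```python
def frame_parameters(sample_rate_hz, frame_ms, overlap):
    """Frame length and hop size in samples for a given duration and overlap."""
    frame_len = round(sample_rate_hz * frame_ms / 1000)
    hop = round(frame_len * (1 - overlap))
    return frame_len, hop


frame_len, hop = frame_parameters(sample_rate_hz=16000, frame_ms=20, overlap=0.75)
frames_in_10_s = 10 * 16000 // hop   # frames spanned by a 10 s running average
print(frame_len, hop, frames_in_10_s)
```

With a 20 ms frame and 75% overlap, a new frame starts every 5 ms, so a 10 s average spans 2000 frames.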
As for the coding order $P$, i.e., how many samples are used to predict and replace the contaminated samples, the prediction error for a speech signal decreases as the predictor order increases. Saeed [27] stated that the interpolation error depends on the model order, and that a model order of two to three times the length of the missing data sequence usually achieves a good result. Janssen [28] also gives a rule for choosing the order. However, the algorithmic complexity of calculating the prediction coefficients, which grows with the coding order, also needs to be considered. Indeed, it seems fair to assume that $P$ should be at least greater than the number of missing samples, so that only known samples are used for the reconstruction [20]. $P$ is therefore set equal to the frame length $N$, because we previously set the frame length as the maximum length of the speckle noise $L_{\max}$. This choice is also advantageous because the linear predictor coefficients calculated during the decorrelation stage can be reused in the final coding step, which greatly reduces the computational complexity.
The threshold ratios $\lambda_E$ and $\lambda_K$ directly determine the detection accuracy. Several experiments identified values of these two ratios for which all the speckle noise is visibly removed without any false detection.
The average values $\bar{E}$ and $\bar{K}$ were dynamically adjusted during the experiments in order to assess what would provide sufficient sensitivity with a minimum number of false alarms. Everyone has to pause to breathe when speaking, and, on average, this occurs every 10 s. Therefore, the average energy and kurtosis over a 10 s span were taken as the basis of these calculations.
Figure 7 shows the result of locating and replacing the speckle noise in an example of laser-measured speech. Figure 7a is the raw noisy signal. Figure 7b shows the noise locations given by the proposed method. The black box shows the noise points located by energy discrimination and the positions marked by an asterisk are the noise points identified by kurtosis. Figure 7c is the speech enhanced by replacing the noise points. For comparison, Figure 7d shows the pure voice.