1. Introduction
In real-world environments, speech is often corrupted by background noise, which severely degrades speech intelligibility for human listeners and makes downstream tasks such as automatic speech recognition (ASR) more challenging. Therefore, speech enhancement, which suppresses the background noise to recover clean speech, has been studied extensively for decades, and numerous methods have been proposed in this field.
Two primary approaches exist for speech enhancement: single-channel and multi-channel. Single-channel speech enhancement operates on a single microphone input, while multi-channel speech enhancement utilizes multiple microphone inputs to enhance the speech signal. In this study, we mainly focus on single-channel speech enhancement, as single-channel speech signals can be collected using only one microphone, making them more common in real-life scenarios.
In the context of single-channel speech enhancement, traditional methods include spectral-subtraction algorithms [1], Wiener filtering [2], non-negative matrix factorization [3], spectrogram inversion [4], etc. These methods are generally computationally efficient and exhibit good domain generalization. However, they assume that the background noise is stationary, that is, that its spectral and temporal characteristics remain constant over time, enabling accurate estimation of its statistical properties. In the presence of non-stationary background noise, accurately capturing these statistical properties becomes challenging because the noise continuously changes over time, leading to a substantial performance degradation of these methods [5].
In recent years, methods based on deep neural networks (DNNs) have shown superior capability in dealing with non-stationary noise compared with traditional methods in both single-channel and multi-channel speech enhancement. These methods typically train DNNs to learn a mapping from noisy speech to clean speech, and the enhanced speech tends to achieve high scores on enhancement metrics. However, recent studies [6,7] have shown that DNNs can introduce distortions into the enhanced speech, which degrade the performance of downstream tasks such as ASR.
To minimize such distortions in multi-channel speech enhancement, recent studies [8,9] have combined DNNs with low-distortion multi-channel filters. These studies have achieved excellent results not only on speech enhancement but also on the downstream ASR task.
Although [8,9] have demonstrated that low-distortion filters can improve the performance of neural networks in multi-channel scenarios, it should be noted that such filters can exploit the spatial information inherent in microphone arrays, which is not available in single-channel scenarios. Therefore, inspired by these studies, we aim to investigate the performance of combining DNNs and conventional filters on single-channel speech enhancement and the downstream ASR task. We adopt a two-stage framework similar to that of [8], since we believe single-stage networks may suffer from performance bottlenecks when recovering clean speech from degraded speech in challenging scenarios.
The main difference between our method and the one proposed in [8] is that our method focuses on single-channel speech enhancement, so the networks and filters are designed for single-channel speech signals. Specifically, our method consists of two DNNs, which have the same architecture but do not share parameters, and a single-channel filter module. In the first stage, the first DNN is trained to generate an enhanced spectrum from the noisy spectrum. In the second stage, the first DNN is fixed, and its output is used to compute the single-channel filter. The filtered result and the output of the first DNN are then used as extra features to guide the training of the second DNN toward better enhancement.
The rest of this study is organized as follows. In Section 2, we introduce related research. The framework and DNN architecture of the proposed method, the training strategy, and the single-channel filter are introduced in Section 3. Section 4 describes the experimental setup, and Section 5 provides the experimental results and analysis. Conclusions are given in Section 6.
3. Method
3.1. Framework and Network Architecture
The proposed system is illustrated in Figure 1. Its objective is to remove background noise while minimizing distortion, a more difficult problem than noise removal alone. To tackle this challenge, we adopt a problem-decomposition approach and integrate two neural networks, DNN1 and DNN2. DNN1 aims to suppress noise components coarsely, while DNN2 is responsible for further noise removal and for minimizing the distortion introduced during enhancement.
To this end, we adopt the following two measures. Firstly, we introduce a filter module that calculates a low-distortion filter based on the enhancement result of DNN1 (denoted X̂_1) and applies it to the noisy spectrum Y. The filtered output (denoted X̂_f) is then concatenated with X̂_1 and fed into DNN2 as extra features. The purpose of this design is two-fold. First, since a single-channel signal carries no spatial information that can be exploited, we hope the output of the multi-frame filter can convey interframe correlation to DNN2. Second, we expect the low-distortion filtered result to provide information that is complementary to the output of DNN1, thus helping to produce less distorted speech, which is beneficial to downstream tasks such as ASR.
Secondly, we concatenate Y with X̂_1 and X̂_f as the input for DNN2, inspired by [6]. Through theoretical analysis and experiments, [6] showed that adding a scaled version of the noisy signal to the enhanced signal can monotonically increase the signal-to-artifact ratio under mild conditions and improve ASR performance. Thus, we believe that concatenating Y with X̂_1 and X̂_f can provide essential information that helps DNN2 reduce distortion during training.
We employ the TCN-DenseUNet described in [8] and modify it into a single-channel version for DNN1 and DNN2. Although both DNN1 and DNN2 adopt the TCN-DenseUNet structure, they do not share parameters. The main reason is that DNN1 and DNN2 focus on different tasks: DNN1 is concerned with removing as much background noise as possible, whereas DNN2 aims to make the most of the filter output and the noisy speech to minimize distortion. We believe that not sharing parameters between the two networks leads to better performance, which is supported to some extent by the experimental results in Section 5.4.
TCN-DenseUNet is a variant of U-Net, with a temporal convolutional network (TCN) inserted between the encoder and decoder and DenseNet blocks inserted between the layers of the encoder and decoder. Figure 2 shows the diagram of the TCN-DenseUNet. The encoder contains a 2D convolution layer and seven convolutional blocks, while the decoder contains seven deconvolutional blocks and a 2D deconvolution layer. Skip connections are added between the encoder and decoder. Each convolutional block consists of a 2D convolution layer, an exponential linear unit (ELU) nonlinearity, and an instance normalization (IN) layer. The deconvolutional block has the same structure, except that the 2D convolution layer is replaced with a 2D deconvolution layer. The DenseNet blocks consist of five convolutional blocks, and the TCN contains four layers, each with seven dilated convolutional blocks.
The detailed setup of the TCN-DenseUNet is also shown in Figure 2. A DenseNet block is denoted DenseBlock(g1, g2), where g1 and g2 are the growth rates of the first four and the last convolutional block, respectively. The other convolutional blocks are denoted in the form (K, S, P, C), where K, S, P, and C are the kernel size, stride, padding, and number of output channels, respectively.
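To make the block structure concrete, the following is a minimal PyTorch sketch of the convolutional and deconvolutional blocks described above. The kernel size, stride, and padding values are illustrative placeholders, not the exact per-layer settings listed in Figure 2.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """2D convolution -> ELU -> instance normalization, as in the encoder blocks."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 2), padding=(1, 1)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, stride, padding)
        self.act = nn.ELU()
        self.norm = nn.InstanceNorm2d(out_ch, affine=True)

    def forward(self, x):
        return self.norm(self.act(self.conv(x)))

class DeconvBlock(nn.Module):
    """Same structure, but with a transposed convolution, as in the decoder blocks."""
    def __init__(self, in_ch, out_ch, kernel=(3, 3), stride=(1, 2), padding=(1, 1)):
        super().__init__()
        self.conv = nn.ConvTranspose2d(in_ch, out_ch, kernel, stride, padding)
        self.act = nn.ELU()
        self.norm = nn.InstanceNorm2d(out_ch, affine=True)

    def forward(self, x):
        return self.norm(self.act(self.conv(x)))
```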
3.2. Two-Stage Training Strategy
A two-stage training strategy is employed to train the system. Specifically, in the first stage, only DNN1 is trained: its input is the noisy spectrum Y, and it estimates the spectrum of the clean speech. In the second stage, the well-trained DNN1 is fixed, and its output X̂_1 is first fed into the filter module to calculate a multi-frame filter. The filter is then applied to the noisy spectrum to obtain the filtered spectrum X̂_f. Finally, Y, X̂_1, and X̂_f are concatenated and fed into DNN2, which outputs the final estimated spectrum of the clean speech, X̂_2. In a nutshell, the two stages can be summarized as X̂_1 = DNN1(Y) in the first stage and X̂_2 = DNN2(Y, X̂_1, X̂_f) in the second stage, where X̂_f is obtained by applying the filter computed from X̂_1 and Y.
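The following Python functions sketch one optimization step of each stage, assuming PyTorch-style modules. The names dnn1, dnn2, filter_module, and loss_fn, as well as the channel-wise concatenation, are placeholders for illustration rather than the authors' implementation.

```python
import torch

def stage1_step(dnn1, optimizer1, loss_fn, Y, X):
    """Stage 1: train DNN1 alone to map the noisy spectrum Y to the clean spectrum X."""
    X1_hat = dnn1(Y)
    loss = loss_fn(X1_hat, X)
    optimizer1.zero_grad(); loss.backward(); optimizer1.step()
    return loss

def stage2_step(dnn1, dnn2, filter_module, optimizer2, loss_fn, Y, X):
    """Stage 2: DNN1 is frozen; DNN2 is trained on Y, DNN1's estimate, and the
    multi-frame-filtered spectrum, concatenated along the channel dimension."""
    with torch.no_grad():                      # DNN1 and the filter are not updated
        X1_hat = dnn1(Y)
        Xf_hat = filter_module(X1_hat, Y)      # MFMVDR / MFWF applied to the noisy spectrum
    X2_hat = dnn2(torch.cat([Y, X1_hat, Xf_hat], dim=1))
    loss = loss_fn(X2_hat, X)
    optimizer2.zero_grad(); loss.backward(); optimizer2.step()
    return loss
```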
3.3. Multi-Frame MVDR
The MFMVDR was first proposed in [12]. It considers the correlation between consecutive time frames to obtain better enhancement performance.
The signal model for the MFMVDR is
y(t) = x(t) + n(t),(5)
where y(t), x(t), and n(t) are the noisy speech, the clean speech, and the additive noise, respectively. Using the STFT, Equation (5) can be rewritten as
Y(k, l) = X(k, l) + N(k, l),(6)
where k and l denote the frequency bin and time frame indices.
In order to model the interframe correlation, an L-dimensional noisy speech vector ȳ(k, l) is defined by stacking the STFT coefficients of L consecutive frames, where L is the number of frames used to calculate the MFMVDR filter. The clean speech vector x̄(k, l) and the noise vector n̄(k, l) are defined similarly.
The MFMVDR filter is given by Equation (8), in which I_L denotes the L × L identity matrix, e is the first column of I_L, tr(·) is the trace operation, and the speech and noise covariance matrices Φ_x and Φ_n are defined by Equations (9) and (10), respectively. By viewing the first-stage estimate X̂_1 as the clean speech X, we can calculate the clean-speech statistics; Φ_n can then be calculated using Equation (10), and finally, we obtain the MFMVDR filter.
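For context, a standard formulation of the MFMVDR filter in the multi-frame noise-reduction literature (following, e.g., [12]) can be written as

\mathbf{w}_{\mathrm{MFMVDR}}(k,l) = \frac{\boldsymbol{\Phi}_n^{-1}(k,l)\,\boldsymbol{\gamma}_x(k,l)}{\boldsymbol{\gamma}_x^{\mathsf{H}}(k,l)\,\boldsymbol{\Phi}_n^{-1}(k,l)\,\boldsymbol{\gamma}_x(k,l)}, \qquad \boldsymbol{\gamma}_x(k,l) = \frac{\boldsymbol{\Phi}_x(k,l)\,\mathbf{e}}{\mathbf{e}^{\mathsf{H}}\,\boldsymbol{\Phi}_x(k,l)\,\mathbf{e}},

where \boldsymbol{\gamma}_x(k,l) is the normalized speech interframe correlation vector and \mathbf{e} is the first column of I_L; this general form is given here only as a reference, and its notation may differ from that of Equation (8).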
It should be noted that calculating Φ_n using Equation (10) yields a matrix of rank 1, i.e., a singular matrix, whereas Equation (8) requires the inverse of this matrix. To address this issue, we use the following methodology in this study. First, instead of Equations (9) and (10), we estimate Φ_x and Φ_n by recursive averaging across time frames, where λ_x and λ_n are the corresponding forgetting factors. Second, we apply diagonal loading [23] to Φ_x and Φ_n to improve the robustness of the training stage.
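As an illustration, a minimal NumPy sketch of these estimation steps is given below. The recursive-averaging form, the loading factor, and the function names are assumptions made for exposition, not the authors' exact equations.

```python
import numpy as np

def update_covariance(phi_prev, vec, lam):
    """Recursive averaging with forgetting factor lam:
    phi(l) = lam * phi(l-1) + (1 - lam) * vec(l) vec(l)^H."""
    return lam * phi_prev + (1.0 - lam) * np.outer(vec, vec.conj())

def diagonal_loading(phi, eps=1e-4):
    """Add a small scaled identity so that phi remains invertible."""
    L = phi.shape[0]
    return phi + eps * (np.trace(phi).real / L) * np.eye(L)

def mfmvdr_weights(phi_n, gamma_x):
    """MFMVDR solution w = phi_n^{-1} gamma_x / (gamma_x^H phi_n^{-1} gamma_x)."""
    num = np.linalg.solve(phi_n, gamma_x)
    return num / (gamma_x.conj() @ num)
```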
4. Experiment
4.1. Dataset and Evaluation Metrics
Experiments are performed on the WHAM! dataset [24]. WHAM! was originally designed for speech separation in noisy environments: it pairs each two-speaker mixture in the wsj0-2mix dataset [25] with real-world noise recorded in urban environments such as coffee shops, restaurants, bars, office buildings, and parks [24]. WHAM! also provides a version for speech enhancement, in which only the speech of the first speaker is mixed with the noise at SNRs randomly sampled between −6 and +3 dB. The training, development, and test sets contain 20,000, 5000, and 3000 utterances, respectively. The training and development sets share common speakers, whereas the test-set speakers are different. To evaluate the generalizability of the proposed model, we further adopt the one-channel test set of CHiME-4 [26] for evaluation, using both the simulated and the real-world noisy utterances.
Four widely used objective metrics are used to evaluate the enhanced speech: narrow-band Perceptual Evaluation of Speech Quality (PESQ-NB) [27], Short-Time Objective Intelligibility (STOI) [28], Scale-Invariant Signal-to-Noise Ratio (SI-SNR) [29], and Signal-to-Distortion Ratio (SDR) [30]. For all of these metrics, higher values are better. The Word Error Rate (WER), for which lower is better, indicates the performance of the enhanced speech on the ASR task.
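As a reference, SI-SNR between an estimate and the clean signal can be computed as in the following sketch, which follows the common zero-mean, scale-invariant definition associated with [29].

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference waveform."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref  # projection onto ref
    noise = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```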
4.2. Baselines and Experimental Settings
The proposed model is compared with several popular baseline models, including LSTM, Conv-TasNet, and DPRNN.
Figure 3 shows the diagrams of these baseline models, and detailed descriptions of all the models are provided below.
LSTM: Three layers of bidirectional LSTM, each with 512 hidden units, followed by a linear layer with 257 output units and a tanh activation function. A dropout layer with dropout probability 0.4 is applied to the outputs of each LSTM layer except the last one.
Conv-TasNet: The encoder and decoder are symmetric 1D convolution layers. The mask estimator comprises a layer normalization and a 1D convolution layer with a kernel size of 1 (1 × 1-conv block) and 256 output channels. It is followed by 8 convolutional blocks with a kernel size of 3, 512 output channels, and dilation factors increasing from 1 to 2^7, and this stack of blocks is repeated 4 times. Finally, a 1 × 1-conv block with a ReLU activation function is employed to estimate the mask.
DPRNN: The encoder and decoder are symmetric 1D convolution layers. Six DPRNN blocks as described in [21] are used to predict the mask. The DPRNN blocks utilize intra-block and inter-block RNNs, both of which adopt residual connections. These RNNs consist of bidirectional LSTMs with 128 hidden units, linear layers with 64 output units, and layer normalization. A dropout layer with a 0.1 dropout rate is inserted between the LSTM and the linear layer.
All utterances are resampled to 8 kHz. For the T-F domain models used in the experiments, an STFT with a frame size of 64 ms, a frame shift of 16 ms, and a 64 ms Hann window is employed to extract the features. For Conv-TasNet, the encoder is a 1D convolution layer with a kernel size of 40, a stride of 20, and 256 output channels. For DPRNN, the kernel size and stride of the convolutional encoder are 2 and 1, respectively, and the number of output channels is 64.
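At 8 kHz, these settings correspond to 512-sample frames and 128-sample hops. A minimal PyTorch sketch of the T-F feature extraction might look as follows; the centering behavior and input layout are assumptions rather than the exact implementation.

```python
import torch

def extract_spectrum(wave_8k: torch.Tensor) -> torch.Tensor:
    """STFT with 64 ms frames (512 samples), 16 ms hops (128 samples), Hann window."""
    n_fft, hop = 512, 128
    spec = torch.stft(wave_8k, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    # Stack real and imaginary parts as input channels (257 frequency bins).
    return torch.stack([spec.real, spec.imag], dim=-3)
```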
The LSTM takes the magnitude spectrum of the speech as input and estimates a magnitude mask; its batch size is set to 16. Conv-TasNet and DPRNN accept raw waveforms as input and directly output enhanced waveforms; their batch sizes are 8 and 4, respectively. The input to the proposed model is the stacked real and imaginary (RI) components of the noisy spectrum, the output is the estimated RI components, and the batch size is set to 16.
All the models are optimized using Adam [31] with a learning rate of 1.0 × 10^{−n}. The negative Signal-to-Noise Ratio (SNR) is used as the loss function. The maximum number of training epochs is 100. For T-F domain models, training stops if the loss on the development set does not decrease for 10 consecutive epochs; for time-domain models, this number is 4. We set the forgetting factors λ_x and λ_n to 0.6 in this study.
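A negative-SNR loss of the kind referred to above can be sketched as follows; averaging over the batch is an assumption.

```python
import torch

def negative_snr_loss(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8):
    """Negative SNR in dB between estimated and clean waveforms; minimizing it maximizes SNR."""
    noise = estimate - target
    snr = 10.0 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
    return -snr.mean()
```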
5. Results and Discussion
5.1. Effect of the Padding Mode on Multi-Frame Filters
The enhancement performance of the multi-frame filters produced by the filter module has an impact on the proposed system, so we need to set appropriate hyperparameters for the filters to improve the performance of the whole system. Since the MFMCWF filter achieves better performance than the MVDR filter in [9], in this study, in addition to the MFMVDR filter, we also let the filter module output a multi-frame Wiener filter (MFWF), in order to explore the performance gap between the two under single-channel conditions.
There are two hyperparameters for the filters: the padding mode and the total number of frames. Here, the padding mode refers to how many of the frames lie to the left of (i.e., before) the current frame once the total number of frames is fixed; for example, with 17 frames in total, a padding mode of 8 places 8 frames before and 8 frames after the target frame. In order to find an appropriate padding mode for the multi-frame filters, we set the total number of frames to 17 and feed the enhanced results from stage 1 and the noisy speech to the filter module. The calculated filter is then used to enhance the noisy speech.
Table 1 shows the effect of the padding mode on MFMVDR; the impact on MFWF has the same trend, so for brevity the corresponding table is not given here. In the first row, all the frames are padded on the right side of the target frame, which means the filter uses only future information; similarly, the last row means the filter uses only history information. All the other rows correspond to using both history and future information. From Table 1, we can see that using both history and future information gives the filter better enhancement performance. Since the results in Table 1 are obtained with the total number of frames set to 17, without loss of generality, we decided to pad the same number of frames on both sides of the target frame in subsequent experiments.
5.2. Effect of the Total Number of Frames on Multi-Frame Filters
After determining the padding mode, we further explore the effect of the total number of frames on the multi-frame filters, since the total number of frames used to calculate the filters determines how much context information is available. If the total number of frames is too small, the available information will be limited, which will affect the enhancement performance. However, once the total number of frames grows beyond a certain size, the computational overhead will exceed the performance gain.
Table 2 shows the effect of the total number of frames on MFMVDR and MFWF. The first two rows are the enhancement metric scores of the noisy speech and of the speech enhanced by DNN1. Both filters are calculated using the enhanced speech from the first stage and are used to process the original noisy speech. From the results, we can observe that both filters perform better as the total number of frames grows. It should be pointed out that, due to the limitation of computing resources, we only explored up to 13 frames; whether the enhancement performance keeps improving when the total number of frames is further increased remains to be verified. Another observation is that MFWF performs much better than MFMVDR for the same total number of frames. The table also shows that the enhancement results of DNN1 are significantly better than those of the two filters, which is reasonable because the parameters required by both filters are calculated from the enhanced results of DNN1 rather than from clean speech, leading to an accumulation of errors.
5.3. Effect of Filter Type on System Performance
Although the results in Table 2 show that MFWF outperforms MFMVDR when used alone, the effect of each on the overall system performance needs to be further verified. For this purpose, we trained two models on WHAM! that have the same configuration except for the type of filter used. Both filters set the total number of frames to 13, with 6 frames on each side of the target frame.
The performance of the two models on the WHAM! test set is shown in Table 3. The WER metric in the table is obtained by feeding the speech enhanced by each model to an ASR system. The ASR system used in this study is a joint Connectionist Temporal Classification-Attention (CTC-Attention) model trained on the wsj0 corpus resampled to 8 kHz. In the decoding stage, a word-level language model is used to improve the decoding results.
It is observed that the system using MFMVDR as the filter module performs slightly better than the one using MFWF, which differs from the conclusion in [9]. The two systems obtain the same scores on the enhancement metrics, but the system using MFMVDR achieves a better WER. We suppose this may be related to the distortionless property of MFMVDR: the model may learn this property from MFMVDR in the second training stage, so that the enhancement results of the system contain less distortion. As the system using MFMVDR performs better, all models in subsequent experiments use the MFMVDR filter.
5.4. Effect of the Framework on System Performance
The framework shown in Figure 1 can integrate MFMVDR information into DNN2, but incorporating two neural networks results in a relatively large number of parameters. A possible modification is to use a single neural network instead, roughly halving the total number of parameters of the entire system; this framework is shown in Figure 4. It is worth noting that two neural networks are depicted in Figure 4 to better illustrate the training process; however, both of them are, in fact, the same network.
The training of this framework still adopts a two-stage strategy. In the first stage, the noisy spectrum Y is concatenated with two all-zero matrices of the same shape and fed to the DNN to estimate the clean spectrum. In the second stage, two iterations are required to produce the final output. In the first iteration, similar to the first stage, Y concatenated with two all-zero matrices is fed into the DNN, and an initial estimate of the clean spectrum, X̂_1, is obtained. X̂_1 and Y are then used to calculate the filtered spectrum X̂_f. In the second iteration, the noisy spectrum Y, the result of the previous iteration X̂_1, and the filtered spectrum X̂_f are concatenated and fed into the network to obtain the final output X̂_2.
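A sketch of one second-stage step under this single-network variant is given below, assuming PyTorch-style modules. Whether gradients flow through the first pass is not specified, so wrapping it in no_grad here is an illustrative choice, not the authors' implementation.

```python
import torch

def single_dnn_stage2_step(dnn, filter_module, optimizer, loss_fn, Y, X):
    """The same DNN is run twice; all-zero placeholders fill the extra input
    channels in the first pass (cf. Figure 4)."""
    zeros = torch.zeros_like(Y)
    with torch.no_grad():                          # first pass: initial estimate
        X1_hat = dnn(torch.cat([Y, zeros, zeros], dim=1))
        Xf_hat = filter_module(X1_hat, Y)          # filtered spectrum from the estimate
    X2_hat = dnn(torch.cat([Y, X1_hat, Xf_hat], dim=1))   # second pass: final output
    loss = loss_fn(X2_hat, X)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss
```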
Table 4 displays the performance of systems utilizing the two different frameworks on the WHAM! test set. The results indicate that the framework using two DNNs outperforms the one using only one DNN. We hypothesize two possible reasons for this observation. Firstly, it may be challenging for a single neural network to learn both noise reduction and the utilization of MFMVDR results to reduce distortion in the final output. Secondly, and more probably, when only a single neural network is used, its parameters keep changing during training, so the accuracy of the input provided to the filter module cannot be guaranteed. In contrast, the framework based on two neural networks ensures accurate input to the filter module, resulting in better filtered results and ultimately improving the overall performance.
Given the results shown in Table 3 and Table 4, the system proposed in this study is based on the framework utilizing two neural networks and employs MFMVDR as the filter module.
5.5. Comparison with Baselines on WHAM!
Table 5 shows the performance of the proposed model and the baseline models on the WHAM! test set. Compared to noisy speech, the enhanced speech achieves a significant improvement in WER, which to some extent illustrates the benefit of speech enhancement for the downstream ASR task. The proposed model performs significantly better than the LSTM model in all metrics, even though both are time-frequency domain models with similar numbers of parameters.
The proposed model has more parameters than Conv-TasNet, which is reasonable, as the latter is a time-domain model and such models usually have smaller parameter sizes. Although Conv-TasNet contains half as many parameters as the proposed model, there is still a large gap between its performance and that of the proposed model, especially in the WER score.
Thanks to its dual-path structure, DPRNN delivers high-quality enhancement results with a limited number of parameters and performs well on the WER metric. Despite having more parameters than DPRNN, the proposed model requires less memory and less time to train a single epoch. Moreover, it outperforms DPRNN in all metrics.
5.6. Ablation Study
We also perform an ablation study to evaluate the effectiveness of the two-stage training strategy and the incorporation of MFMVDR results. The results are shown in Table 6. The first row corresponds to a single TCN-DenseUNet model. The second row uses the MFMVDR as a post-processing module for the TCN-DenseUNet. The model in the third row takes only the noisy input and the enhanced speech from the first stage as inputs in the second training stage. The fourth row gives the results of the proposed model.
Comparing the first and fourth rows, we observe that our chosen base network, TCN-DenseUNet, already achieves strong performance, but there is still noticeable room for improvement by training it in two stages and fusing the MFMVDR information. Comparing the first and second rows, it can be seen that simply post-processing the output of the neural network with the MFMVDR filter does not help and even leads to a significant drop in all metrics. However, adopting the two-stage training approach (first and third rows) yields improvements in both the enhancement and WER metrics. Furthermore, comparing the third and fourth rows shows that incorporating the information from the MFMVDR in the second training stage further improves the performance of the system, suggesting that the MFMVDR results also contribute to the overall performance improvement.
5.7. Generalizability of the Proposed System
To evaluate the generalizability of the proposed system, we used the system trained on the WHAM! dataset to directly enhance the noisy speech from the CHiME-4 dataset; the results are shown in Table 7. The proposed system outperforms the other methods significantly on all metrics for both simulated and real-world data, indicating its superior generalizability.
6. Conclusions
In this study, a two-stage model for single-channel speech enhancement is proposed. The proposed model contains two DNNs: in the first stage, one of them is trained alone; in the second stage, the other DNN uses the enhanced speech from the trained DNN as extra input features. To further improve the enhancement performance and reduce the distortion introduced by neural networks, the output of a single-channel filter is also used in the second stage to guide the training of the model. Two different single-channel filters are investigated, namely MFMVDR and MFWF. We investigate the influence of the number of frames used to calculate the filter and of the padding mode on the performance of the two filters. We also compare the impact of MFMVDR and MFWF on the final model performance and find that MFMVDR delivers a larger improvement.
Our main contribution is a two-stage training approach for the single-channel scenario, which utilizes information from the MFMVDR filter to assist neural network training. As a result, our method achieves improved speech enhancement performance and higher speech recognition accuracy. Experiments on two datasets containing both synthetic and real-world noisy speech show that the proposed model has better enhancement performance and generalization ability. On the WHAM! test set, our model exhibits a relative improvement of 3% in SI-SNR and 2% in WER compared to the best-performing baseline model, DPRNN. On the synthetic test set of CHiME-4, our model demonstrates substantial relative improvements of 40% and 20% in SI-SNR and WER, respectively. Additionally, it achieves a noteworthy relative improvement of 9% in WER on the more challenging real test set of CHiME-4. The ablation study demonstrates the effectiveness of the two-stage training strategy and of incorporating MFMVDR results in training.
Possible directions for future improvements include:
Fewer model parameters. In this study, we employ TCN-DenseUNet, a time-frequency domain model, for both DNN1 and DNN2. Our future research will explore the use of time-domain models, such as DPRNN, as the structure for DNN1 and DNN2 to reduce the number of model parameters.
Improved training strategies. For example, using the parameters of DNN1 to initialize DNN2 in the second stage, followed by fine-tuning DNN2, or jointly training DNN1 and DNN2 with a smaller learning rate to speed up model convergence and further improve performance.
More refined information fusion methods. In this study, the input of DNN2 is a simple concatenation of Y, X̂_1, and X̂_f. In future research, more sophisticated methods, such as attention mechanisms, can be explored to fuse the information from the three.
Diversified model selection. We could make one of DNN1 and DNN2 a time-frequency domain model and the other a time-domain model, in the hope that the two models complement each other to achieve better enhancement performance.