1. Introduction
Deep learning-based sound source separation and reconstruction methods have progressed significantly in recent years [1,2]. State-of-the-art neural network technologies for speech signals can be applied in various scenarios, such as speech enhancement and human–computer interaction [3,4,5,6]. The denoising performance of deep neural networks (DNNs) in signal processing has been reported to be promising at low signal-to-noise ratios (SNRs) and in reverberant environments [7,8]. However, certain situations, for example, the active greetings of service robots in supermarkets and residences, require both the accurate physical positions of the sound sources and accurate descriptions of the content of the acoustic signals. The spatial information of the sound sources can also be useful for signal enhancement and other post-processing approaches. For these and other potential scenarios, we propose a hybrid multichannel signal processing approach that uses a set of microphones for multiple sound source localization, separation, and reconstruction.
Over the past few decades, a variety of sound source separation and reconstruction methods have been developed for automatic speech recognition (ASR) systems. The multichannel blind source separation (BSS) technique, which is based on independent component analysis (ICA), is one of the most widely used [9,10,11,12]. Among the many ICA methods, the well-known fast-ICA is one of the most popular BSS techniques for separating and estimating non-Gaussian sound sources without any prior knowledge of the mixing process [10,13]. It is based on the optimization of a contrast function that measures the non-Gaussianity of the source. When the non-Gaussianity of the modified signals is maximized, these signals are considered to be independent. As a result, the independent sources can be separated by iteratively modifying the mixed signals with a so-called unmixing matrix. The main advantage of the fast-ICA technique is its computational efficiency. These techniques yield promising results for the separation of multiple independent sound sources; however, they cannot directly determine the locations of the sources. An ASR system uses an artificial neural network (ANN) or a DNN as a classification decoder to transcribe speech signals into words and text. Prior knowledge of the speech signals, such as their acoustic features and a pronunciation dictionary (PD), is required before processing [9]. Recently, ASR systems have been improved to cope with ego noise and to locate the sound source by binaural sound source localization (SSL) [14]. For the humanoid robot ASR system introduced in [14], the spatial information from binaural SSL could significantly improve the accuracy of speech recognition, even in a reverberant environment. As stated in [14], the spatial information decoded from the microphones could be used to achieve multi-perception fusion between the auditory and visual senses of artificial intelligence (AI). However, besides requiring prior knowledge, another drawback of this method is that it requires a large number of sound sources.
An alternative to the aforementioned methods is a larger microphone array, which can provide accurate coordinate values over a scanning area rather than only an incident angle [15]. Using a large microphone array, acoustic imaging and signal processing approaches comprising near-field acoustical holography (NAH) and beamforming have been widely applied in various industrial scenarios [16,17,18,19]. The NAH technique can map the sound field at every frequency of interest measured by the array. A very precise sound source distribution can therefore be obtained because of the short radiating distance between the microphone array plane and the sound source plane [18]. However, in many applications, a microphone array cannot be installed at such a short distance from the source plane. In addition, the area of the measured plane is limited by the dimensions of the microphone array.
Acoustic beamformers are regarded as the most suitable approaches for far-field sound source localization tasks such as speech enhancement and human–computer interaction in indoor public areas. Beamforming techniques have been developed for decades [20,21]. In addition, attempts to integrate neural networks (NNs) with acoustic beamformers have proved that NNs can work as the front end of ASR systems in practical SSL scenarios [22,23,24]. In such neural mask-based beamformers, every microphone channel signal is pre-processed directly by NNs to achieve high recognition rates in far-field scenarios. In the SSL area, most acoustic beamformers operate in the frequency domain to achieve more accurate localization and shorter processing durations; examples include the widely used linearly constrained minimum variance (LCMV) method, the multiple signal classification (MUSIC) technique, and the well-known deconvolution methods, including the CLEAN algorithm and the deconvolution approach for the mapping of acoustic sources (DAMAS) [25,26,27,28,29,30]. However, to acquire the time-domain signal features of the sound sources together with their spatial information, the conventional delay-and-sum (DAS) beamforming algorithm is regarded as a prerequisite, so that the localization, separation, and reconstruction of multiple sound sources can be achieved concurrently [20]. The time-domain DAS beamforming algorithm uses a weighting function to compensate for the time delay in each measurement channel; the compensated signals reinforce each other so that their summation is maximized [31].
Wang et al. showed experimentally that a roughly estimated time series can be obtained at each scanning point with two sound sources in the scanning area by weighting the different time delays and compensating for the distance attenuation of each measurement channel [32]. Thus, after the location of the sound source is determined by finding the maximum summation of the compensated signals on the scanning plane, a preliminary estimate of the signal of the dominant sound source can be obtained. Based on this, the main logic of the proposed signal processing approach is as follows. First, the location and rough time-domain features of the dominant sound source are identified and characterized by a time-domain DAS beamformer. Then, a denoising step is performed using supervised-learning DNNs [33] so that the dominant sound source can be reconstructed. Subsequently, the signal of the dominant sound source is subtracted from the originally received signals, and the initial beamforming sound map is updated. Because the dominant sound source has been localized and reconstructed, the received signals of all channels are reduced by this dominant source signal before the next iteration. By repeating this loop, multiple sound sources in the sound field can be localized and reconstructed. As the time consumption of the time-domain DAS beamformer is extremely high, real-time localization and separation can be approached by using a frequency-domain beamformer to estimate the initial location of the dominant sound source. Considering the broadband nature of speech signals, a broadband weighted multiple signal classification (BW-MUSIC) method is used in this study to balance the selection of the computed frequencies against the computation time [34]. Moreover, to make full use of the feature information contained in speech, four parameters are extracted from the pending signal: the mel-frequency cepstral coefficients (MFCC), the amplitude modulation spectrum (AMS), the gammatone filter bank power spectra (GFPS), and the relative spectral transformed perceptual linear prediction coefficients (RASTA-PLP) [35,36]. A schematic of the proposed method is shown in Figure 1.
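To make the delay-and-sum step concrete, the sketch below compensates each channel for its propagation delay to a candidate scanning point, compensates the distance attenuation, and averages the aligned channels; the integer-sample delay approximation, the free-field propagation model, and all variable names are simplifying assumptions rather than the implementation used in this study.

```python
import numpy as np

def das_output(signals, mic_xy, scan_xy, fs, c=343.0):
    """Delay-and-sum output for one scanning point.

    signals : (n_mics, n_samples) array of channel time series
    mic_xy  : (n_mics, 2) microphone coordinates in metres
    scan_xy : (2,) candidate source position
    """
    dists = np.linalg.norm(mic_xy - scan_xy, axis=1)   # propagation paths
    delays = (dists - dists.min()) / c                 # relative delays (s)
    shifts = np.round(delays * fs).astype(int)         # integer-sample approximation
    n = signals.shape[1] - shifts.max()
    # Advance each channel by its relative delay and compensate the 1/r attenuation,
    # so that contributions radiated from scan_xy add coherently.
    aligned = np.stack([dists[m] * signals[m, shifts[m]:shifts[m] + n]
                        for m in range(signals.shape[0])])
    return aligned.mean(axis=0)

# Evaluating the power of das_output over a grid of scanning points yields the sound
# map; its maximum indicates the dominant source, and the corresponding output is the
# preliminary time-domain estimate of that source.
```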
Section 2 describes the proposed algorithms and theoretical methods used in this study. Section 3 and Section 4 present the simulations and experimental studies, respectively, with the corresponding discussions.
3. Simulation Results
To verify the proposed signal-processing approach, numerical simulations were performed in MATLAB. As shown in Figure 4a, three point sound sources were located at (0.7 m, 1.4 m), (0.6 m, 0.4 m), and (1.2 m, 0.4 m), and an annular simulated array with eight receivers was placed around the sources. The directivity pattern of the microphone array, obtained by assuming a single-frequency point source at the geometric center of the scanning area, is shown in Figure 4b.
The eight-channel measured mixed signal, shown in Figure 5, consists of three randomly selected signals from the ST-CMDS testing set and does not contain any additional noise.
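The multichannel mixture for such a free-field simulation can be synthesized roughly as sketched below, by delaying and attenuating each source according to its distance to every receiver. The source coordinates follow Figure 4a, whereas the array radius and center are assumed values; the code is an illustrative reconstruction, not the original MATLAB script.

```python
import numpy as np

fs, c = 16000, 343.0
# Annular array of eight receivers (radius and center are assumed for illustration).
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
mics = 1.5 * np.c_[np.cos(angles), np.sin(angles)] + np.array([0.9, 0.9])
# Source positions from Figure 4a.
srcs = np.array([[0.7, 1.4], [0.6, 0.4], [1.2, 0.4]])

def mix(source_signals):
    """source_signals: list of three equal-length mono arrays -> (8, n) mixture."""
    n = len(source_signals[0])
    out = np.zeros((len(mics), n))
    for s, pos in zip(source_signals, srcs):
        d = np.linalg.norm(mics - pos, axis=1)        # source-to-mic distances (m)
        lag = np.round(d / c * fs).astype(int)        # propagation delays (samples)
        for m in range(len(mics)):
            out[m, lag[m]:] += s[:n - lag[m]] / d[m]  # delay + 1/r attenuation
    return out
```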
3.1. Traditional Time-Domain Beamforming
Figure 6 shows the simulated beamforming map obtained by the DAS beamformer using (4) before the first iteration. The hotspot in Figure 6 indicates the location of the dominant sound source, s1. Because the sound power level (SWL) of s1 was considerably higher than those of s2 and s3, the locations of s2 and s3 can barely be observed in Figure 6. The spatial information from the sound map and the preliminary estimate of s1 obtained by time-domain DAS are therefore taken as initial values, and s1 can then be removed from the mixed signal by the proposed method. A virtual measurement signal is created in which the first dominant sound source, s1, no longer exists.
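A minimal sketch of this removal step is given below: the DNN-denoised estimate of the dominant source is re-propagated to every channel by re-applying the per-channel delay and 1/r attenuation and is then subtracted from the measurement, leaving a virtual mixture in which s1 is absent. The free-field propagation model and all names are illustrative assumptions, not the exact formulation of Section 2.

```python
import numpy as np

def remove_source(measured, s_hat, src_xy, mic_xy, fs, c=343.0):
    """Subtract the reconstructed dominant source from every channel.

    measured : (n_mics, n_samples) original mixture
    s_hat    : (n_samples,) denoised estimate of the dominant source
    src_xy   : (2,) its location taken from the beamforming map
    """
    virtual = measured.copy()
    d = np.linalg.norm(mic_xy - src_xy, axis=1)
    lag = np.round(d / c * fs).astype(int)
    n = measured.shape[1]
    for m in range(measured.shape[0]):
        # Re-apply the channel's propagation delay and 1/r attenuation, then subtract.
        virtual[m, lag[m]:] -= s_hat[:n - lag[m]] / d[m]
    return virtual   # beamforming on this "virtual" signal reveals the next source
```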
Figure 7 shows the beamforming maps obtained after removing s1. To verify the effectiveness of the denoising DNNs, the performance of the DAS beamformer with and without the DNNs was also compared through controlled experiments. The location of the dominant sound source, s2, in the virtual upgraded sound area is shown in Figure 7a,b. At this step, the benefit of adding the DNNs before removing s1 is not yet remarkable.
To further separate the measured mixed signal, s1 and s2 were both removed by analogous processes; Figure 8 shows the corresponding beamforming maps. In Figure 8a, which shows the beamforming map obtained without DNN pre-processing, the location of s3 cannot be identified because of the interference caused by the error generated when removing s1 and s2 from the mixed signal. This problem was resolved by DNN pre-processing, as shown in Figure 8b, where the location of s3 in the virtual upgraded sound area can now be observed. The results of the virtual sound area reconstruction and the corresponding source locations shown in Figure 7b and Figure 8b verify the effectiveness of DNN denoising.
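The denoising network itself is specified in Section 2; as a self-contained stand-in, the sketch below shows a small mask-estimating feed-forward network of the kind commonly used for supervised speech denoising. The layer sizes, the spectral input, and the training snippet are assumptions for illustration and do not reproduce the four-feature front end (MFCC, AMS, GFPS, RASTA-PLP) used in this study.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy mask-estimating DNN: maps a noisy magnitude spectrum to a denoised one."""
    def __init__(self, n_feat=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_feat, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_feat), nn.Sigmoid())   # time-frequency mask in [0, 1]

    def forward(self, noisy_spec):
        return noisy_spec * self.net(noisy_spec)     # masked (denoised) spectrum

model = Denoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
# Supervised training on (noisy, clean) spectrum pairs, e.g. derived from ST-CMDS data:
#   loss = loss_fn(model(noisy_batch), clean_batch); loss.backward(); opt.step()
```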
3.2. Weighted Broadband MUSIC
As the high computational burden of the time-domain DAS beamformer conflicts with the requirement of real-time localization and separation, an alternative localization approach, the so-called BW-MUSIC method, was used to balance the broadband nature of the speech signal against real-time operation. In the simulation cases, the sampling rate of the speech signals from the ST-CMDS data set was 16,000 Hz; therefore, the analysis frequency band was set to 50–8000 Hz with a step of 50 Hz for the BW-MUSIC method. The total processing duration for one sound source was 78.037 s, measured in MATLAB on an Intel i7-6700 central processing unit (CPU) with 44 GB of random access memory (RAM).
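The sketch below outlines one possible realization of such a broadband weighted MUSIC map over the 50–8000 Hz band with 50 Hz bins. The power-based frequency weighting, the near-field steering model, and the single-source noise-subspace dimension are illustrative assumptions; the exact weighting of [34] is not reproduced.

```python
import numpy as np
from scipy.signal import stft

def bw_music_map(X, mic_xy, grid_xy, fs, fmin=50, fmax=8000, fstep=50, c=343.0, n_src=1):
    """Broadband weighted MUSIC map (power-weighted average is an illustrative choice).

    X       : (n_mics, n_samples) time-domain measurement
    grid_xy : (n_points, 2) scanning grid
    """
    n_mics = X.shape[0]
    nper = fs // fstep                              # 50 Hz bin spacing -> 320 samples
    freqs, _, Z = stft(X, fs=fs, nperseg=nper)      # Z: (n_mics, n_freqs, n_frames)
    d = np.linalg.norm(grid_xy[:, None, :] - mic_xy[None, :, :], axis=2)
    p, wsum = np.zeros(len(grid_xy)), 0.0
    for k, f in enumerate(freqs):
        if not (fmin <= f <= fmax):
            continue
        snaps = Z[:, k, :]                          # narrowband snapshots
        R = snaps @ snaps.conj().T / snaps.shape[1] # spatial covariance (needs several frames)
        _, V = np.linalg.eigh(R)
        En = V[:, :n_mics - n_src]                  # noise subspace (smallest eigenvalues)
        A = np.exp(-2j * np.pi * f * d / c)         # steering vectors to each grid point
        proj = np.sum(np.abs(A.conj() @ En) ** 2, axis=1)
        w = np.real(np.trace(R))                    # illustrative band weight (signal power)
        p += w / np.maximum(proj, 1e-12)
        wsum += w
    return p / wsum                                 # peak indicates the dominant source
```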
The improved localization results obtained with the BW-MUSIC method are shown in Figure 9, in which the locations of s1, s2, and s3 are conspicuously indicated in the beamforming maps of the virtual upgraded measurement signals. Compared with Figure 6, Figure 7b, and Figure 8b for the DAS beamformer, the localization results of BW-MUSIC in Figure 9a–c are much more accurate, with less prominent side lobes (in terms of the main-lobe-to-side-lobe ratio, MSR) and higher resolution. More specifically, owing to the error elimination provided by the weighted summation and averaging in the BW-MUSIC method, the sound sources can be clearly identified by the main lobes, which appear as distinct hotspots in all three figures, while the areas of the hotspots are smaller.
3.3. Performance Evaluation of the Localization Results
Table 1 lists the errors in the absolute localization distance for the different simulation methods. The maximal error of 0.79 m occurs in the DAS beamforming map obtained after removing s1 and s2 without DNNs. Consistent with the localization results shown in the figures, the DNN-based beamforming method yields smaller errors and accurate localization results.
It is worth mentioning that the virtual measurement signal cannot be constructed perfectly in the time domain. This means that the residual error remaining after removing the dominant sound source can still be appreciable in the upgraded sound area. Therefore, the location of s1 can still be observed in Figure 9b,c. The same conclusion can be drawn from the signal reconstruction section.
As the interference in the beamforming maps is usually generated by the two other sound sources rather than by side lobes, the common MSR may give an inaccurate description of the localization performance in this case. Therefore, a parameter called the main-to-second-lobe level (MSEL) was defined and utilized in this study:
MSEL = Lm − Ls,
where Lm is the height of the main lobe and Ls is the height of the second lobe generated by the other sources.
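Assuming the beamforming map is expressed in decibels and the lobes are identified by simple peak picking, the MSEL can be evaluated as sketched below; the peak-picking rule and neighborhood size are illustrative choices, not the procedure of the original study.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def msel(bf_map_db):
    """Main-to-second-lobe level of a 2-D beamforming map given in decibels.

    The second lobe is taken as the second-highest local maximum of the map
    (at least two local maxima are assumed to exist).
    """
    local_max = (bf_map_db == maximum_filter(bf_map_db, size=5))
    peaks = np.sort(bf_map_db[local_max])[::-1]
    l_main, l_second = peaks[0], peaks[1]
    return l_main - l_second   # negative when another source dominates the map
```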
Figure 10a shows the MSEL results of the three sources for the different approaches. First, the MSEL of the beamforming map of the third source obtained by DAS without DNNs is negative, which means that this method fails to locate the third sound source, as shown in Figure 8a. The DNNs significantly improve the MSEL of the beamforming maps obtained after removing s1 and s2. Second, the MSEL of the BW-MUSIC method is not always higher than that of the DAS method, even though the BW-MUSIC method provides more distinct beamforming maps.
The effective SNR ranges of the different approaches were also evaluated by the MSEL. Since no additional noise was added in this case, the decay rate between two adjacent sound sources was used to describe the SNR. For example, assuming that the primary amplitudes of the three sound sources are A1, A2, and A3, the actual amplitudes of the three sound sources in the simulation would be set to A1, 0.8A2, and 0.64A3 when the decay rate is 80%.
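In code form, this amplitude scaling for a given decay rate is simply the following (A1, A2, and A3 are placeholder values for the primary amplitudes):

```python
A1, A2, A3 = 1.0, 1.0, 1.0               # primary amplitudes (placeholder values)
decay = 0.8                               # decay rate of 80%
amps = [A1, decay * A2, decay**2 * A3]    # -> A1, 0.8*A2, 0.64*A3
```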
Figure 10b shows how the MSEL of the beamforming maps obtained after removing s1 and s2 varies with the decay rate. Apparently, DAS without DNNs cannot locate the third sound source over the entire SNR range, while the other two approaches fail when the decay rate is lower than 20%. According to the curves shown in Figure 10b, a decay rate higher than 40% can be regarded as a reasonably effective SNR range for DAS with DNNs. For BW-MUSIC with DNNs, the criterion should be a decay rate higher than 60%.
3.4. Signal Separation and Reconstruction
The DNN-processed signals were extracted and removed from the mixed signal simultaneously. The performance of the reconstruction and separation of the original signals is shown in Figure 11. Figure 11(1a) shows the original s1 from the ST-CMDS testing set; Figure 11(1b) shows s1 reconstructed by the proposed method; and Figure 11(1c) shows the cross-correlation function between them. The same arrangement is adopted for s2 and s3 in Figure 11(2a–c) and Figure 11(3a–c), respectively. It can be noticed that s2 and s3 contain little information in the 0–0.5 s range in Figure 11(2a,3a), whereas
s1 contains a large amount of information within the same range. Under these conditions, the three mixed signals were well separated and reconstructed. Only a few residual errors can be observed in the 0–0.5 s range in Figure 11(2b,3b). Additionally, the sharp peaks close to 1 in the cross-correlation functions shown in Figure 11(1c,2c,3c) indicate that the corresponding signal pairs are strongly coherent with each other, while no time shift can be observed on the horizontal axis.
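The coherence check of Figure 11(1c–3c) can be reproduced with a normalized cross-correlation; a minimal sketch, with assumed variable names for the original and reconstructed waveforms, is:

```python
import numpy as np

def normalized_xcorr(a, b):
    """Normalized cross-correlation of two equal-length signals.

    A peak close to 1 at zero lag indicates that the reconstructed signal matches
    the original without a time shift."""
    a = (a - a.mean()) / (a.std() * len(a))
    b = (b - b.mean()) / b.std()
    return np.correlate(a, b, mode="full")

# Example (assumed names):
#   r = normalized_xcorr(s1_original, s1_reconstructed)
#   lag = np.argmax(np.abs(r)) - (len(s1_original) - 1)
```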
4. Experimental Study
Experimental validation was carried out in a semi-anechoic chamber of the Institute of Vibration, Shock, and Noise (IVSN) at Shanghai Jiao Tong University (SJTU). The background noise level of the chamber is 15.6 dB(A), and its cut-off frequency is 100 Hz. The floor of the semi-anechoic chamber is solid and can act as a work surface for supporting heavy items. In the experiments, a 56-channel spiral array with 40 enabled microphones was placed 2.404 m in front of the measurement plane. Three Philips BT25 Bluetooth loudspeakers were arranged in the measurement plane as sound sources. The acoustic signals were captured by Brüel & Kjær 4944-A microphones, which were calibrated with a Brüel & Kjær 4231 94 dB sound pressure calibrator before the measurements. The sound pressure analog signals were converted into digital signals using a 42-channel Mueller-BBM MKII sound measurement system at a sampling rate of 16,000 Hz. A snapshot of the experimental setup is presented in Figure 12.
As shown in Figure 12, the geometric center of the spiral array, which was at a height of 1.39 m above the floor, was set as the coordinate origin. The three sound sources, at heights of 1.752 m, 1.128 m, and 1.419 m above the floor, were supported by tripods. A laser level and tape measures with 0.001 m accuracy were used to determine the source locations. Accordingly, accounting for the length of the microphones, the coordinates of the three sound sources in a near-field Cartesian coordinate system were (−0.800 m, 0.029 m, 2.304 m), (0.007 m, −0.262 m, 2.304 m), and (0.793 m, 0.362 m, 2.304 m). The plane of the microphones was set as the x–z coordinate plane. Three randomly selected data pieces from the ST-CMDS testing set were broadcast by the loudspeakers in an endless loop during the measurements. A 40-channel mixed signal is shown in Figure 13.
The beamforming maps of the localization results obtained by the proposed process are shown in Figure 14. With the BW-MUSIC method and the denoising DNNs, the localization performance is acceptable, with obvious main lobes indicating the locations of the sound sources. Side lobes can barely be noticed in any of the three maps. However, the areas of the hotspots are larger than in the simulation results. As the experimental chamber is semi-anechoic, the lower resolution in Figure 14 compared with Figure 9 can be attributed to the ground reflection of sound waves.
Table 2 lists the corresponding errors in the absolute localization distance obtained by BW-MUSIC. The three sound sources are localized accurately, with location errors of less than 0.15 m.
A comparison of the original signals from the dataset and the signals reconstructed by the proposed method is shown in Figure 15. The reconstructed signals shown in Figure 15(1b–3b) prove that the proposed method can successfully separate the three sound sources from the mixed measurement signal. However, the apparent discrepancy between the original and reconstructed signals indicates that the signal reconstruction performance in the experimental study is worse than in the simulations. As background noise is generated by the sound measurement system and by the ground reflection of sound waves, a more robust denoising DNN for lower SNRs will be explored in future work to improve the reconstruction performance.
5. Conclusions
The aim of this paper is to present a feasible approach for the real-time synchronous localization, separation, and reconstruction of multiple sound sources. Its performance in processing speech signals from the ST-CMDS dataset was investigated through simulations and experiments. The major conclusions of this study are as follows.
1. Numerical simulations and experimental studies show that, by adopting the beamforming technique and denoising DNNs, it is feasible to localize and separate mixed multiple speech sound sources in the time domain. The sound map of a DAS beamformer is formed by the superposition of the contributions of all sources in the sound field, so sources with lower SWLs may be masked by sources with higher SWLs. Accordingly, compared with traditional beamforming methods for multiple sound source localization, the proposed approach is particularly suitable for a sound field containing a dominant sound source and masked weak sound sources, owing to the iterative signal removal and virtual sound field reconstruction. A decay rate higher than 60% can be regarded as the effective SNR range of the proposed method.
2. The time-domain signals were estimated and denoised using the DAS beamformer with DNNs. In practice, however, the computational cost of the time-domain DAS beamformer is unsuitable for the intended applications. Thus, the localization procedure in the proposed approach is improved by the BW-MUSIC method, which is more flexible for broadband speech signals and offers an adjustable operational speed in the frequency domain. In particular, the accuracy and resolution of the sound source localization are enhanced by the BW-MUSIC method. The sound sources are localized accurately, with location errors of less than 0.001 m in the numerical simulations and 0.15 m in the experimental study. The temporal characteristics of the source signals are well extracted from the measurement signals in the simulations, but the signal reconstruction performance in the experiments is worse because of the ground reflection of sound waves and the background noise.
3. The proposed approach also provides a potential criterion for determining whether a small hotspot in a beamforming map is the main lobe of a weaker source or a side lobe of a stronger source elsewhere, because each main sound source is plotted on the beamforming map separately through the iterations. This can be regarded as an alternative approach for estimating the number of sources in a given sound field.