1. Introduction
Speech signal processing is a principal area in the intelligentization of robot–human communication systems [1]. Human–robot interfaces are used in applications such as rescue operations, defense environments, earthquakes and other unexpected events, factories, and smart control systems. A wide range of signal processing techniques therefore belongs to this area, including simultaneous sound source localization (SSL) [2], speaker tracking [3], speech signal enhancement [4], speaker identification [5], voice activity detection [6], and speaker counting [7]. Speaker tracking and source localization are two common speech processing algorithms that are widely used in intelligent room implementations. Speech overlaps and fast changes between active speakers are important challenges for localization and tracking systems. Most localization algorithms require knowledge of the correct number of speakers to calculate accurate speaker locations. Speaker counting methods are implemented as a preprocessing step for many other speech processing algorithms, so the accuracy of speaker counting directly affects the precision of those methods. Speaker counting is also strongly affected by environmental conditions such as noise, reverberation, and temperature [8]. A method that properly estimates the number of speakers (NoS) must return the correct count even in the worst acoustical scenarios. Aliasing and imaging, which arise from the inter-sensor distances, further decrease the precision of NoS estimation methods. Microphone arrays provide rich information about indoor environments thanks to their large number of sensors, but aliasing and imaging caused by the inter-sensor distance distort the frequency components of the recorded speech signals [9]. Hence, a microphone array with a proper and specific sensor structure increases the accuracy of NoS estimation, which is one of the areas of interest in this research work.
In the last decade, various research works have addressed smart meeting room applications. Smart rooms include different types of microphone arrays with high-quality microphones to provide proper information for audio processing algorithms. In the cocktail party scenario, the microphones record the overlapped speech signals of simultaneous speakers. Knowing the correct NoS is a common assumption in speech processing systems with various applications, and it is an important factor in their performance. Common speaker separation algorithms [10] do not directly provide information about the number of speakers. Therefore, estimating the NoS is required to close the gap between theoretical and practical systems in real environments. In recent years, a few microphone-array-based methods for estimating the NoS have been proposed. In theory, estimating the NoS is directly related to the speaker identification task known as speaker diarization [11,12]. If a system shows who is speaking at a specific time, it naturally provides information about the number of simultaneous speakers in the overlapped speech region, which is known as counting by detection. A diarization system with high accuracy detects the number of speakers efficiently. In addition, the available diarization systems rely on an explicit segmentation process, whose first step is detecting segments with one active speaker. The segment borders are detected by speaker change detectors [13], and the detected segments are used to identify the speakers in the recorded speech signals. The available segmentation methods fail in the cocktail party scenario when there are simultaneous active speakers; in fact, the overlapped segments in the speech signal are the main error source in diarization systems [14]. Some algorithms detect and reject the overlapped speech segments to improve the performance of detection-based methods [15]. Detecting these segments has become a research field of its own in speech processing, with a wide range of proposed systems [16,17]. Finding the overlapped speech segments can be considered a binary version of the NoS estimation problem, where the number of overlapped speakers is one for non-overlapped signals and more than one for overlapped speech. Therefore, NoS estimation methods can detect the overlapped speech segments, but not vice versa. In addition, an overlapped speech detector cannot easily be turned into a speaker separation algorithm: before the advent of deep learning in speaker separation, the models required substantial context, and in techniques such as non-negative matrix factorization, the number of simultaneous speakers enters as a regularization term [18].
2. State-of-the-Art
Since overlapped speech is a common case in cocktail party scenarios, estimating the number of simultaneous speakers is an important task in these systems. Humans can separate one speaker from a mixture of speech signals thanks to their natural capacity [19], and they use this ability to estimate the number of simultaneous speakers in real conditions. In the last decade, a limited range of methods has been proposed for speaker counting based on audio signals, and low accuracy in noisy and reverberant scenarios is their main weakness. Kumar et al. proposed an NoS estimation method in 2011 based on Bessel features of the signal recorded from simultaneous speakers [20]. The method operates on the signals of multiple active speakers captured by a pair of spatially separated microphones. The captured signals are used to estimate the delay of the speech between source and microphones, calculated by a cross-correlation function. The peaks of the cross-correlation function are unreliable due to noise, reverberation, and the sinusoidal components introduced by the vocal tract response. These undesirable effects are reduced by limiting the speech components to low frequency bands through a proper range of Bessel coefficients. The Bessel functions provide proper speech features due to their regular zero crossings, which suit speech signal processing. The method estimates the NoS in scenarios where the number of simultaneous speakers is less than the number of microphones, but its accuracy decreases in reverberant scenarios. Tomasz and Miroslaw proposed a speaker counting method in 2018 for human–machine interaction systems [21]. The interference between different sources significantly reduces system performance in many applications. They proposed a speaker counting method based on tracking spectral peaks and features extracted from the spectrum. The statistical properties of the extracted features are used to detect the relation between speakers, even when the features are unclear in some conditions. In addition, non-negative matrix factorization was selected to classify the signal spectrum and evaluate the perception of the speech signal. Fabian et al. in 2019 proposed a method for estimating the NoS based on supervised learning [22]. They presented a unifying probabilistic criterion in which a deep neural network produces the output posterior function, and these probabilities are evaluated to obtain discrete point estimates. They showed that convolutional neural networks are a proper instrument for such evaluations based on extracted speech features. In addition, comprehensive evaluations showed the method to be robust to gain variation of the speech signal across different datasets, noise, and reverberation. Humans, based on their auditory and perceptual system, focus on a specific speaker in noisy spaces with simultaneous overlapped speakers. Based on this observation, Valentin et al. in 2019 considered human brain characteristics to propose a speaker counting system [23]. They designed a perceptual study evaluating participants' capacity for speaker counting in an acoustical environment. This research group considered speech-diarization-based models, exploiting the acoustical characteristics of the speech signal to estimate the number of speakers. In addition, the performance of the diarization system for estimating the NoS is increased by reducing the search space for output sentences. If the method is run over long time frames, the probability of detecting simultaneous speakers increases for each time–frequency (TF) point, which is the benefit of a limited search space. The method has been evaluated on various datasets and estimates the number of simultaneous speakers with accuracy similar to human capacity. Shahab et al. in 2017 proposed a blind speaker counting method for reverberant conditions based on clustering coherent features [24]. The method uses the frequency-domain magnitude squared coherence (FD-MSC) between two recorded speech signals of different speakers as a reliable feature for detecting the number of speakers in reverberant environments. It has been shown that speakers have different FD-MSC features, obtained from the recorded signals of two microphones, and the method distinguishes the speakers by these unique sound features. Speaker counting methods normally require the inter-sensor distance, which is not needed by this algorithm; in addition, the method is robust in noisy and reverberant scenarios for different inter-sensor distances. Ignacio et al. in 2018 presented a speaker counting method based on variational Bayesian probabilistic linear discriminant analysis (PLDA) in diarization techniques [25]. They presented an i-vector PLDA system in which i-vector observations are extracted for each time point of the speech signal after an initial segmentation, and these i-vectors are classified by a fully Bayesian PLDA algorithm. The model generates the diarization labels by averaging the repeated Bayes variables over the hidden parameters as the speakers' labels, and the number of speakers is estimated by comparing various hypotheses under different information criteria. Pierre et al. in 2020 proposed a high-resolution speaker counting method based on improved convolutional neural networks [26]. They first introduced multi-channel signals as the input of neural networks for NoS estimation. In the proposed method, the ambisonics features of multi-channel microphone arrays are combined with convolutional recurrent neural networks (AF-CRNN). The accuracy of the CRNN for estimating the NoS is increased by extracting the sound features in short time frames, which are also proper information for speaker separation and localization algorithms. Junjie et al. in 2019 proposed a speaker counting method based on density clustering and classification decision (SC-DCCD) on overlapped frames of speech signals [27]. In a reverberant scenario, the performance of TF methods is reduced by the short-time frames used in real conditions. To tackle this challenge, a density-based classification method was proposed for estimating the NoS from a locally dominant speaker in the recorded speech signals. The algorithm comprises several clustering steps under a high-reliability assumption on the signal frames. In the first step, the dominant eigenvectors are calculated from a local covariance matrix of the mixed TF speech components. The eigenvectors are then ranked by combining their local densities with the distance to other local eigenvectors of higher density. In the second step, a gap-based method calculates the cluster centers of the ranked eigenvectors in the real frequency bins. In the third step, a criterion based on the averaged cluster centers prepares reliable clusters for the decision step. The method provides suitable estimates in reverberant, noise-free conditions.
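As an illustration of the first SC-DCCD step described above, extracting the dominant eigenvector of a local covariance matrix can be sketched as follows. This is a minimal numpy sketch on a synthetic rank-one mixture; the variable names and the test setup are ours, not from [27]:

```python
import numpy as np

def dominant_eigenvectors(tf_regions):
    """Return the eigenvector of the largest eigenvalue of each
    local spatial covariance matrix (one per TF region)."""
    vecs = []
    for F in tf_regions:                      # F: (n_mics, n_snapshots)
        R = F @ F.conj().T / F.shape[1]       # local covariance matrix
        _, v = np.linalg.eigh(R)              # eigenvalues in ascending order
        vecs.append(v[:, -1])                 # dominant eigenvector
    return np.array(vecs)

# synthetic TF region dominated by one speaker with steering vector a
rng = np.random.default_rng(0)
a = np.array([1.0, 2.0, 3.0]) / np.linalg.norm([1.0, 2.0, 3.0])
F = np.outer(a, rng.standard_normal(50))      # rank-one snapshot matrix
v = dominant_eigenvectors([F])[0]             # recovers a (up to sign)
```

When a single speaker dominates a TF region, the snapshot matrix is close to rank one and the dominant eigenvector aligns with that speaker's steering vector, which is what makes these eigenvectors useful clustering features.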
In this article, a novel method is proposed for speaker counting using a new hive-shaped nested microphone array (HNMA) in combination with the wavelet packet transform (WPT) for smart sub-band processing and a two-dimensional steered response power (SRP) algorithm with adaptive phase transform (PHAT) and maximum likelihood (ML) weighting functions. Aliasing and imaging are important factors that affect the frequency components of a speech signal; therefore, an HNMA with extension capacity is proposed to eliminate them. Since the speech signal carries different information in each time frame, the Blackman–Tukey method is proposed for spectral estimation of the speech signal to eliminate the areas with low speech content. In addition, the WPT is proposed for smart sub-band processing, exploiting the variable resolution of the speech signal across frequencies. Next, the modified SRP algorithm is implemented in two dimensions on the sub-band information of the speech signal, and weighting functions are calculated in each sub-band to shape the SRP function: the PHAT and ML weightings are adaptively combined with the 2DASRP algorithm. The standard deviation (SD) is then calculated for each sub-band, and the peak positions of the SB-2DASRP function within this range are kept while the rest are eliminated. Finally, the retained peak positions are fed to an unsupervised agglomerative classification method, and the number of clusters (speakers) is estimated by the elbow criterion. The basic idea of this article was presented in our conference paper [28] for a simple scenario in a limited space. Furthermore, the proposed HNMA-SB-2DASRP algorithm is compared with more complex methods such as FD-MSC, AF-CRNN, and SC-DCCD.
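The WPT sub-band step above can be illustrated with a minimal sketch. For simplicity, this sketch uses the Haar wavelet and a fixed depth of three levels; the actual system may use a different mother wavelet and decomposition depth:

```python
import numpy as np

def haar_wpt(x, levels=3):
    """Full Haar wavelet-packet decomposition into 2**levels sub-bands.
    Each level splits every band into an approximation (low-pass) and
    a detail (high-pass) half-band."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nxt = []
        for b in bands:
            b = b[: len(b) // 2 * 2]                 # force even length
            lo = (b[0::2] + b[1::2]) / np.sqrt(2)    # approximation band
            hi = (b[0::2] - b[1::2]) / np.sqrt(2)    # detail band
            nxt += [lo, hi]
        bands = nxt
    return bands

x = np.arange(64, dtype=float)
bands = haar_wpt(x, levels=3)   # 8 sub-bands of 8 samples each
```

Because the Haar pair is orthonormal, the total energy of the sub-bands equals the energy of the input frame, so no speech information is lost by the decomposition itself.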
Section 3 presents the models for the microphone signals in real and simulated conditions, as well as the HNMA for eliminating aliasing between the microphone signals. Section 4 presents the proposed NoS estimation method based on Blackman–Tukey spectral estimation, sub-band processing by the WPT, and the modified two-dimensional sub-band SRP function with adaptive use of the PHAT and ML filters, combined with unsupervised agglomerative clustering and the elbow criterion. In Section 5, simulations of the proposed HNMA-SB-2DASRP algorithm are presented and compared to the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD methods in different noisy and reverberant scenarios; the computational complexity of the method is also evaluated and compared with other works. Finally, Section 6 concludes on the proposed system for real-time implementation with different numbers of simultaneous speakers.
5. Results and Discussion
To evaluate the performance, reliability, and precision of the speech processing algorithms, they were implemented on both real and simulated data. The TIMIT dataset was used to prepare the simulated data for the analysis of the proposed HNMA-SB-2DASRP method [38]. The speech signals in the TIMIT dataset were recorded in various acoustical conditions, allowing undesirable scenarios such as noise and reverberation to be generated. In addition, the proposed method was compared to other works on real data, recorded in the computing and audio research laboratory at the University of Sydney, Australia. The simulated scenarios were made as similar as possible to the real conditions, which makes the results comparable between the two environments. Since the proposed method estimates the number of simultaneous speakers, the simulations covered scenarios with two to five overlapped speakers, which is what normally happens in real recording conditions; more than five simultaneous speakers rarely appear in practice. One male and one female speaker were selected for the two-speaker condition, and the five-speaker scenario included two female and three male simultaneous speakers.
Figure 7 shows a view of the simulated room for recording the speech signal in the speaker counting application. The proposed algorithm targets the specific application of smart meeting rooms, where the speakers are around and at a short distance from a microphone array. As seen, the proposed HNMA was located at the center of the room to provide the best spatial symmetry of the microphone signals. Based on the defined scenarios, the room dimensions and the HNMA position were selected as (684, 478, 351) cm and (342, 239, 165) cm, respectively. In addition, the speakers' locations were S1 = (173, 382, 177) cm, S2 = (132, 203, 172) cm, S3 = (325, 64, 164) cm, S4 = (594, 96, 169) cm, and S5 = (523, 394, 179) cm, which is a proper distribution of the speakers in the acoustical room. The speakers could have been located anywhere in the room, but this distribution was proposed since it is a common scenario in normal meeting rooms. Given the HNMA and speaker positions, the near-field criterion was a correct assumption for this scenario in the simulations. In addition, Figure 8 shows the acoustical room at the University of Sydney where the real data were recorded.
Since the speech signal processing techniques were implemented on short time frames, a Hamming window with 55 ms length and 50% overlap was selected to provide the maximum proper information for each speaker.
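The framing described above can be sketched as follows; this is a minimal numpy sketch assuming a 16 kHz sampling rate, not the paper's implementation:

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=55, overlap=0.5):
    """Split a signal into Hamming-windowed frames with the given overlap."""
    frame_len = int(fs * frame_ms / 1000)      # 55 ms -> 880 samples at 16 kHz
    hop = int(frame_len * (1 - overlap))       # 50% overlap -> 440-sample hop
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    window = np.hamming(frame_len)
    return np.stack([x[i*hop : i*hop + frame_len] * window
                     for i in range(n_frames)])

# one second of a test tone
x = np.sin(2*np.pi*440*np.arange(16000)/16000)
frames = frame_signal(x)    # shape: (n_frames, 880)
```

The 50% overlap means every sample (except at the edges) contributes to two frames, which keeps short speech events from being lost at frame boundaries.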
Figure 9 shows the speech signal of each speaker and the overlapped signals for two to five simultaneous speakers, which were used in the simulations of the proposed speaker counting algorithm. As seen, the length of the overlapped signal decreases as the number of simultaneous speakers increases, which clearly reflects the lower probability of having many simultaneous speakers in real scenarios. For the recorded data in the proposed scenario, 33.8 s of overlapped signal corresponded to two simultaneous speakers, 29.2 s to three, 26.1 s to four, and 24.2 s to five overlapped speakers.
The maximum frequency component and sampling frequency of the recorded signals were 8000 Hz and 16 kHz, respectively. Noise, reverberation, and aliasing were the most undesirable factors decreasing the quality of the recorded signals. The effect of aliasing was removed by the proposed HNMA, but the effects of noise and reverberation were clearly observed in the signals recorded in the real scenarios. Therefore, the simulated signals had to be generated to be as similar as possible to the real recorded data. White Gaussian noise was added at the microphone positions, which accurately models the real environmental noise. Reverberation was another undesirable factor decreasing the accuracy of NoS estimation methods. Various methods have been proposed for simulating the effect of reverberation; the image method was used in the simulations of the proposed algorithm [39]. The image method produces the room impulse response between each speaker and microphone from the speaker position, room dimensions, sampling frequency, impulse response length, reverberation time (RT60), and room reflection coefficients. The microphone signal is then generated by convolving the source signal with the room impulse response produced by the image method.
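The microphone-signal generation step can be sketched as follows. The room impulse response here is a hand-made placeholder (a real simulation would produce it with the image method [39]), and `add_noise_at_snr` is our illustrative helper, not part of the paper's code:

```python
import numpy as np

def add_noise_at_snr(x, snr_db, rng=np.random.default_rng(0)):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    p_sig = np.mean(x**2)
    p_noise = p_sig / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(p_noise), size=x.shape)

# placeholder RIR: a direct path plus two decaying reflections
rir = np.zeros(2000)
rir[[0, 700, 1500]] = [1.0, 0.4, 0.2]

rng = np.random.default_rng(1)
source = rng.standard_normal(16000)           # 1 s of source signal at 16 kHz
mic = np.convolve(source, rir)[:len(source)]  # reverberant microphone signal
mic = add_noise_at_snr(mic, snr_db=15)        # sensor noise at 15 dB SNR
```

Convolving with the RIR adds the delayed, attenuated reflections, and the additive noise is scaled from the signal power so the simulated SNR matches the target condition.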
As mentioned in Section 4.1, the Blackman–Tukey method was used to estimate the speech spectrum, selecting the frequency bands with strong speech content and eliminating the frequencies with low speech components. This method was applied to all signals recorded by the HNMA to select the best parts of the microphone signals. Figure 10 shows a sample spectral estimate obtained by the Blackman–Tukey method and the selection of the proper spectral area on one frame of the recorded signal. As shown, within the frequency range [0–7800] Hz, the dominant speech components lay in the ranges [0–2657] Hz and [5871–6594] Hz, while noise and reverberation dominated the rest of the spectrum. A threshold of 30% of the maximum frequency amplitude of the signal was experimentally defined for selecting the proper frequency components and eliminating the undesirable ones. As shown, only the desired frequency components of the speech signal were passed to the proposed speaker counting method, which increased the accuracy of the final results.
Next, the SRP method was adaptively implemented in 2D on the sub-band speech signals using the PHAT and ML weightings. The PHAT or ML filter was selected for each sub-band according to its noise conditions [35]; therefore, the weighting filters provided the best results for the sub-band 2DASRP algorithm.
Figure 11 shows the energy map of the SB-2DASRP function at a fixed z value (z = 170 cm) for sample conditions with two, three, four, and five simultaneous speakers under the selected noise and reverberation levels. As seen, the SRP peaks related to the real speakers were affected by the environmental noise and reverberation. For example, in the two-speaker scenario (Figure 11a), the peaks related to the speakers are seen clearly, but for five simultaneous speakers (Figure 11d), the number of false peaks grows sharply because of the greater number of speakers. As the number of speakers increases, the false peaks in the energy map of the SRP function increase and proportionally decrease the estimation accuracy due to the additional environmental reflections. The weighting filters, the comparison with the SD, the classification, and the elbow criterion are the steps used to reduce these undesirable effects and to prepare more accurate data for estimating the final number of simultaneous speakers.
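To make the SRP energy-map computation concrete, here is a minimal single-band 2D SRP-PHAT sketch on synthetic data. It uses four illustrative corner microphones and fractional-delay mixtures, and omits the sub-band, ML, and adaptive parts of the full HNMA-SB-2DASRP system:

```python
import numpy as np

C, FS, NFFT = 343.0, 16000, 1024   # speed of sound, sample rate, FFT size

def delay_frac(x, tau_samples):
    """Circularly delay a signal by a fractional number of samples."""
    X = np.fft.rfft(x, NFFT)
    w = 2*np.pi*np.fft.rfftfreq(NFFT)
    return np.fft.irfft(X * np.exp(-1j*w*tau_samples), NFFT)

def gcc_phat(x1, x2):
    """GCC-PHAT: cross-spectrum normalized to unit magnitude."""
    X1, X2 = np.fft.rfft(x1, NFFT), np.fft.rfft(x2, NFFT)
    cs = X1 * np.conj(X2)
    return np.fft.irfft(cs / (np.abs(cs) + 1e-12), NFFT)

def srp_phat_map(signals, mic_pos, grid_xy, z):
    """Sum GCC-PHAT values over all microphone pairs for each 2D point."""
    pairs = {(i, j): gcc_phat(signals[i], signals[j])
             for i in range(len(signals)) for j in range(i+1, len(signals))}
    power = np.zeros(len(grid_xy))
    for k, (x, y) in enumerate(grid_xy):
        d = np.linalg.norm(mic_pos - np.array([x, y, z]), axis=1)
        for (i, j), cc in pairs.items():
            lag = int(round((d[i] - d[j]) / C * FS)) % NFFT  # expected TDOA
            power[k] += cc[lag]
    return power

# four microphones near the room corners, one source at (1.0, 2.0)
mic_pos = np.array([[0.0, 0.0, 1.65], [3.0, 0.0, 1.65],
                    [0.0, 3.0, 1.65], [3.0, 3.0, 1.65]])
src = np.array([1.0, 2.0, 1.70])
s = np.random.default_rng(0).standard_normal(NFFT)
dists = np.linalg.norm(mic_pos - src, axis=1)
signals = [delay_frac(s, d / C * FS) for d in dists]

grid = np.array([(x, y) for x in np.linspace(0, 3, 16)
                 for y in np.linspace(0, 3, 16)])
power = srp_phat_map(signals, mic_pos, grid, z=1.70)
best = grid[int(np.argmax(power))]   # grid point with maximum SRP energy
```

In this clean synthetic case the energy map peaks at the true source position; with noise, reverberation, and several speakers, the map instead shows the mix of true and false peaks discussed above.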
The last step of the proposed system in Figure 5 is unsupervised classification combined with the elbow criterion for estimating the number of clusters (speakers). As explained in Section 4.4, the number of clusters (speakers) is estimated from the sharpness of the slope change in the elbow curve. To select the correct number of speakers (K) from the elbow curve, the slope of the curve is calculated for each K value: the slopes of the line before and after each point are computed, and the point with the greatest difference between these slopes is automatically selected as the elbow. Figure 12 shows the elbow diagram for two to five simultaneous speakers in a sample scenario with a moderate level of noise and reverberation. As seen, the slope of the elbow curve changes as K increases, and at a specific K value the slope suddenly flattens. Figure 12a shows the sharp change in slope at K = 2, which means that the number of simultaneous speakers was two in this dataset for the classification in the specific time frames of the overlapped speech signal. Figure 12b–d similarly show the results for three, four, and five simultaneous speakers in various time frames, which clearly demonstrates the benefit of the elbow criterion for estimating the number of clusters (speakers). Strong noise and reverberation degrade the quality of the input data for the classification and, indirectly, the elbow criterion; however, the results of the elbow method in the proposed system are more reliable thanks to the WPT and the adaptive SRP function based on the PHAT and ML filters, which provide high-quality data for the elbow criterion.
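The agglomerative clustering and automatic elbow selection described above can be sketched as follows, using Ward linkage from scipy as a stand-in for the paper's unsupervised agglomerative classifier; the slope-difference rule mirrors the description above, but the parameters and demo data are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def elbow_num_clusters(points, k_max=6):
    """Estimate the number of clusters from the sharpest slope change
    of the within-cluster cost curve (elbow criterion)."""
    Z = linkage(points, method='ward')
    cost = []
    for k in range(1, k_max + 1):
        labels = fcluster(Z, t=k, criterion='maxclust')
        # total within-cluster sum of squared distances to the centroids
        cost.append(sum(((points[labels == l] -
                          points[labels == l].mean(axis=0))**2).sum()
                        for l in np.unique(labels)))
    cost = np.asarray(cost)
    drop = cost[:-1] - cost[1:]        # improvement from adding one cluster
    diffs = drop[:-1] - drop[1:]       # slope difference at K = 2 .. k_max-1
    return int(np.argmax(diffs)) + 2   # index 0 corresponds to K = 2

# demo: SRP peak positions concentrated around three speaker locations
rng = np.random.default_rng(0)
peaks = np.vstack([rng.normal(c, 0.05, size=(40, 2))
                   for c in [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]])
n_speakers = elbow_num_clusters(peaks)   # 3 for this demo
```

The cost curve drops steeply while real clusters are still being separated and flattens once K exceeds the true count, so the largest slope difference marks the estimated number of speakers.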
The proposed HNMA-SB-2DASRP method was compared to the FD-MSC [24], i-vector PLDA [25], AF-CRNN [26], and SC-DCCD [27] algorithms for two to five simultaneous speakers in noisy and reverberant environments using real and simulated data. Two categories of scenarios were considered for the evaluations. The first category evaluated the effects of noise and reverberation separately for a fixed number of simultaneous speakers (five overlapped signals): first, the noise was fixed and the reverberation variable, and second, the reverberation was fixed and the noise variable. In the second category, all methods were evaluated for different numbers of simultaneous speakers under high levels of noise and reverberation, i.e., the worst acoustical conditions. The results of the proposed system were therefore reported separately with respect to noise, reverberation, and the number of simultaneous speakers. The percentage of correctly estimated numbers of speakers was used as the accuracy measure. Since the algorithm should be reliable and repeatable, reporting the evaluation on one specific frame would not yield proper results; therefore, the proposed method was repeated on 100 frames, and the percentage of correct speaker counts was reported over these results. As mentioned, the first category of experiments was designed to evaluate the effects of noise and reverberation.
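The frame-level accuracy measure described above can be sketched as follows; the error pattern in the demo is synthetic, chosen only to illustrate the computation:

```python
import numpy as np

def counting_accuracy(estimates, true_count):
    """Percentage of frames whose estimated NoS equals the true count."""
    estimates = np.asarray(estimates)
    return 100.0 * np.mean(estimates == true_count)

# e.g., 100 frame-level estimates for a 5-speaker scenario,
# with occasional undercounting errors to 4 speakers
rng = np.random.default_rng(0)
est = np.where(rng.random(100) < 0.88, 5, 4)
acc = counting_accuracy(est, 5)   # percentage of correct frames
```

Averaging over 100 independent frames in this way reduces the variance of the reported accuracy compared with a single-frame evaluation.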
Figure 13a shows the results of the proposed HNMA-SB-2DASRP method in comparison with the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms for five simultaneous speakers with a fixed SNR and variable reverberation time, for real (dashed lines) and simulated (solid lines) data. By fixing the SNR at a high value, the noise was practically eliminated, and the effect of reverberation alone was evaluated on the proposed system. For example, in a highly reverberant environment, the proposed HNMA-SB-2DASRP method estimated the correct number of speakers with 88% accuracy, versus 69% for FD-MSC, 72% for i-vector PLDA, 76% for AF-CRNN, and 80% for SC-DCCD on the simulated data with five simultaneous speakers. On the real data, the proposed HNMA-SB-2DASRP method estimated the correct number of speakers in 84% of cases, versus 66% for FD-MSC, 70% for i-vector PLDA, 73% for AF-CRNN, and 78% for SC-DCCD, similar to the simulated scenario. As seen, most methods lose accuracy in highly reverberant scenarios, but the proposed HNMA-SB-2DASRP algorithm estimates the number of speakers with high accuracy.
Figure 13b compares the proposed HNMA-SB-2DASRP method to the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms for five simultaneous speakers with a fixed reverberation time and variable SNR. In this scenario, the proposed HNMA-SB-2DASRP method was compared to the other systems under a fixed reverberation level and a variable amount of noise, on real and simulated data. For example, in a reverberant and noisy environment, the proposed HNMA-SB-2DASRP method estimated the correct number of speakers in 83% of cases, compared to 64% for FD-MSC, 65% for i-vector PLDA, 70% for AF-CRNN, and 75% for SC-DCCD on the simulated data (solid lines). The figure also compares the methods on the real data: under the same conditions, the proposed HNMA-SB-2DASRP method reached 82%, more accurate than the 60% of FD-MSC, 63% of i-vector PLDA, 68% of AF-CRNN, and 71% of SC-DCCD, which shows the superiority of the proposed method in noisy scenarios. The results in this figure show better accuracy on the simulated data than on the real data, because the exact noise level and reverberation time (RT60) cannot be controlled in real recording scenarios. The figure also shows that all methods estimate the number of simultaneous speakers with high accuracy at high SNR and low reverberation time, but the accuracy decreases as more noise and reverberation are added. In contrast, the proposed method estimates the number of simultaneous speakers more reliably than the previous works.
The second category of the experiments was conducted for evaluating the proposed method on a variable number of speakers in noisy and reverberant scenarios.
Figure 14a shows a comparison of the proposed HNMA-SB-2DASRP method with the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms in noisy and reverberant scenarios for two to five simultaneous speakers using simulated data. This comparison was structured to evaluate the accuracy of each algorithm for various numbers of simultaneous speakers. For example, in the two-speaker scenario, the proposed HNMA-SB-2DASRP method estimated the correct number of speakers in 99% of cases, compared to 97% for FD-MSC, 98% for i-vector PLDA, 98% for AF-CRNN, and 99% for SC-DCCD. These results show that most methods estimate the correct number of speakers with high accuracy for small numbers of overlapped speech signals, even in noisy and reverberant conditions, due to the low interference and reverberation of the speech signal. For four simultaneous speakers, the proposed HNMA-SB-2DASRP method estimated the correct number of speakers in 89% of cases, compared to 71% for FD-MSC, 74% for i-vector PLDA, 79% for AF-CRNN, and 84% for SC-DCCD, which shows the superiority of the proposed method for high numbers of simultaneous speakers.
Figure 14b similarly shows the comparison between the proposed HNMA-SB-2DASRP method and the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms on the real data. For example, with two overlapped speakers, the percentage of correct speaker counts for the proposed method was 99%, compared to 95% for FD-MSC, 96% for i-vector PLDA, 96% for AF-CRNN, and 98% for SC-DCCD; all methods give close results for small numbers of simultaneous speakers on real data, similar to the simulated data. The methods were also compared for higher numbers of simultaneous speakers to show the effect of adding more speakers to the environment. For example, with four simultaneous speakers, the proposed method estimated the number of speakers with 87% accuracy, compared to 67% for FD-MSC, 71% for i-vector PLDA, 75% for AF-CRNN, and 82% for SC-DCCD. As seen, the accuracy of all previous methods decreases as the number of simultaneous speakers increases, but the proposed HNMA-SB-2DASRP algorithm estimates the number of overlapped speakers with higher accuracy and reliability than the other systems in noisy and reverberant scenarios with many speakers, which shows the superiority of the proposed speaker counting system over the previous works.
Finally, Table 1 reports the computational complexity of the proposed HNMA-SB-2DASRP method in comparison with the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms. The program run-time in seconds was used as the measure for comparing the complexity of the algorithms. The experiments were implemented in MATLAB on a laptop with an i7-10875H CPU (Intel, Santa Clara, CA, USA) at 2.3 GHz and 64 GB RAM. The results were obtained for noisy, reverberant, and noisy–reverberant scenarios with five simultaneous speakers using real and simulated data. As seen, the AF-CRNN algorithm has the highest computational complexity due to its neural network training and testing steps. For example, on simulated data, the run-time of the proposed HNMA-SB-2DASRP method was 32 s, compared to 52 s for FD-MSC, 44 s for i-vector PLDA, 78 s for AF-CRNN, and 42 s for SC-DCCD in the noisy and reverberant environment, which shows the low computational complexity of the proposed method. The results on the real data similarly show the superiority of the proposed HNMA-SB-2DASRP method: its run-time was 36 s, compared to 54 s for FD-MSC, 47 s for i-vector PLDA, 85 s for AF-CRNN, and 39 s for SC-DCCD in the noisy and reverberant environment. Therefore, the proposed HNMA-SB-2DASRP method is a proper instrument for estimating the number of simultaneous speakers in noisy and reverberant environments, thanks to its high accuracy, reliability, and low computational complexity.