1. Introduction
Language identification, as a front-end system for natural language processing technologies such as machine translation and multi-lingual information services, has become an active research topic, and language recognition in realistic noisy environments has received increasing attention in recent years. In noisy, multi-speaker environments, the human ear may not be able to separate and identify languages accurately, and language features are easily disturbed or masked, preventing a clear representation of the language information. It is therefore increasingly important to study speech in real-world scenarios with multiple speakers and multiple languages. In today’s globally integrated world, languages and dialects are mixed to an unprecedented degree. Neural network models have made it possible to separate and understand overlapped speech and have boosted multi-lingual processing research, bringing significant progress to applications involving low-resource languages and dialects [1]. In scenarios with multiple speakers and background noise, obtaining the monolingual information of each speaker is difficult, and this has become a topic of growing interest in speech separation and language identification in recent years. It is therefore necessary to perform speech separation before recognizing each monolingual speech signal and understanding its content. As the first step in the processing chain, speech separation largely determines the effectiveness of the downstream speech tasks. The term speech separation originated from “Cherry’s Cocktail Party Problem” [
2] in 1953, where a listener could effortlessly hear a person’s speech surrounded by other people’s speech and ambient noise even in a cocktail party-like sound environment. The speech separation problem is often called the “cocktail party problem.” The goal of speech separation is to extract one or more source signals from a mixed speech signal containing two or more sources, with each speaker corresponding to one of the source signals [
3]. The two main approaches currently studied are single-channel speech separation and multi-channel speech separation based on microphone arrays. This paper focuses on single-channel separation of two-speaker overlapped speech [4]. Although humans can focus on one of several overlapping voices, automatic methods still struggle to do so. Speech separation has enabled many speech processing algorithms, such as automatic speech recognition (ASR), to achieve better performance under multi-speaker conditions.
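To make the single-channel two-speaker setting concrete, the sketch below (a minimal NumPy illustration; the function name and the fixed signal-to-interference ratio are our own assumptions, not taken from a specific toolkit or corpus) builds a one-dimensional mixture from two mono source waveforms of equal length, which is the kind of input a single-channel separation network receives.

```python
import numpy as np

def mix_two_speakers(s1: np.ndarray, s2: np.ndarray, sir_db: float = 0.0):
    """Create a single-channel two-speaker mixture y = s1 + g * s2.

    The gain g scales the second source so that the pair is mixed at the
    requested signal-to-interference ratio (in dB). Both inputs are assumed
    to be mono waveforms of equal length.
    """
    p1 = np.mean(s1 ** 2) + 1e-12
    p2 = np.mean(s2 ** 2) + 1e-12
    g = np.sqrt(p1 / (p2 * 10 ** (sir_db / 10.0)))
    mixture = s1 + g * s2
    return mixture, s1, g * s2  # mixture and the two scaled reference sources
```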
Language identification is the process of automatically determining which language a given speech segment belongs to [1]. The key to language identification lies in feature extraction and the construction of language models. Standard language features are currently based mainly on acoustic and phonetic information. The mainstream acoustic-level features include Mel-frequency cepstral coefficients and Gammatone frequency cepstral coefficients [5]. These features are sensitive to noise, which leads to poor identification results. Phoneme-level language identification methods divide the speech into a phoneme sequence and then recognize the language from the phoneme correspondences between different languages [6]. Phoneme-level features are less affected by noise, but accurate phoneme segmentation is difficult to obtain, which degrades identification performance. Recent research has begun to apply deep learning to language identification, training neural networks as phonological models. This approach treats language identification as a classification problem and trains a CNN on speech spectrograms labeled with their languages.
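As a rough illustration of this CNN-on-spectrogram formulation, the sketch below (a minimal PyTorch example; the layer sizes and two-language setting are illustrative assumptions, not the model used in this paper) maps a log-mel spectrogram to language scores.

```python
import torch
import torch.nn as nn

class SpectrogramLanguageCNN(nn.Module):
    """Minimal CNN that classifies a log-mel spectrogram into one of n_langs languages."""

    def __init__(self, n_langs: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool over time and frequency
        )
        self.classifier = nn.Linear(32, n_langs)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, n_frames) log-mel spectrogram with a language label
        h = self.features(spec).flatten(1)
        return self.classifier(h)             # unnormalized language scores

# Example: logits = SpectrogramLanguageCNN(n_langs=2)(torch.randn(4, 1, 80, 300))
```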
However, most deep learning-based studies of single-channel speech separation evaluate the system with a single, relatively homogeneous metric. In our experiments, we analyze the performance of the front-end network indirectly, through either the separation metrics or the performance of the downstream tasks. This paper therefore focuses on the Single-Channel Speech Separation (SCSS) problem and applies it to complex multi-speaker, multi-lingual scenarios. The results obtained from the language identification network can be used as an intermediate reference index for the final SCSS network, helping to optimize our separation network.
The rest of the paper is organized as follows. The current state of relevant research is presented in
Section 2, the algorithmic model structure is described in
Section 3, the experimental setup is described in
Section 4, and the experimental results and analysis are presented in
Section 5.
2. Related Work
With the development of deep learning techniques, deep neural networks have been applied to speech separation tasks. One of the most common frameworks estimates a time-frequency mask for the speech signal with a neural network. The underlying assumption is that, after the short-time Fourier transform, the energy at each time-frequency point is the sum of the powers of all target sources at that point. Because different speakers’ speech has different distribution characteristics in the time-frequency domain, separation can be performed by estimating the correct time-frequency masking matrix. In the related speech enhancement task, where the interference is a noise signal, researchers have explored different loss functions. Qi, J. et al. [7] proposed improving vector-to-vector regression for feature-based speech enhancement with a new distribution loss. The approach takes the estimation uncertainty into account and avoids the burden of deploying additional speech enhancement models to obtain the final clean speech. In the same year, Qi, J. [8] also proposed using the mean absolute error (MAE) as the loss function for DNN-based vector-to-vector regression in speech enhancement and verified that DNN-based regression optimized with the MAE loss obtains lower loss values than its MSE counterpart. Siniscalchi, S.M. [9] derived an upper bound on the MAE of DNN-based vector-to-vector regression, which holds with or without the “over-parameterization” technique. Single-channel speech separation is the process of recovering multiple speakers’ signals from a one-dimensional speech mixture. The training goal of our speech separation system is to maximize the scale-invariant signal-to-noise ratio (SI-SNR), and the signal-to-distortion ratio improvement (SDRi) is used as the main objective metric of separation accuracy; the corresponding formulas are given in Section 4. In 2014, D. L. Wang et al. presented a supervised training method for the speech separation task [
10], comparing and analyzing the effects of different time-frequency features and different time-frequency masking matrices, including the Ideal Binary Mask (IBM) and the Ideal Ratio Mask (IRM), on the speech separation task, and comparing them with an NMF-based separation method. In 2016, Deep Clustering (DPCL) [11] was proposed: each time-frequency point is mapped to a high-dimensional embedding, and the training target is the mean square error between the affinity matrix of these embeddings and the affinity matrix formed from the label vectors. It achieves an SDRi of 10.8 dB. This is a clever solution to the label permutation problem, and by setting a different number of cluster centers it allows effective separation even when the number of speakers at test time differs from that during training. In 2017, Yu, D. et al. [12] proposed the permutation invariant training (PIT) algorithm, which computes the loss for every permutation of network outputs and labels and selects the smallest of these values as the training loss, thereby solving the label permutation problem; it achieves an SDRi of 10.9 dB. In 2017, Y. Luo et al. presented the Deep Attractor Network (DANet) [13], which, in the training phase, weights the embedding vector of each time-frequency point with the IRM to form attractor points and then generates time-frequency masks from the distances between the attractors and each time-frequency embedding. In 2019, the literature [14] presented a deep-learning-based computational auditory scene analysis (CASA) system, which combines the respective advantages of PIT and DPCL through two stages of training: the first stage uses PIT to separate speech at the time-frame level, and the second stage follows DPCL to cluster and recombine the separated speech from different time frames.
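To make the SI-SNR objective and the idea behind PIT concrete, the following sketch (a minimal PyTorch illustration under our own simplifying assumptions; the exact formulas used in this paper appear in Section 4) evaluates both possible output-to-reference assignments of a two-speaker system and trains on the better one.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for (batch, samples) tensors."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to get the scale-invariant target.
    s_target = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)

def pit_si_snr_loss(est1, est2, ref1, ref2) -> torch.Tensor:
    """Utterance-level PIT for two speakers: score both assignments, keep the best."""
    perm_a = si_snr(est1, ref1) + si_snr(est2, ref2)   # assignment (1->1, 2->2)
    perm_b = si_snr(est1, ref2) + si_snr(est2, ref1)   # assignment (1->2, 2->1)
    best = torch.maximum(perm_a, perm_b) / 2           # mean SI-SNR of the best permutation
    return -best.mean()                                # negative because SI-SNR is maximized
```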
In terms of time-domain speech separation, Y. Luo et al. presented a time-domain speech separation network (Tas-Net) [15], which takes the speech waveform directly as input, encodes and decodes the speech with gated convolutional layers, and separates it by estimating mask matrices over the encoded features with a separator built from an LSTM network. It has an SDRi of 10.8 dB. In 2019, Y. Luo improved on the Tas-Net model by replacing the LSTM separator with a Temporal Convolutional Network (TCN), yielding the fully convolutional time-domain speech separation network Conv-TasNet [16]. The TCN obtains a large receptive field with a small number of parameters and thus captures long-range contextual information better, reaching an SDRi of 15.3 dB. In 2020, Y. Luo proposed the Dual-Path Recurrent Neural Network (DPRNN) [
17], which performs speech separation by partitioning long sequences into smaller blocks and applying intra- and inter-block RNNs. In 2021, the literature [
18] presented the SepFormer network, which performs speech separation by replacing the original RNN network with a Transformer network. In the same year, a self-attentive network with an hourglass shape (Sandglasset) [
19] was proposed, which improves speech separation performance with a smaller model size and lower computational cost by focusing on multi-granularity features and capturing richer contextual information; it is relatively effective, with an SDRi of 21.0 dB. In 2022, the SepIt [20] speech separation network was proposed, which improves performance by iteratively refining the estimates of the different speakers. It is the most recent of these methods, with a relatively high SDRi of 22.4 dB.
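As a rough sketch of the dual-path idea behind DPRNN described above (segmenting a long encoded sequence into chunks, then modeling within and across chunks), the following PyTorch snippet shows one dual-path block; the dimensions, chunking layout, and module names are illustrative assumptions rather than the original DPRNN implementation.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """One dual-path block: an RNN within each chunk, then an RNN across chunks."""

    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.intra_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feat_dim)
        self.inter_rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_chunks, chunk_len, feat_dim), obtained by segmenting a long
        # encoded sequence into (possibly overlapping) chunks.
        b, k, s, d = x.shape
        # Intra-chunk pass: model short-range structure inside each chunk.
        intra = x.reshape(b * k, s, d)
        intra = self.intra_proj(self.intra_rnn(intra)[0]).reshape(b, k, s, d)
        x = x + intra                                    # residual connection
        # Inter-chunk pass: model long-range structure across chunks.
        inter = x.transpose(1, 2).reshape(b * s, k, d)
        inter = self.inter_proj(self.inter_rnn(inter)[0]).reshape(b, s, k, d).transpose(1, 2)
        return x + inter
```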
Language identification is essentially a process of classification and judgment, and the key lies in obtaining informative features and constructing language models. For a general language identification system, the recognition rate is the most important consideration: the accuracy of the recognition results must be guaranteed, and this is the starting point of all evaluation metrics. The most commonly used performance metrics are accuracy, precision, recall, and the F1 value; the formulas for these indicators are given in Section 4. Traditional language identification methods are mainly based on the Gaussian mixture model (GMM) and the identity vector (i-vector). The Gaussian mixture model-universal background model (GMM-UBM) approach was proposed in the literature [21], but it requires a large amount of data to estimate the covariance matrices. In the literature [22], a Gaussian mixture model-support vector machine (GMM-SVM) mean supervector classification algorithm was proposed, which improves recognition performance over the GMM-UBM method. The literature [23,24] used i-vector features extracted from the audio for language identification, which effectively improved the recognition results and became one of the primary language identification methods.
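For reference, the following minimal sketch shows how the four metrics above can be computed from predicted and true language labels with scikit-learn (the labels here are made up for illustration; the exact formulas used in this paper are given in Section 4).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical labels: 0 = language A, 1 = language B.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every language class equally.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"acc={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```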
As deep learning has come to be widely used in various tasks, researchers have also started to apply deep neural networks to language identification. In 2014, the literature [25] presented the first large-scale application of DNN models to language identification on short speech segments, operating on frame-level features. In addition, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) networks have also been applied to language identification, yielding a breakthrough improvement in performance [26,27,28,29]. Subsequently, Geng, W., Raffel, C., Mounika, K.V., et al. [30,31,32] proposed end-to-end language identification frameworks based on the attention mechanism, which improve identification by using attention to extract the speech-feature information that is most valuable for language discrimination. In 2018, Snyder, D. [
33] proposed the x-vector language identification method, which outperformed the i-vector approach. In 2019, Cai, W. et al. [34] proposed an attention-based CNN-BLSTM model that performs language recognition at the utterance level and obtains utterance-level decisions directly from the output of the neural network. In 2020, Aitor Arronte Alvarez et al. [
35] proposed an end-to-end Res-BLSTM language identification method that combines residual blocks with a BLSTM network. However, as noted above, most deep learning-based studies of single-channel speech separation evaluate the system with a single, relatively homogeneous metric, and the performance of the front-end network is only analyzed indirectly through such metrics or through the performance of downstream tasks.
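To illustrate the attention-based pooling shared by several of the systems above, the sketch below (a generic PyTorch example; the layer sizes and module layout are our own assumptions, not the architecture of any cited paper) turns frame-level BLSTM outputs into a single utterance-level language decision.

```python
import torch
import torch.nn as nn

class AttentiveBLSTMClassifier(nn.Module):
    """Frame-level BLSTM followed by attention pooling and a language classifier."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256, n_langs: int = 2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attention = nn.Sequential(nn.Linear(2 * hidden, 128), nn.Tanh(), nn.Linear(128, 1))
        self.classifier = nn.Linear(2 * hidden, n_langs)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_frames, feat_dim) acoustic features (e.g., log-mel filterbanks)
        frames, _ = self.blstm(feats)                            # (batch, n_frames, 2*hidden)
        weights = torch.softmax(self.attention(frames), dim=1)   # per-frame importance
        utterance = (weights * frames).sum(dim=1)                # utterance-level embedding
        return self.classifier(utterance)                        # language scores
```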
In this paper, building on the above research on single-channel speech separation and language identification, the speech separation task is applied to a complex multi-speaker, multi-lingual scenario to obtain the monolingual information of each speaker and thus facilitate subsequent speech recognition or other operations. The experimental results obtained from the language identification network can also be used as an intermediate reference index for the final single-channel speech separation network, helping to optimize our separation network.