1. Introduction
Speaker recognition is a branch of biometric technology that is widely used in security applications such as access control, employee clock-in systems, bank vaults, robots, and smart living. Speaker recognition technologies can be divided into text-dependent [1] and text-independent [2] approaches. In text-dependent speaker recognition, the speaker must utter a sentence with specific content; for example, ref. [1] applied a wake-up word model to speaker recognition. Text-independent speaker recognition places no restriction on sentence content, and recognition focuses on confirming the speaker’s identity; for example, ref. [2] proposed a text-independent speaker verification method for noisy environments by fusing different features. Text-independent forensic speaker recognition was studied in [3]. In forensic speaker recognition, a suspect’s identity must be established from recorded voice evidence [4,5]. Text-dependent speaker recognition is easier to train but more limited, because the same sentence must be spoken during recognition. Text-independent speaker recognition is more challenging because it usually requires more data to achieve good results, but it is more flexible in application. Our research aims to construct a text-independent speaker recognition method.
We discuss two main aspects of the speaker recognition process: feature extraction and speaker modeling. In previous studies, many techniques have been developed for extracting features from sound signals. For example, linear frequency cepstral coefficients (LFCCs) [2,6] and mel-frequency cepstral coefficients (MFCCs) [2,6,7] use filter-bank coefficients for feature extraction. The differences between LFCC and MFCC are described in [2]. LFCC treats all frequencies of the sound signal as equally important, whereas MFCC is designed to reflect how the human ear perceives different frequencies. Because the human ear is more sensitive to changes in low-frequency sounds, MFCC emphasizes the low-frequency components of the signal. The MFCC computation is described in detail in [8]. Kinnunen and Li [9] surveyed text-independent speaker recognition techniques and discussed and recommended feature choices, including MFCC features. Sahidullah and Saha [10] proposed a new windowing technique for MFCC computation and compared it with earlier windowing techniques in terms of speaker recognition accuracy. In [7], dynamic time warping (DTW) and features extracted using MFCC were used for speaker recognition. LFCC and MFCC have both been used to classify insect songs [6]; both performed well, but because most insect sounds are high-frequency signals, LFCC features gave better results than MFCC features in that task. Spoken language identification has used MFCC features together with gammatone cepstral coefficient features [11]. Because our goal is speaker recognition for the human voice, we use MFCC for feature extraction, as it is better suited to human speech.
Many models have recently been applied in the field of voice recognition; their primary purpose is to produce a recognition output after processing the input data. Hidden Markov models (HMMs) were first applied to speaker-independent word recognition in [12]. HMMs use multiple states to model hidden, unknown parameters and are predominantly used for word recognition. Ref. [13] used a Gaussian mixture model (GMM) to decompose the input signal into multiple Gaussian components that represent the spectral shape of the speaker. Because a GMM models the input signal with multiple Gaussian components, it is primarily used to construct and recognize identity models. Ref. [14] used a Gaussian mixture model–universal background model (GMM-UBM) to develop a language identification (LID) system and improved GMM performance through the universal background model (UBM). The GMM-UBM has the disadvantage of requiring a large amount of data to construct a model: the UBM builds a general background model from data of different categories, and the data of each category are then used to adapt the UBM parameters. The recent boom in neural network (NN) technology has brought significant progress in artificial intelligence (AI), with many applications in voice recognition. The NN architecture [15] consists of input, hidden, and output layers. The input layer defines the size of the input data. The hidden layers form the core of the architecture and are usually designed according to the task requirements. The output layer produces the output data, and its form varies with the requirements; for example, a classifier outputs the probability of each category, whereas the speaker recognition methods in this study output a set of feature vectors through a neural network. A deep denoising autoencoder composed of an NN was used to generate denoised feature vectors, and a GMM-HMM was used for word recognition [16]. Training a single deep neural network (DNN) for both speaker and language recognition was proposed in [17]. A method for Chinese text-to-speech (TTS) using a recurrent neural network (RNN) was proposed in [18]. The differences between RNNs and general NNs are described in [19]: each hidden layer of an RNN receives feedback from the previous time step, so the RNN retains a memory of earlier inputs and can process dynamic information. However, as the length of the time series increases, the memory retained from early time steps gradually vanishes. RNNs are therefore unsuitable for long-term memory, and long short-term memory (LSTM) overcomes this problem [20]. LSTM and RNN are introduced in detail in [21], and [22] describes how LSTM, a variant of the RNN, alleviates the vanishing gradient problem of RNNs. LSTM adds three gates to control long-term memory: the input, forget, and output gates [23]. The input gate determines which information enters the memory path, the forget gate determines which information is deleted from the memory path, and the output gate determines which information is output [24]. Many voice recognition applications use LSTM. For example, ref. [25] used an attention-based LSTM for speech emotion classification, and ref. [26] studied how nonverbal vocal fragments can improve the accuracy of speech emotion recognition, using LSTM as the emotion recognition model. Ref. [27] proposed the bidirectional LSTM (BLSTM), which is composed of two LSTMs operating in opposite directions; it therefore has forward and backward memory paths, and the number of parameters is twice that of an LSTM. BLSTM has several applications in voice recognition, including speech gender classification [28], speech emotion recognition [29], and native language identification from brief speech utterances [30].
This paper proposes three new speaker recognition methods and a new cluster training method. The proposed speaker recognition methods are (1) the long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL) method; (2) the bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL) method; and (3) the bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL) method. In the LSTM-MFCC-TL method, we use MFCC to extract sound features and feed them into an LSTM for training. The LSTM outputs a set of feature vectors representing the speaker, and the loss function used in training is the triplet loss. Triplet loss originates from the FaceNet [31] architecture published by Google for face recognition systems. We use triplet loss as the loss function because, during training, it shortens the distance between samples from the same speaker and increases the distance between samples from different speakers. All three methods are trained with the cluster training method, which divides the training data into small groups before training and trains all groups in sequence. The second method, BLSTM-MFCC-TL, uses a BLSTM model, giving the network a bidirectional memory path so that it can obtain features that better represent the speaker from both the forward and reverse directions of the time series. The third method, BLSTM-MFCCAE-TL, generates a set of autoencoder (AE) features with the pretrained AE of [16] and combines them with the MFCC features as input to model training. Because [16] targets denoising word recognition whereas our work targets speaker recognition, the training objectives and datasets differ and the two cannot be compared directly. We therefore follow the model architecture of [16] to construct a baseline speaker recognition method, the Gaussian mixture model and hidden Markov model with mel-frequency cepstral coefficients and autoencoder features (GMM-HMM-MFCCAE) method. The main contributions of this study are as follows:
The new cluster training method proposed in this study solves the problem of the hardware being unable to load all the training data at once.
We confirm that using BLSTM yields more useful speaker features than using LSTM for speaker recognition.
Extracting a set of AE features through a pretrained autoencoder, combining them with MFCC features, and inputting them into the model for training can effectively improve speaker recognition accuracy.
Section 2 describes the three speaker recognition methods proposed in this paper in detail.
Section 3 discusses the experimental methods and results of each method proposed in this study.
Section 4 presents the conclusions of this paper.
2. Proposed Methods for Speaker Recognition
In earlier sound-processing approaches, the sheer number of raw signal samples imposed a considerable burden on computers, and it is difficult to find useful characteristics in such signals because they contain noise and irrelevant components; extensive audio processing is therefore required. Many feature extraction techniques have been developed that have made significant contributions to audio signal processing, such as MFCC [2]. This paper studies text-independent [2] speaker recognition, which does not need to retain the original sound signal; therefore, MFCC is used to extract the features of the sound sequence. This study uses neural network [15] techniques to construct the speaker identification model. First, the results of MFCC feature extraction are used as inputs to the model. After passing through the neural network model, a set of feature vectors representing the speaker’s identity is obtained, and similarity distance calculations are then used to determine the accuracy of the identification results. This paper presents three speaker recognition methods in total. In terms of deep learning models, the LSTM-MFCC-TL and BLSTM-MFCC-TL methods employ the LSTM and BLSTM models, respectively. The BLSTM-MFCCAE-TL method also utilizes the BLSTM model and employs a pre-trained autoencoder to generate AE features: we encode the speakers’ speech and concatenate the resulting AE features with the original MFCC features as input for training the BLSTM model. These autoencoder-encoded features enhance learning effectiveness and provide better speaker recognition accuracy.
The main contribution of this paper centers on the proposed BLSTM-MFCCAE-TL method. In this method, we employ an autoencoder to encode the speaker information in the speech signal and concatenate the resulting AE features with the MFCC features as input to the deep learning model. Because the AE features encode the speaker, incorporating them effectively enhances the learning performance of the model. The following subsections discuss the three proposed methods in detail.
2.1. Long Short-Term Memory with Mel-Frequency Cepstral Coefficients for Triplet Loss (LSTM-MFCC-TL) Method
The first method in this paper is the LSTM-MFCC-TL method. The speaker recognition process of the proposed LSTM-MFCC-TL method is illustrated in Figure 1. We divide the LSTM-MFCC-TL method into five steps. The first step performs MFCC feature extraction on the input audio signal. The second step performs feature preprocessing on sound feature sequences of different lengths. The third step groups all input features for subsequent training using the cluster training method. The fourth step constructs the model using LSTM [21]. In the fifth step, the loss function used during training is the triplet loss [31].
2.1.1. MFCC Feature Extraction
It is challenging to observe characteristics directly in the original sound signal. Therefore, in sound signal processing, the original signal is usually converted into a spectral representation for feature observation and processing. MFCCs are a well-known feature extraction method in traditional sound recognition and remain a popular feature in speech and audio processing [9]. MFCCs were adopted for speech recognition in the early 1980s and later introduced for speaker recognition. In our three proposed methods, we use MFCCs as the input feature because their design follows human auditory perception, which allows them to effectively capture the features that matter for speaker recognition.
The MFCC feature extraction process is shown in Figure 2. The sequential steps of MFCC feature extraction are pre-emphasis, framing, Hamming windowing, fast Fourier transform (FFT), mel filter bank processing, and discrete cosine transform (DCT).
First, pre-emphasis boosts the high-frequency part of the input sound signal to compensate for the high-frequency attenuation caused by the radiation of the vocal cords and lips. Pre-emphasis is calculated as follows:

$$y[n] = x[n] - \alpha \, x[n-1],$$

where $x[n]$ is the $n$-th input sample, $\alpha$ is the filter coefficient, and $y[n]$ is the $n$-th output after pre-emphasis. The filter coefficient can be set between 0.9 and 1 and is usually set to 0.95. After pre-emphasis, framing divides the signal into shorter time frames, each 20–40 ms long. After framing, each frame is passed through a window function; the Hamming window is used here. We use $\tilde{x}_i(n)$ to represent the output of the $i$-th frame after passing through the Hamming window, calculated as follows:

$$\tilde{x}_i(n) = x_i(n)\, w(n),$$

where $x_i(n)$ is the $n$-th sample of the $i$-th input frame and $w(n)$ is the Hamming window applied to each frame. Furthermore, $w(n)$ is calculated as follows:

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N_w - 1}\right), \quad 0 \le n \le N_w - 1,$$

where $N_w$ denotes the number of samples in each frame. An FFT is then performed on the windowed signal to convert it from the time domain to the frequency domain; the FFT is a fast algorithm for computing the discrete Fourier transform (DFT). The output of the FFT is converted to the mel-frequency domain through mel filter bank processing. We define $\mathrm{mel}(f)$ as the mel frequency corresponding to the physical frequency $f$, calculated as follows:

$$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$

After the mel filter bank outputs are obtained, the DCT is used to compute 12 MFCCs for each frame. In addition to these 12 coefficients, the energy of each frame obtained after the framing step is included as one of the MFCC features, so each frame yields 13 MFCC features.
Next, we use the original 13 MFCC features to compute the delta cepstrum, resulting in 13 delta features that represent the temporal changes of the cepstral parameters. The delta cepstrum is calculated as follows:

$$\Delta c_m(t) = \frac{\sum_{k=1}^{K} k \left[ c_m(t+k) - c_m(t-k) \right]}{2 \sum_{k=1}^{K} k^2},$$

where $c_m(t)$ is the $m$-th original MFCC feature at time $t$, $\Delta c_m(t)$ is the corresponding delta feature, and $K$ is the number of differential observations before and after time $t$. Applying this computation twice yields the delta and delta-delta coefficients, two sets of 13 features, which, together with the original 13 MFCC coefficients, give 39 MFCC features. Therefore, each audio frame yields an MFCC feature vector of length 39 after MFCC feature extraction. However, because the lengths of the audio signals differ, the number of audio frames also differs.
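To make the 39-dimensional feature layout concrete, the following sketch shows one possible way to compute 13 MFCCs plus delta and delta-delta coefficients with librosa. The file name and the frame parameters (n_fft, hop_length) are illustrative assumptions rather than values taken from the paper; only the 16 kHz sampling rate and the 0.95 pre-emphasis coefficient come from the text.

```python
# Illustrative sketch (not the authors' code): 13 MFCCs + delta + delta-delta = 39 features.
import librosa
import numpy as np

signal, sr = librosa.load("speaker_utterance.wav", sr=16000)  # assumed file name

# Pre-emphasis with filter coefficient 0.95, as described above.
emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

# 13 static coefficients; frame length and hop are assumed (32 ms / 10 ms at 16 kHz).
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160, window="hamming")
delta1 = librosa.feature.delta(mfcc)           # first-order (delta) coefficients
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (delta-delta) coefficients

features = np.concatenate([mfcc, delta1, delta2], axis=0).T  # shape (frames, 39)
print(features.shape)
```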
2.1.2. Feature Preprocessing
Because the lengths of the original audio recordings differ, the number of frames differs, and so does the number of frames along the vertical axis of the MFCC feature vector. However, the input size must be identical when training a neural network model. Therefore, we normalize the number of frames of the MFCC features so that all inputs have the same size. For example, two MFCC feature vectors with different numbers of frames both become feature vectors of size $(N, 39)$ after feature preprocessing, where $N$ is the set frame quantity.
Consider an MFCC feature vector with $n$ frames as an example. If $n$ is less than $N$, zero frames must be appended: the original MFCC feature vector is merged with a zero vector of size $(N - n, 39)$ to obtain an MFCC feature vector of size $(N, 39)$. If $n$ is greater than $N$, the original MFCC feature vector is divided into $C = \lfloor n / N \rfloor$ feature vectors of size $(N, 39)$; the remaining frames are then zero-padded as above, giving $C + 1$ MFCC feature vectors of size $(N, 39)$. In this study, the number of frames $N$ is fixed in the experiments, so after this step the size of all features is unified to $(N, 39)$.
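A minimal sketch of this padding-and-splitting step is shown below; the helper name preprocess_frames and the NumPy array layout of (frames, 39) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: unify the number of frames of an MFCC feature matrix to a
# fixed N by zero-padding short utterances and splitting long ones into segments.
import numpy as np

def preprocess_frames(mfcc, n_frames):
    """mfcc: array of shape (frames, 39); returns a list of (n_frames, 39) arrays."""
    segments = []
    full, remainder = divmod(len(mfcc), n_frames)
    for c in range(full):                          # C complete segments of N frames
        segments.append(mfcc[c * n_frames:(c + 1) * n_frames])
    if remainder > 0 or full == 0:                 # zero-pad the leftover frames
        tail = mfcc[full * n_frames:]
        pad = np.zeros((n_frames - len(tail), mfcc.shape[1]))
        segments.append(np.vstack([tail, pad]))
    return segments
```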
2.1.3. Cluster Training Method
During the training of the three methods proposed in this paper, limitations of the hardware equipment force training to be carried out on small groups of data. The architecture of the cluster training method is shown in Figure 3, using the grouping adopted in this study as a demonstration. The dataset used in this study has 400 speaker categories, which we split into 40 groups of 10 categories each. All groups are then trained sequentially, and their loss and accuracy are recorded. Once all 40 groups have been trained, the loss values and accuracies of the groups are averaged to obtain the loss value and accuracy of that training epoch.
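The cluster training loop can be sketched as follows; the function names (cluster_training_epoch, make_groups, train_on_group) are assumptions for illustration only, not the authors' code.

```python
# Illustrative sketch of the cluster training idea: 400 speaker classes split into
# 40 groups of 10, trained in sequence, with per-group loss/accuracy averaged per epoch.
import numpy as np

def make_groups(class_ids, group_size=10):
    # e.g. 400 speaker IDs -> 40 groups of 10 classes each
    return [class_ids[i:i + group_size] for i in range((0), len(class_ids), group_size)]

def cluster_training_epoch(model, grouped_data, train_on_group):
    losses, accs = [], []
    for group in grouped_data:              # train each small group in sequence
        loss, acc = train_on_group(model, group)
        losses.append(loss)
        accs.append(acc)
    # epoch-level loss/accuracy = average over the 40 groups
    return float(np.mean(losses)), float(np.mean(accs))
```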
2.1.4. LSTM Model
RNNs use their feedback connections to store representations of recent input events [22]. However, with traditional backpropagation through time or real-time recurrent learning, gradients tend to explode or vanish as error signals flow backward in time. To improve error backflow, a novel recurrent architecture with an appropriate gradient-based learning algorithm, the LSTM, was proposed [22]. Even with noisy input sequences, LSTMs can learn to bridge time intervals exceeding 1000 steps without losing short-time-lag capabilities, because the architecture enforces constant error flow through the internal states of special units, so the error neither explodes nor vanishes. Therefore, LSTM-based models process time series data better than plain RNN-based models. The LSTM model architecture in the LSTM-MFCC-TL method is shown in Figure 4, and the relevant model parameters are listed in Table 1. A feature vector of size $(N, 39)$ obtained from feature preprocessing is used as the input, the LSTM is the central part of the model, and a dropout layer is used to prevent the model from overfitting. A flatten layer then flattens the two-dimensional output of the LSTM and dropout layers into a one-dimensional vector. Finally, a dense layer with 128 neurons forms the output of the LSTM model, producing a feature vector of size $(1, 128)$.
In neural network terminology, the network is divided into input, hidden, and output layers. In the input layer, we take the preprocessed MFCC feature vector of size $(N, 39)$ as input, where $N$ is the fixed number of frames set during preprocessing. For the hidden layers, we use an LSTM layer with the number of units listed in Table 1, followed by a dropout layer that drops units at the rate given in Table 1 to prevent the model from overfitting.
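A minimal Keras sketch of this LSTM branch is shown below. The frame count, LSTM unit count, and dropout rate are placeholders standing in for the values in Table 1; only the 39-dimensional input and the 128-dimensional output embedding come from the text.

```python
# A minimal sketch (not the authors' code) of the LSTM -> Dropout -> Flatten -> Dense branch.
import tensorflow as tf

N_FRAMES, N_FEATURES = 100, 39      # frame count is an assumed placeholder; 39 MFCC features
EMBEDDING_DIM = 128                 # output speaker embedding of size (1, 128)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True),   # assumed hidden size
    tf.keras.layers.Dropout(0.5),                        # assumed dropout rate
    tf.keras.layers.Flatten(),                           # flatten the 2D recurrent output
    tf.keras.layers.Dense(EMBEDDING_DIM),                # speaker feature vector
])
model.summary()
```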
Figure 5 shows a schematic of the LSTM architecture, highlighting that the LSTM model maintains the memory selection of each time step in the LSTM path, so that the LSTM retains the characteristics of long-term memory.
In Figure 5, each cell represents the internal structure of the LSTM. When the network moves to the next time step, part of the memory selected at the current time step is retained in the path. This is possible because the LSTM has four gates: an input information gate ($g$), a forget gate ($f$), an input gate ($i$), and an output gate ($o$). The architecture of the LSTM model is shown in Figure 6. The computation of each gate in Figure 6 can be written in the general form

$$\mathrm{gate}_t = \mathrm{activation}\left(W x_t + U h_{t-1} + b\right),$$

where $\mathrm{gate}_t$ is the output of the gate at the $t$-th time step and $\mathrm{activation}$ is the activation function of the gate; $W x_t$ combines the information $x_t$ of the $t$-th time step with its weight matrix $W$ for each gate; $U h_{t-1}$ combines the prediction $h_{t-1}$ of the $(t-1)$-th time step with its weight matrix $U$ for each gate; and $b$ is the bias of each gate. The equations of the entire LSTM model are then organized as follows:

$$g_t = \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right),$$
$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right),$$
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right),$$
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t,$$
$$h_t = o_t \odot \tanh\left(c_t\right),$$

where $g_t$ is the input information gate, $f_t$ is the forget gate, $i_t$ is the input gate, $o_t$ is the output gate, $c_t$ is the memory cell that stores the state of the $t$-th time step, and $h_t$ is the predicted value. Both $\sigma$ and $\tanh$ are activation functions, where $\sigma$ is the sigmoid function.
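The gate equations above can be illustrated with a single NumPy time step. The parameter dictionaries W, U, and b are placeholders, and the code is a sketch of the standard LSTM update rather than the exact implementation used in the paper.

```python
# A NumPy sketch of one LSTM time step implementing the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts holding the parameters of the g, i, f, o gates (assumed layout)."""
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # input information gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = f * c_prev + i * g                               # memory cell update
    h_t = o * np.tanh(c_t)                                 # hidden state / prediction
    return h_t, c_t
```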
2.1.5. Loss Function
During model training, we choose the triplet loss as the loss function. Triplet loss randomly selects a sample from all training samples as the anchor, then randomly selects a sample of the same category as the anchor, called the positive, and finally randomly selects a sample of a different category from the anchor, called the negative. These three samples form the triplet in triplet loss. For the anchor, positive, and negative, the triplet loss can be written as

$$L = \max\left(d(a, p) - d(a, n) + M,\; 0\right),$$

where the anchor, positive, and negative are denoted by $a$, $p$, and $n$, respectively; $M$ is the margin, an adjustable parameter greater than 0; and $d$ is the distance function, for which the Euclidean distance [32,33] is used here. We define $d(x, y)$ as the Euclidean distance between $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$, calculated as follows:

$$d(x, y) = \sqrt{\sum_{j=1}^{k} \left(x_j - y_j\right)^2}.$$
After calculating the Euclidean distances from the anchor to the positive and to the negative, the triplets can be divided into the following three types. The first case, $d(a, p) + M < d(a, n)$, is called an easy triplet; the anchor and positive are close, which is the best case. The second case, $d(a, p) < d(a, n) < d(a, p) + M$, is called a semi-hard triplet; although the anchor is close to the positive, it is also close to the negative, because the difference between the two distances lies within the margin. The third case, $d(a, n) < d(a, p)$, is called a hard triplet; the distance between the anchor and the positive is large, which is the worst case.
Figure 7 shows the difference before and after the triplet-loss operation. During training, triplet loss shortens the distance between the anchor and the positive and extends the distance between the anchor and the negative, so that after training the anchor and the positive have become close and the distance between the anchor and the negative has increased significantly, as shown in Figure 7.
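The loss and the three triplet types can be expressed directly in code; the following NumPy sketch uses an assumed margin of 0.2 for illustration and is not the authors' implementation.

```python
# A NumPy sketch of the triplet loss and the easy / semi-hard / hard cases above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

def triplet_type(anchor, positive, negative, margin=0.2):
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    if d_ap + margin < d_an:
        return "easy"       # loss is zero: negative is well outside the margin
    if d_ap < d_an:
        return "semi-hard"  # negative lies inside the margin
    return "hard"           # negative is closer to the anchor than the positive
```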
2.2. Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients for Triplet Loss (BLSTM-MFCC-TL) Method
Based on the LSTM-MFCC-TL method, we propose the BLSTM-MFCC-TL method, which replaces the LSTM with a BLSTM. A BLSTM is composed of a two-layer, two-direction LSTM that obtains bidirectional information, expanding the model’s receptive field. The speaker recognition process of the proposed BLSTM-MFCC-TL method is shown in Figure 8. We divide the BLSTM-MFCC-TL method into five steps; only the fourth step differs from the LSTM-MFCC-TL method, in that BLSTM [27] is used as the training model.
The BLSTM model used in the fourth step of the BLSTM-MFCC-TL method is shown in Figure 9, and its parameter settings are listed in Table 2. Owing to the bidirectional architecture, the size of the recurrent layer increases to 256, twice that of the LSTM architecture. Although the BLSTM architecture increases the number of parameters, it helps improve accuracy. The output size of the fully connected layer is the same as in the LSTM-MFCC-TL method; therefore, at the end of the fourth step, the BLSTM model outputs a feature vector of size $(1, 128)$.
The BLSTM [27] used in the BLSTM model is a bidirectional application of the LSTM architecture. Figure 10 shows a schematic of the BLSTM architecture: the computation is split into a forward LSTM and a reverse LSTM, which together form the bidirectional two-layer LSTM.
In Figure 10, the input x-sequence of the BLSTM is passed to both the forward LSTM and the reverse LSTM, and the predicted values of the two LSTMs are combined as the output y-sequence. With a unidirectional LSTM architecture, information can only be accumulated in the forward direction. The advantage of BLSTM is that it obtains information from both directions, expanding the receptive field of the model; therefore, BLSTM can achieve better performance than a standard LSTM.
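As a sketch, wrapping the LSTM layer of the earlier model in a Keras Bidirectional wrapper yields the BLSTM variant; the frame count, unit count, and dropout rate remain illustrative placeholders rather than the paper's exact settings.

```python
# A minimal sketch of the BLSTM variant: the Bidirectional wrapper doubles the
# recurrent output dimension relative to the single-direction LSTM.
import tensorflow as tf

N_FRAMES, N_FEATURES, EMBEDDING_DIM = 100, 39, 128   # frame count is an assumption

blstm_model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),                    # assumed dropout rate
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(EMBEDDING_DIM),            # speaker feature vector (1, 128)
])
```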
2.3. Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients and Autoencoder Features for Triplet Loss (BLSTM-MFCCAE-TL) Method
A pre-trained autoencoder was employed to encode speakers’ voices and achieved good results in word recognition [16]. Motivated by this success, we expect the same autoencoder architecture to also improve accuracy in speaker recognition. We use a pre-trained autoencoder to encode MFCC features into AE features that encode the speaker; therefore, in our proposed BLSTM-MFCCAE-TL method, we achieve better speaker recognition accuracy by combining MFCC features with AE features. We also follow the model architecture of [16] to construct the GMM-HMM-MFCCAE method for speaker recognition; because the architecture in [16] was designed for denoising word recognition, its speaker recognition accuracy is only 64.48%, which is low.
The speaker recognition process of the proposed BLSTM-MFCCAE-TL method is shown in Figure 11. We divide the BLSTM-MFCCAE-TL method into six steps: MFCC feature extraction, feature preprocessing, the AE model, cluster training, the BLSTM model, and the loss function. In the third step, we follow [16] to construct a pretrained AE model, combine the AE features output by the AE model with the MFCC features, and feed them to the BLSTM model trained in the fifth step. The third and fifth steps are described in detail below.
2.3.1. AE Model
The AE architecture used in the third step of the proposed BLSTM-MFCCAE-TL method is shown in Figure 12. An AE is composed of a neural network whose architecture is divided into an encoder and a decoder. The encoder encodes the input data into a set of vectors, and the decoder takes the encoder output as input and decodes it. According to [16], additional features obtained from a pretrained AE can improve recognition accuracy. In the AE of the third step, the encoder takes the MFCC features as input, passes them through a layer of 2048 neurons, and outputs 1024 neurons; the decoder then decodes the 1024 neurons and outputs a reconstruction of the same size as the input. In the pretraining stage, the input and target of the AE are two audio features of the same speaker, so the AE is trained on text-unrelated sound features to find a feature vector that represents the speaker. After pretraining, we take the first 1014 values of the encoder output layer and reshape them into a two-dimensional feature vector, which we call the AE feature. The MFCC features of size $(N, 39)$ are then combined with the AE features, and the combined feature vector is used as the input of the BLSTM model in the next step.
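A hedged sketch of this encoder-decoder structure is given below; the flattened input size and the ReLU activations are assumptions, while the 2048-unit hidden layer and the 1024-dimensional code follow the description above.

```python
# A minimal sketch of the pretrained autoencoder: 2048-unit encoder layer,
# 1024-dimensional code, and a decoder reconstructing the flattened MFCC input.
import tensorflow as tf

INPUT_DIM = 100 * 39                       # assumed size of the flattened MFCC input

inputs = tf.keras.Input(shape=(INPUT_DIM,))
hidden = tf.keras.layers.Dense(2048, activation="relu")(inputs)   # encoder hidden layer
code = tf.keras.layers.Dense(1024, activation="relu")(hidden)     # 1024-dimensional code
decoded = tf.keras.layers.Dense(INPUT_DIM)(code)                  # decoder reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, code)     # used later to produce the AE features
autoencoder.compile(optimizer="adam", loss="mse")
# After pretraining, the first 1014 values of the code would be reshaped into the
# two-dimensional AE feature described in the text.
```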
2.3.2. BLSTM Model
After the proposed BLSTM-MFCCAE-TL method obtains the AE features from the AE model, the AE and MFCC features are combined into a single feature vector, and this feature vector is input to the BLSTM model for training. The BLSTM model architecture used in the fifth step of the BLSTM-MFCCAE-TL method is shown in Figure 13, and its parameters are listed in Table 3. Because the BLSTM-MFCCAE-TL method adds the AE features, the feature size in the input layer increases from $(N, 39)$ in the BLSTM-MFCC-TL method to the larger combined size listed in Table 3.
3. Experiments and Analysis
To investigate the performance of the methods in this paper, we conducted speaker recognition experiments on the AISHELL-1 dataset [34]. AISHELL-1 is a public Mandarin speech dataset recorded with a high-fidelity microphone (44.1 kHz, 16-bit) and down-sampled to 16 kHz; a total of 400 speakers from different accent regions of China participated in the recording. In our experiments, we use the entire dataset of 400 speakers, divided into a training set and a validation set in proportions of 90% and 10%, respectively.
Table 4 lists the experimental parameters of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods. The main models in Table 4 are the LSTM and BLSTM architectures of the three methods; the input feature shape is the feature size fed into the main model, and the output feature shape is the feature size produced by the main model. Table 4 shows that the LSTM-MFCC-TL and BLSTM-MFCC-TL methods both take MFCC feature vectors of size $(N, 39)$ as the input of the main model, whereas the main model of the BLSTM-MFCCAE-TL method takes the larger combination of MFCC and AE features as input. The main model of the LSTM-MFCC-TL method is the LSTM architecture, and the main model of the BLSTM-MFCC-TL and BLSTM-MFCCAE-TL methods is the BLSTM architecture. The learning rate used during training is 0.00001, the optimizer is Adam, the batch size is 64, and the loss function is the triplet loss. The number of training epochs is 20, and the output feature vector has size (1, 128).
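The training configuration in Table 4 could be wired up roughly as follows. The triplet-loss implementation (TensorFlow Addons' semi-hard variant) and the data variables are assumptions for illustration, whereas the Adam optimizer, the 0.00001 learning rate, the batch size of 64, and the 20 epochs come from the text.

```python
# A hedged sketch of the training setup; `model` is the embedding network sketched
# earlier, and the data variables are hypothetical placeholders for the preprocessed
# features and integer speaker labels.
import tensorflow as tf
import tensorflow_addons as tfa   # provides a ready-made semi-hard triplet loss

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # 0.00001, as in Table 4
    loss=tfa.losses.TripletSemiHardLoss(),                   # one possible triplet-loss choice
)
history = model.fit(
    train_features, train_speaker_ids,
    validation_data=(val_features, val_speaker_ids),
    batch_size=64,   # as in Table 4
    epochs=20,       # as in Table 4
)
```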
Next, each model was trained with the experimental parameters and methods described above. During training, the loss value and accuracy of each epoch were recorded to observe the training behavior of each model from the numerical changes. First, we compare the loss values of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods, as shown in Figure 14 and Table 5.
The six curves in Figure 14 show the training and validation loss values of the LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods over the epochs. In the second epoch, the loss value of each method drops significantly. Subsequently, the gap between the training and validation losses of each method narrows, and the training loss gradually stabilizes. Table 5 compares the loss values of the three proposed methods at the 20th epoch, in particular to check whether the gap between training and validation losses indicates overfitting. At the 20th epoch, the LSTM-MFCC-TL method has a training loss of 2.98% and a validation loss of 1.15%; the BLSTM-MFCC-TL method has a training loss of 1.19% and a validation loss of 0.42%; and the BLSTM-MFCCAE-TL method has a training loss of 0.75% and a validation loss of 0.49%. For each method, the training and validation losses differ little, and the loss values are low; thus, all three proposed methods are trained effectively. We then compare the accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods, as shown in Figure 15 and Table 6.
Figure 15 shows the training and validation accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods over the first 20 epochs. The accuracy of each method changes markedly in the second epoch and then increases steadily. The results at the 20th epoch are listed in Table 6.
In this speaker recognition experiment, accuracy is defined as the ratio of correctly recognized speakers to all prediction results. The validation accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods are 89.07%, 91.18%, and 93.08%, respectively. Comparing the LSTM-MFCC-TL and BLSTM-MFCC-TL methods shows that using BLSTM as the main model effectively improves the validation accuracy by 2.11%. Comparing the BLSTM-MFCC-TL and BLSTM-MFCCAE-TL methods, which use the same BLSTM model and MFCC features, shows that adding the AE features as extra input improves the validation accuracy by a further 1.9%, from 91.18% to 93.08%. The GMM-HMM-MFCCAE method we constructed achieves only 64.48% validation accuracy for speaker recognition, the worst performance in Table 6. This is because its architecture was designed in [16] for word classification, where it achieved a word classification accuracy of 92.81%. All four methods in this experiment (LSTM-MFCC-TL, BLSTM-MFCC-TL, BLSTM-MFCCAE-TL, and GMM-HMM-MFCCAE) are applied here to speaker recognition. Although the GMM-HMM-MFCCAE method was originally designed for text recognition, we adopted its AE feature architecture and applied it to speaker recognition to confirm that our proposed BLSTM-MFCCAE-TL method makes better use of the AE features and achieves better speaker recognition accuracy.
Table 7 lists the computation times of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods and the GMM-HMM-MFCCAE method. The computation times are compared to assess the operational efficiency of actual prediction.
From the perspective of the first computation time and average computation time, it can be observed that the three methods proposed in this paper are faster than the traditional GMM-HMM-MFCCAE method. This is because the GMM-HMM model used in the GMM-HMM-MFCCAE method requires higher computation time, along with the computation time for processing the AE features. Therefore, the computation time of our proposed methods outperforms that of the GMM-HMM-MFCCAE method. Among the proposed methods, the proposed BLSTM-MFCCAE-TL method achieves the best performance in terms of learning effectiveness and accuracy. However, due to the additional time required for processing the AE features, the BLSTM-MFCCAE-TL method has a higher computation time compared to both the BLSTM-MFCC-TL method and the LSTM-MFCC-TL method. Nevertheless, the computation time of the BLSTM-MFCCAE-TL method is still superior to the computation time of the GMM-HMM-MFCCAE method.
Finally, we performed additional training epochs with the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods. The speaker recognition accuracies at 20, 40, 60, and 80 epochs are listed in Table 8.
We observe that the speaker recognition accuracy of the BLSTM-MFCCAE-TL method, which uses the extra AE features, increases significantly from epoch 20 to epoch 40, whereas the LSTM-MFCC-TL and BLSTM-MFCC-TL methods, which do not add AE features, show only a slight increase. From epoch 40 to epoch 60, the accuracies of the LSTM-MFCC-TL and BLSTM-MFCC-TL methods improve significantly, while the BLSTM-MFCCAE-TL method shows only a small increase; because the BLSTM-MFCCAE-TL method has more input features, its accuracy improved substantially in the earlier epochs, leaving less room for improvement in subsequent training. Finally, from epoch 60 to epoch 80, none of the methods, with or without AE features, shows an evident increase. The BLSTM-MFCCAE-TL method reaches a validation accuracy of 95.03% after 80 epochs of training, the highest validation accuracy in this experiment.