1. Introduction
Speaker recognition is a branch of biometric technology that is widely used in security applications such as access control, employee clock-in systems, bank vaults, robots, and smart living. Speaker recognition technologies can be divided into text-dependent [1] and text-independent [2] approaches. In text-dependent speaker recognition, the speaker must utter a sentence with specific content; for example, ref. [1] applied a wake-up word model to speaker recognition. Text-independent speaker recognition places no restriction on sentence content, and recognition focuses on confirming the speaker’s identity; for example, ref. [2] proposed a text-independent speaker verification method for noisy environments by fusing different features. Text-independent forensic speaker recognition was studied in [3]. In forensic speaker recognition, a suspect’s identity must be established from recorded voice evidence [4,5]. Text-dependent speaker recognition is easier to train but more limited, because the same sentence must be spoken during recognition. Text-independent speaker recognition is more challenging because it usually requires more data to achieve good results, but it is more flexible in application. Our research aims to construct a text-independent speaker recognition method.
We discuss two main aspects of the speaker recognition process: feature extraction and speaker modeling. In previous studies, many techniques have been developed for extracting features from sound signals. For example, linear frequency cepstral coefficients (LFCCs) [2,6] and mel-frequency cepstral coefficients (MFCCs) [2,6,7] use filter-bank coefficients for feature extraction. The differences between LFCC and MFCC are described in [2]. LFCC treats all frequencies of the sound signal as equally important, whereas MFCC is designed to reflect how the human ear perceives different frequencies. Because the human ear is more sensitive to changes in low-frequency sounds, MFCC emphasizes the low-frequency components of the signal. The MFCC computation is described in detail in [8]. Kinnunen and Li [9] surveyed text-independent speaker recognition techniques and discussed and recommended feature choices, including MFCC features. Sahidullah and Saha [10] proposed a new windowing technique for MFCC computation and compared it with earlier windowing techniques in terms of speaker recognition accuracy. In [7], dynamic time warping (DTW) and features extracted using MFCC were used for speaker recognition. LFCC and MFCC have both been used to classify insect songs [6]; both performed well, but because most insect sounds are high-frequency signals, LFCC features gave better results than MFCC features in that task. Spoken language identification has used MFCC features together with gammatone cepstral coefficient features [11]. Because our goal is speaker recognition for the human voice, we use MFCC for feature extraction, as it is better suited to human speech.
Many models have recently been applied in the field of voice recognition; their primary purpose is to produce a recognition output after processing the input data. Hidden Markov models (HMMs) were first applied to speaker-independent word recognition in [12]. HMMs use multiple states to model hidden, unknown parameters and are predominantly used for word recognition. Ref. [13] used a Gaussian mixture model (GMM) to decompose the input signal into multiple Gaussian components that represent the spectral shape of the speaker. Because a GMM models the input signal with multiple Gaussian components, it is primarily used to construct and recognize identity models. Ref. [14] used a Gaussian mixture model–universal background model (GMM-UBM) to develop a language identification (LID) system and improved GMM performance through the universal background model (UBM). The GMM-UBM has the disadvantage of requiring a large amount of data to construct a model: the UBM builds a general background model from data of different categories, and the data of each category are then used to adapt the UBM parameters. The recent boom in neural network (NN) technology has brought significant progress in artificial intelligence (AI), with many applications in voice recognition. The NN architecture [15] consists of input, hidden, and output layers. The input layer defines the size of the input data. The hidden layers form the core of the architecture and are usually designed according to the task requirements. The output layer produces the output data, and its form varies with the requirements; for example, a classifier outputs the probability of each category, whereas the speaker recognition methods in this study output a set of feature vectors through a neural network. A deep denoising autoencoder composed of an NN was used to generate denoised feature vectors, and a GMM-HMM was used for word recognition [16]. Training a single deep neural network (DNN) for both speaker and language recognition was proposed in [17]. A method for Chinese text-to-speech (TTS) using a recurrent neural network (RNN) was proposed in [18]. The differences between RNNs and general NNs are described in [19]: each hidden layer of an RNN receives feedback from the previous time step, so the RNN retains a memory of earlier inputs and can process dynamic information. However, as the length of the time series increases, the memory retained from early time steps gradually vanishes. RNNs are therefore unsuitable for long-term memory, and long short-term memory (LSTM) overcomes this problem [20]. LSTM and RNN are introduced in detail in [21], and [22] describes how LSTM, a variant of the RNN, alleviates the vanishing gradient problem of RNNs. LSTM adds three gates to control long-term memory: the input, forget, and output gates [23]. The input gate determines which information enters the memory path, the forget gate determines which information is deleted from the memory path, and the output gate determines which information is output [24]. Many voice recognition applications use LSTM. For example, ref. [25] used an attention-based LSTM for speech emotion classification, and ref. [26] studied how nonverbal vocal fragments can improve the accuracy of speech emotion recognition, using LSTM as the emotion recognition model. Ref. [27] proposed the bidirectional LSTM (BLSTM), which is composed of two LSTMs operating in opposite directions; it therefore has forward and backward memory paths, and the number of parameters is twice that of an LSTM. BLSTM has several applications in voice recognition, including speech gender classification [28], speech emotion recognition [29], and native language identification from brief speech utterances [30].
This paper proposes three new speaker recognition methods and a new cluster training method. The proposed speaker recognition methods are (1) the long short-term memory with mel-frequency cepstral coefficients for triplet loss (LSTM-MFCC-TL) method; (2) the bidirectional long short-term memory with mel-frequency cepstral coefficients for triplet loss (BLSTM-MFCC-TL) method; and (3) the bidirectional long short-term memory with mel-frequency cepstral coefficients and autoencoder features for triplet loss (BLSTM-MFCCAE-TL) method. In the LSTM-MFCC-TL method, we use MFCC to extract sound features and feed them into an LSTM for training. The LSTM outputs a set of feature vectors representing the speaker, and the loss function used in training is the triplet loss. Triplet loss originates from the FaceNet [31] architecture published by Google for face recognition systems. We use triplet loss as the loss function because, during training, it shortens the distance between samples from the same speaker and increases the distance between samples from different speakers. All three methods are trained with the cluster training method, which divides the training data into small groups before training and trains all groups in sequence. The second method, BLSTM-MFCC-TL, uses a BLSTM model, giving the network a bidirectional memory path so that it can obtain features that better represent the speaker from both the forward and reverse directions of the time series. The third method, BLSTM-MFCCAE-TL, generates a set of autoencoder (AE) features with the pretrained AE of [16] and combines them with the MFCC features as input to model training. Because [16] targets denoising word recognition whereas our work targets speaker recognition, the training objectives and datasets differ and the two cannot be compared directly. We therefore follow the model architecture of [16] to construct a baseline speaker recognition method, the Gaussian mixture model and hidden Markov model with mel-frequency cepstral coefficients and autoencoder features (GMM-HMM-MFCCAE) method. The main contributions of this study are as follows:
The new cluster training method proposed in this study solves the problem of the hardware being unable to load all the training data at once.
We confirm that using BLSTM yields more useful speaker features than using LSTM for speaker recognition.
Extracting a set of AE features through a pretrained autoencoder, combining them with MFCC features, and inputting them into the model for training can effectively improve speaker recognition accuracy.
Section 2 describes the three speaker recognition methods proposed in this paper in detail.
Section 3 discusses the experimental methods and results of each method proposed in this study.
Section 4 presents the conclusions of this paper.
2. Proposed Methods for Speaker Recognition
In earlier sound-processing approaches, the sheer number of raw signal samples imposed a considerable burden on computers, and it is difficult to find useful characteristics in such signals because they contain noise and irrelevant components; extensive audio processing is therefore required. Many feature extraction techniques have been developed that have made significant contributions to audio signal processing, such as MFCC [2]. This paper studies text-independent [2] speaker recognition, which does not need to retain the original sound signal; therefore, MFCC is used to extract the features of the sound sequence. This study uses neural network [15] techniques to construct the speaker identification model. First, the results of MFCC feature extraction are used as inputs to the model. After passing through the neural network model, a set of feature vectors representing the speaker’s identity is obtained, and similarity distance calculations are then used to determine the accuracy of the identification results. This paper presents three speaker recognition methods in total. In terms of deep learning models, the LSTM-MFCC-TL and BLSTM-MFCC-TL methods employ the LSTM and BLSTM models, respectively. The BLSTM-MFCCAE-TL method also utilizes the BLSTM model and employs a pre-trained autoencoder to generate AE features: we encode the speakers’ speech and concatenate the resulting AE features with the original MFCC features as input for training the BLSTM model. These autoencoder-encoded features enhance learning effectiveness and provide better speaker recognition accuracy.
The main contribution of this paper centers on the proposed BLSTM-MFCCAE-TL method. In this method, we employ an autoencoder to encode the speaker information in the speech signal and concatenate the resulting AE features with the MFCC features as input to the deep learning model. Because the AE features encode the speaker, incorporating them effectively enhances the learning performance of the model. The following subsections discuss the three proposed methods in detail.
2.1. Long Short-Term Memory with Mel-Frequency Cepstral Coefficients for Triplet Loss (LSTM-MFCC-TL) Method
The first method in this paper is the LSTM-MFCC-TL method. The speaker recognition process of the proposed LSTM-MFCC-TL method is illustrated in Figure 1. We divide the LSTM-MFCC-TL method into five steps. The first step performs MFCC feature extraction on the input audio signal. The second step performs feature preprocessing on sound feature sequences of different lengths. The third step groups all input features for subsequent training using the cluster training method. The fourth step constructs the model using LSTM [21]. In the fifth step, the loss function used during training is the triplet loss [31].
2.1.1. MFCC Feature Extraction
It is challenging to observe characteristics directly in the original sound signal. Therefore, in sound signal processing, the original signal is usually converted into a spectral representation for feature observation and processing. MFCCs are a well-known feature extraction method in traditional sound recognition and remain a popular feature in speech and audio processing [9]. MFCCs were adopted for speech recognition in the early 1980s and later introduced for speaker recognition. In our three proposed methods, we use MFCCs as the input feature because their design follows human auditory perception, which allows them to effectively capture the features that matter for speaker recognition.
The MFCC feature extraction process is shown in Figure 2. The sequential steps of MFCC feature extraction are pre-emphasis, framing, Hamming windowing, fast Fourier transform (FFT), mel filter bank processing, and discrete cosine transform (DCT).
First, pre-emphasis boosts the high-frequency part of the input sound signal to compensate for the high-frequency attenuation caused by the radiation of the vocal cords and lips. Pre-emphasis is calculated as follows:

$$y[n] = x[n] - \alpha \, x[n-1],$$

where $x[n]$ is the $n$-th input sample, $\alpha$ is the filter coefficient, and $y[n]$ is the $n$-th output after pre-emphasis. The filter coefficient can be set between 0.9 and 1 and is usually set to 0.95. After pre-emphasis, framing divides the signal into shorter time frames, each 20–40 ms long. After framing, each frame is passed through a window function; the Hamming window is used here. We use $\tilde{x}_i(n)$ to represent the output of the $i$-th frame after passing through the Hamming window, calculated as follows:

$$\tilde{x}_i(n) = x_i(n)\, w(n),$$

where $x_i(n)$ is the $n$-th sample of the $i$-th input frame and $w(n)$ is the Hamming window applied to each frame. Furthermore, $w(n)$ is calculated as follows:

$$w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N_w - 1}\right), \quad 0 \le n \le N_w - 1,$$

where $N_w$ denotes the number of samples in each frame. An FFT is then performed on the windowed signal to convert it from the time domain to the frequency domain; the FFT is a fast algorithm for computing the discrete Fourier transform (DFT). The output of the FFT is converted to the mel-frequency domain through mel filter bank processing. We define $\mathrm{mel}(f)$ as the mel frequency corresponding to the physical frequency $f$, calculated as follows:

$$\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$

After the mel filter bank outputs are obtained, the DCT is used to compute 12 MFCCs for each frame. In addition to these 12 coefficients, the energy of each frame obtained after the framing step is included as one of the MFCC features, so each frame yields 13 MFCC features.
Next, we use the original 13 MFCC features to compute the delta cepstrum, resulting in 13 delta features that represent the temporal changes of the cepstral parameters. The delta cepstrum is calculated as follows:

$$\Delta c_m(t) = \frac{\sum_{k=1}^{K} k \left[ c_m(t+k) - c_m(t-k) \right]}{2 \sum_{k=1}^{K} k^2},$$

where $c_m(t)$ is the $m$-th original MFCC feature at time $t$, $\Delta c_m(t)$ is the corresponding delta feature, and $K$ is the number of differential observations before and after time $t$. Applying this computation twice yields the delta and delta-delta coefficients, two sets of 13 features, which, together with the original 13 MFCC coefficients, give 39 MFCC features. Therefore, each audio frame yields an MFCC feature vector of length 39 after MFCC feature extraction. However, because the lengths of the audio signals differ, the number of audio frames also differs.
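To make the 39-dimensional feature layout concrete, the following sketch shows one possible way to compute 13 MFCCs plus delta and delta-delta coefficients with librosa. The file name and the frame parameters (n_fft, hop_length) are illustrative assumptions rather than values taken from the paper; only the 16 kHz sampling rate and the 0.95 pre-emphasis coefficient come from the text.

```python
# Illustrative sketch (not the authors' code): 13 MFCCs + delta + delta-delta = 39 features.
import librosa
import numpy as np

signal, sr = librosa.load("speaker_utterance.wav", sr=16000)  # assumed file name

# Pre-emphasis with filter coefficient 0.95, as described above.
emphasized = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

# 13 static coefficients; frame length and hop are assumed (32 ms / 10 ms at 16 kHz).
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=512, hop_length=160, window="hamming")
delta1 = librosa.feature.delta(mfcc)           # first-order (delta) coefficients
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (delta-delta) coefficients

features = np.concatenate([mfcc, delta1, delta2], axis=0).T  # shape (frames, 39)
print(features.shape)
```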
2.1.2. Feature Preprocessing
Because the lengths of the original audio recordings differ, the number of frames differs, and so does the number of frames along the vertical axis of the MFCC feature vector. However, the input size must be identical when training a neural network model. Therefore, we normalize the number of frames of the MFCC features so that all inputs have the same size. For example, two MFCC feature vectors with different numbers of frames both become feature vectors of size $(N, 39)$ after feature preprocessing, where $N$ is the set frame quantity.
Consider an MFCC feature vector with $n$ frames as an example. If $n$ is less than $N$, zero frames must be appended: the original MFCC feature vector is merged with a zero vector of size $(N - n, 39)$ to obtain an MFCC feature vector of size $(N, 39)$. If $n$ is greater than $N$, the original MFCC feature vector is divided into $C = \lfloor n / N \rfloor$ feature vectors of size $(N, 39)$; the remaining frames are then zero-padded as above, giving $C + 1$ MFCC feature vectors of size $(N, 39)$. In this study, the number of frames $N$ is fixed in the experiments, so after this step the size of all features is unified to $(N, 39)$.
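A minimal sketch of this padding-and-splitting step is shown below; the helper name preprocess_frames and the NumPy array layout of (frames, 39) are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch: unify the number of frames of an MFCC feature matrix to a
# fixed N by zero-padding short utterances and splitting long ones into segments.
import numpy as np

def preprocess_frames(mfcc, n_frames):
    """mfcc: array of shape (frames, 39); returns a list of (n_frames, 39) arrays."""
    segments = []
    full, remainder = divmod(len(mfcc), n_frames)
    for c in range(full):                          # C complete segments of N frames
        segments.append(mfcc[c * n_frames:(c + 1) * n_frames])
    if remainder > 0 or full == 0:                 # zero-pad the leftover frames
        tail = mfcc[full * n_frames:]
        pad = np.zeros((n_frames - len(tail), mfcc.shape[1]))
        segments.append(np.vstack([tail, pad]))
    return segments
```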
2.1.3. Cluster Training Method
During the training of the three methods proposed in this paper, limitations of the hardware equipment force training to be carried out on small groups of data. The architecture of the cluster training method is shown in Figure 3, using the grouping adopted in this study as a demonstration. The dataset used in this study has 400 speaker categories, which we split into 40 groups of 10 categories each. All groups are then trained sequentially, and their loss and accuracy are recorded. Once all 40 groups have been trained, the loss values and accuracies of the groups are averaged to obtain the loss value and accuracy of that training epoch.
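The cluster training loop can be sketched as follows; the function names (cluster_training_epoch, make_groups, train_on_group) are assumptions for illustration only, not the authors' code.

```python
# Illustrative sketch of the cluster training idea: 400 speaker classes split into
# 40 groups of 10, trained in sequence, with per-group loss/accuracy averaged per epoch.
import numpy as np

def make_groups(class_ids, group_size=10):
    # e.g. 400 speaker IDs -> 40 groups of 10 classes each
    return [class_ids[i:i + group_size] for i in range((0), len(class_ids), group_size)]

def cluster_training_epoch(model, grouped_data, train_on_group):
    losses, accs = [], []
    for group in grouped_data:              # train each small group in sequence
        loss, acc = train_on_group(model, group)
        losses.append(loss)
        accs.append(acc)
    # epoch-level loss/accuracy = average over the 40 groups
    return float(np.mean(losses)), float(np.mean(accs))
```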
2.1.4. LSTM Model
RNNs use their feedback connections to store representations of recent input events [22]. However, with traditional backpropagation through time or real-time recurrent learning, gradients tend to explode or vanish as error signals flow backward in time. To improve error backflow, a novel recurrent architecture with an appropriate gradient-based learning algorithm, the LSTM, was proposed [22]. Even with noisy input sequences, LSTMs can learn to bridge time intervals exceeding 1000 steps without losing short-time-lag capabilities, because the architecture enforces constant error flow through the internal states of special units, so the error neither explodes nor vanishes. Therefore, LSTM-based models process time series data better than plain RNN-based models. The LSTM model architecture in the LSTM-MFCC-TL method is shown in Figure 4, and the relevant model parameters are listed in Table 1. A feature vector of size $(N, 39)$ obtained from feature preprocessing is used as the input, the LSTM is the central part of the model, and a dropout layer is used to prevent the model from overfitting. A flatten layer then flattens the two-dimensional output of the LSTM and dropout layers into a one-dimensional vector. Finally, a dense layer with 128 neurons forms the output of the LSTM model, producing a feature vector of size $(1, 128)$.
In neural network terminology, the network is divided into input, hidden, and output layers. In the input layer, we take the preprocessed MFCC feature vector of size $(N, 39)$ as input, where $N$ is the fixed number of frames set during preprocessing. For the hidden layers, we use an LSTM layer with the number of units listed in Table 1, followed by a dropout layer that drops units at the rate given in Table 1 to prevent the model from overfitting.
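A minimal Keras sketch of this LSTM branch is shown below. The frame count, LSTM unit count, and dropout rate are placeholders standing in for the values in Table 1; only the 39-dimensional input and the 128-dimensional output embedding come from the text.

```python
# A minimal sketch (not the authors' code) of the LSTM -> Dropout -> Flatten -> Dense branch.
import tensorflow as tf

N_FRAMES, N_FEATURES = 100, 39      # frame count is an assumed placeholder; 39 MFCC features
EMBEDDING_DIM = 128                 # output speaker embedding of size (1, 128)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True),   # assumed hidden size
    tf.keras.layers.Dropout(0.5),                        # assumed dropout rate
    tf.keras.layers.Flatten(),                           # flatten the 2D recurrent output
    tf.keras.layers.Dense(EMBEDDING_DIM),                # speaker feature vector
])
model.summary()
```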
Figure 5 shows a schematic of the LSTM architecture, highlighting that the LSTM model maintains the memory selection of each time step in the LSTM path, so that the LSTM retains the characteristics of long-term memory.
In Figure 5, each cell represents the internal structure of the LSTM. When the network moves to the next time step, part of the memory selected at the current time step is retained in the path. This is possible because the LSTM has four gates: an input information gate ($g$), a forget gate ($f$), an input gate ($i$), and an output gate ($o$). The architecture of the LSTM model is shown in Figure 6. The computation of each gate in Figure 6 can be written in the general form

$$\mathrm{gate}_t = \mathrm{activation}\left(W x_t + U h_{t-1} + b\right),$$

where $\mathrm{gate}_t$ is the output of the gate at the $t$-th time step and $\mathrm{activation}$ is the activation function of the gate; $W x_t$ combines the information $x_t$ of the $t$-th time step with its weight matrix $W$ for each gate; $U h_{t-1}$ combines the prediction $h_{t-1}$ of the $(t-1)$-th time step with its weight matrix $U$ for each gate; and $b$ is the bias of each gate. The equations of the entire LSTM model are then organized as follows:

$$g_t = \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right),$$
$$i_t = \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right),$$
$$f_t = \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right),$$
$$o_t = \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right),$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t,$$
$$h_t = o_t \odot \tanh\left(c_t\right),$$

where $g_t$ is the input information gate, $f_t$ is the forget gate, $i_t$ is the input gate, $o_t$ is the output gate, $c_t$ is the memory cell that stores the state of the $t$-th time step, and $h_t$ is the predicted value. Both $\sigma$ and $\tanh$ are activation functions, where $\sigma$ is the sigmoid function.
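The gate equations above can be illustrated with a single NumPy time step. The parameter dictionaries W, U, and b are placeholders, and the code is a sketch of the standard LSTM update rather than the exact implementation used in the paper.

```python
# A NumPy sketch of one LSTM time step implementing the gate equations above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b are dicts holding the parameters of the g, i, f, o gates (assumed layout)."""
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # input information gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    c_t = f * c_prev + i * g                               # memory cell update
    h_t = o * np.tanh(c_t)                                 # hidden state / prediction
    return h_t, c_t
```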
2.1.5. Loss Function
During model training, we choose the triplet loss as the loss function. Triplet loss randomly selects a sample from all training samples as the anchor, then randomly selects a sample of the same category as the anchor, called the positive, and finally randomly selects a sample of a different category from the anchor, called the negative. These three samples form the triplet in triplet loss. For the anchor, positive, and negative, the triplet loss can be written as

$$L = \max\left(d(a, p) - d(a, n) + M,\; 0\right),$$

where the anchor, positive, and negative are denoted by $a$, $p$, and $n$, respectively; $M$ is the margin, an adjustable parameter greater than 0; and $d$ is the distance function, for which the Euclidean distance [32,33] is used here. We define $d(x, y)$ as the Euclidean distance between $x = (x_1, x_2, \ldots, x_k)$ and $y = (y_1, y_2, \ldots, y_k)$, calculated as follows:

$$d(x, y) = \sqrt{\sum_{j=1}^{k} \left(x_j - y_j\right)^2}.$$
After calculating the Euclidean distances from the anchor to the positive and to the negative, the triplets can be divided into the following three types. The first case, $d(a, p) + M < d(a, n)$, is called an easy triplet; the anchor and positive are close, which is the best case. The second case, $d(a, p) < d(a, n) < d(a, p) + M$, is called a semi-hard triplet; although the anchor is close to the positive, it is also close to the negative, because the difference between the two distances lies within the margin. The third case, $d(a, n) < d(a, p)$, is called a hard triplet; the distance between the anchor and the positive is large, which is the worst case.
Figure 7 shows the difference before and after the triplet-loss operation. During training, triplet loss shortens the distance between the anchor and the positive and extends the distance between the anchor and the negative, so that after training the anchor and the positive have become close and the distance between the anchor and the negative has increased significantly, as shown in Figure 7.
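The loss and the three triplet types can be expressed directly in code; the following NumPy sketch uses an assumed margin of 0.2 for illustration and is not the authors' implementation.

```python
# A NumPy sketch of the triplet loss and the easy / semi-hard / hard cases above.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

def triplet_type(anchor, positive, negative, margin=0.2):
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    if d_ap + margin < d_an:
        return "easy"       # loss is zero: negative is well outside the margin
    if d_ap < d_an:
        return "semi-hard"  # negative lies inside the margin
    return "hard"           # negative is closer to the anchor than the positive
```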
2.2. Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients for Triplet Loss (BLSTM-MFCC-TL) Method
Based on the LSTM-MFCC-TL method, we propose the BLSTM-MFCC-TL method, which replaces the LSTM with a BLSTM. A BLSTM is composed of a two-layer, two-direction LSTM that obtains bidirectional information, expanding the model’s receptive field. The speaker recognition process of the proposed BLSTM-MFCC-TL method is shown in Figure 8. We divide the BLSTM-MFCC-TL method into five steps; only the fourth step differs from the LSTM-MFCC-TL method, in that BLSTM [27] is used as the training model.
The BLSTM model used in the fourth step of the BLSTM-MFCC-TL method is shown in Figure 9, and its parameter settings are listed in Table 2. Owing to the bidirectional architecture, the size of the recurrent layer increases to 256, twice that of the LSTM architecture. Although the BLSTM architecture increases the number of parameters, it helps improve accuracy. The output size of the fully connected layer is the same as in the LSTM-MFCC-TL method; therefore, at the end of the fourth step, the BLSTM model outputs a feature vector of size $(1, 128)$.
The BLSTM [27] used in the BLSTM model is a bidirectional application of the LSTM architecture. Figure 10 shows a schematic of the BLSTM architecture: the computation is split into a forward LSTM and a reverse LSTM, which together form the bidirectional two-layer LSTM.
In Figure 10, the input x-sequence of the BLSTM is passed to both the forward LSTM and the reverse LSTM, and the predicted values of the two LSTMs are combined as the output y-sequence. With a unidirectional LSTM architecture, information can only be accumulated in the forward direction. The advantage of BLSTM is that it obtains information from both directions, expanding the receptive field of the model; therefore, BLSTM can achieve better performance than a standard LSTM.
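As a sketch, wrapping the LSTM layer of the earlier model in a Keras Bidirectional wrapper yields the BLSTM variant; the frame count, unit count, and dropout rate remain illustrative placeholders rather than the paper's exact settings.

```python
# A minimal sketch of the BLSTM variant: the Bidirectional wrapper doubles the
# recurrent output dimension relative to the single-direction LSTM.
import tensorflow as tf

N_FRAMES, N_FEATURES, EMBEDDING_DIM = 100, 39, 128   # frame count is an assumption

blstm_model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_FRAMES, N_FEATURES)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dropout(0.5),                    # assumed dropout rate
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(EMBEDDING_DIM),            # speaker feature vector (1, 128)
])
```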
2.3. Bidirectional Long Short-Term Memory with Mel-Frequency Cepstral Coefficients and Autoencoder Features for Triplet Loss (BLSTM-MFCCAE-TL) Method
A pre-trained autoencoder was employed to encode speakers’ voices and achieved good results in word recognition [16]. Motivated by this success, we expect the same autoencoder architecture to also improve accuracy in speaker recognition. We use a pre-trained autoencoder to encode MFCC features into AE features that encode the speaker; therefore, in our proposed BLSTM-MFCCAE-TL method, we achieve better speaker recognition accuracy by combining MFCC features with AE features. We also follow the model architecture of [16] to construct the GMM-HMM-MFCCAE method for speaker recognition; because the architecture in [16] was designed for denoising word recognition, its speaker recognition accuracy is only 64.48%, which is low.
The speaker recognition process of the proposed BLSTM-MFCCAE-TL method is shown in Figure 11. We divide the BLSTM-MFCCAE-TL method into six steps: MFCC feature extraction, feature preprocessing, the AE model, cluster training, the BLSTM model, and the loss function. In the third step, we follow [16] to construct a pretrained AE model, combine the AE features output by the AE model with the MFCC features, and feed them to the BLSTM model trained in the fifth step. The third and fifth steps are described in detail below.
2.3.1. AE Model
The AE architecture used in the third step of the proposed BLSTM-MFCCAE-TL method is shown in Figure 12. An AE is composed of a neural network whose architecture is divided into an encoder and a decoder. The encoder encodes the input data into a set of vectors, and the decoder takes the encoder output as input and decodes it. According to [16], additional features obtained from a pretrained AE can improve recognition accuracy. In the AE of the third step, the encoder takes the MFCC features as input, passes them through a layer of 2048 neurons, and outputs 1024 neurons; the decoder then decodes the 1024 neurons and outputs a reconstruction of the same size as the input. In the pretraining stage, the input and target of the AE are two audio features of the same speaker, so the AE is trained on text-unrelated sound features to find a feature vector that represents the speaker. After pretraining, we take the first 1014 values of the encoder output layer and reshape them into a two-dimensional feature vector, which we call the AE feature. The MFCC features of size $(N, 39)$ are then combined with the AE features, and the combined feature vector is used as the input of the BLSTM model in the next step.
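A hedged sketch of this encoder-decoder structure is given below; the flattened input size and the ReLU activations are assumptions, while the 2048-unit hidden layer and the 1024-dimensional code follow the description above.

```python
# A minimal sketch of the pretrained autoencoder: 2048-unit encoder layer,
# 1024-dimensional code, and a decoder reconstructing the flattened MFCC input.
import tensorflow as tf

INPUT_DIM = 100 * 39                       # assumed size of the flattened MFCC input

inputs = tf.keras.Input(shape=(INPUT_DIM,))
hidden = tf.keras.layers.Dense(2048, activation="relu")(inputs)   # encoder hidden layer
code = tf.keras.layers.Dense(1024, activation="relu")(hidden)     # 1024-dimensional code
decoded = tf.keras.layers.Dense(INPUT_DIM)(code)                  # decoder reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, code)     # used later to produce the AE features
autoencoder.compile(optimizer="adam", loss="mse")
# After pretraining, the first 1014 values of the code would be reshaped into the
# two-dimensional AE feature described in the text.
```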
2.3.2. BLSTM Model
After the proposed BLSTM-MFCCAE-TL method obtains the AE features from the AE model, the AE and MFCC features are combined into a single feature vector, and this feature vector is input to the BLSTM model for training. The BLSTM model architecture used in the fifth step of the BLSTM-MFCCAE-TL method is shown in Figure 13, and its parameters are listed in Table 3. Because the BLSTM-MFCCAE-TL method adds the AE features, the feature size in the input layer increases from $(N, 39)$ in the BLSTM-MFCC-TL method to the larger combined size listed in Table 3.
3. Experiments and Analysis
To investigate the performance of the methods in this paper, we conducted speaker recognition experiments on the AISHELL-1 dataset [34]. AISHELL-1 is a public Mandarin speech dataset recorded with a high-fidelity microphone (44.1 kHz, 16-bit) and down-sampled to 16 kHz; a total of 400 speakers from different accent regions of China participated in the recording. In our experiments, we use the entire dataset of 400 speakers, divided into a training set and a validation set in proportions of 90% and 10%, respectively.
Table 4 lists the experimental parameters of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods. The main models in Table 4 are the LSTM and BLSTM architectures of the three methods; the input feature shape is the feature size fed into the main model, and the output feature shape is the feature size produced by the main model. Table 4 shows that the LSTM-MFCC-TL and BLSTM-MFCC-TL methods both take MFCC feature vectors of size $(N, 39)$ as the input of the main model, whereas the main model of the BLSTM-MFCCAE-TL method takes the larger combination of MFCC and AE features as input. The main model of the LSTM-MFCC-TL method is the LSTM architecture, and the main model of the BLSTM-MFCC-TL and BLSTM-MFCCAE-TL methods is the BLSTM architecture. The learning rate used during training is 0.00001, the optimizer is Adam, the batch size is 64, and the loss function is the triplet loss. The number of training epochs is 20, and the output feature vector has size (1, 128).
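The training configuration in Table 4 could be wired up roughly as follows. The triplet-loss implementation (TensorFlow Addons' semi-hard variant) and the data variables are assumptions for illustration, whereas the Adam optimizer, the 0.00001 learning rate, the batch size of 64, and the 20 epochs come from the text.

```python
# A hedged sketch of the training setup; `model` is the embedding network sketched
# earlier, and the data variables are hypothetical placeholders for the preprocessed
# features and integer speaker labels.
import tensorflow as tf
import tensorflow_addons as tfa   # provides a ready-made semi-hard triplet loss

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # 0.00001, as in Table 4
    loss=tfa.losses.TripletSemiHardLoss(),                   # one possible triplet-loss choice
)
history = model.fit(
    train_features, train_speaker_ids,
    validation_data=(val_features, val_speaker_ids),
    batch_size=64,   # as in Table 4
    epochs=20,       # as in Table 4
)
```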
Next, each model was trained with the experimental parameters and methods described above. During training, the loss value and accuracy of each epoch were recorded to observe the training behavior of each model from the numerical changes. First, we compare the loss values of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods, as shown in Figure 14 and Table 5.
The six curves in Figure 14 show the training and validation loss values of the LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods over the epochs. In the second epoch, the loss value of each method drops significantly. Subsequently, the gap between the training and validation losses of each method narrows, and the training loss gradually stabilizes. Table 5 compares the loss values of the three proposed methods at the 20th epoch, in particular to check whether the gap between training and validation losses indicates overfitting. At the 20th epoch, the LSTM-MFCC-TL method has a training loss of 2.98% and a validation loss of 1.15%; the BLSTM-MFCC-TL method has a training loss of 1.19% and a validation loss of 0.42%; and the BLSTM-MFCCAE-TL method has a training loss of 0.75% and a validation loss of 0.49%. For each method, the training and validation losses differ little, and the loss values are low; thus, all three proposed methods are trained effectively. We then compare the accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods, as shown in Figure 15 and Table 6.
Figure 15 shows the training and validation accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods over the first 20 epochs. The accuracy of each method changes markedly in the second epoch and then increases steadily. The results at the 20th epoch are listed in Table 6.
In this speaker recognition experiment, accuracy is defined as the ratio of correctly recognized speakers to all prediction results. The validation accuracies of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods are 89.07%, 91.18%, and 93.08%, respectively. Comparing the LSTM-MFCC-TL and BLSTM-MFCC-TL methods shows that using BLSTM as the main model effectively improves the validation accuracy by 2.11%. Comparing the BLSTM-MFCC-TL and BLSTM-MFCCAE-TL methods, which use the same BLSTM model and MFCC features, shows that adding the AE features as extra input improves the validation accuracy by a further 1.9%, from 91.18% to 93.08%. The GMM-HMM-MFCCAE method we constructed achieves only 64.48% validation accuracy for speaker recognition, the worst performance in Table 6. This is because its architecture was designed in [16] for word classification, where it achieved a word classification accuracy of 92.81%. All four methods in this experiment (LSTM-MFCC-TL, BLSTM-MFCC-TL, BLSTM-MFCCAE-TL, and GMM-HMM-MFCCAE) are applied here to speaker recognition. Although the GMM-HMM-MFCCAE method was originally designed for text recognition, we adopted its AE feature architecture and applied it to speaker recognition to confirm that our proposed BLSTM-MFCCAE-TL method makes better use of the AE features and achieves better speaker recognition accuracy.
Table 7 lists the computation times of the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods and the GMM-HMM-MFCCAE method. The computation times are compared to assess the operational efficiency of actual prediction.
From the perspective of the first computation time and average computation time, it can be observed that the three methods proposed in this paper are faster than the traditional GMM-HMM-MFCCAE method. This is because the GMM-HMM model used in the GMM-HMM-MFCCAE method requires higher computation time, along with the computation time for processing the AE features. Therefore, the computation time of our proposed methods outperforms that of the GMM-HMM-MFCCAE method. Among the proposed methods, the proposed BLSTM-MFCCAE-TL method achieves the best performance in terms of learning effectiveness and accuracy. However, due to the additional time required for processing the AE features, the BLSTM-MFCCAE-TL method has a higher computation time compared to both the BLSTM-MFCC-TL method and the LSTM-MFCC-TL method. Nevertheless, the computation time of the BLSTM-MFCCAE-TL method is still superior to the computation time of the GMM-HMM-MFCCAE method.
Finally, we performed additional training epochs with the proposed LSTM-MFCC-TL, BLSTM-MFCC-TL, and BLSTM-MFCCAE-TL methods. The speaker recognition accuracies at 20, 40, 60, and 80 epochs are listed in Table 8.
We observe that the speaker recognition accuracy of the BLSTM-MFCCAE-TL method, which uses the extra AE features, increases significantly from epoch 20 to epoch 40, whereas the LSTM-MFCC-TL and BLSTM-MFCC-TL methods, which do not add AE features, show only a slight increase. From epoch 40 to epoch 60, the accuracies of the LSTM-MFCC-TL and BLSTM-MFCC-TL methods improve significantly, while the BLSTM-MFCCAE-TL method shows only a small increase; because the BLSTM-MFCCAE-TL method has more input features, its accuracy improved substantially in the earlier epochs, leaving less room for improvement in subsequent training. Finally, from epoch 60 to epoch 80, none of the methods, with or without AE features, shows an evident increase. The BLSTM-MFCCAE-TL method reaches a validation accuracy of 95.03% after 80 epochs of training, the highest validation accuracy in this experiment.