4.2. Experimental Results
After the four types of features are extracted, they are fed into the multilayer perceptron, whose output layer uses a softmax activation function.
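As a minimal sketch of this classification step, the forward pass of an MLP with a softmax output can be written in plain NumPy. The layer sizes, the single hidden layer, and the 2280-dimensional input (matching the MFCC-with-VAD dimensionality reported later) are illustrative assumptions, not the authors' exact architecture:

```python
import numpy as np

def softmax(z):
    # Subtract the row max before exponentiating for numerical stability
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_forward(x, w1, b1, w2, b2):
    # One ReLU hidden layer, then a softmax distribution over speaker classes
    h = np.maximum(0.0, x @ w1 + b1)
    return softmax(h @ w2 + b2)

# Toy dimensions (illustrative): 2280 input features, 64 hidden units, 10 speakers
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 2280))
w1, b1 = 0.01 * rng.standard_normal((2280, 64)), np.zeros(64)
w2, b2 = 0.01 * rng.standard_normal((64, 10)), np.zeros(10)
probs = mlp_forward(x, w1, b1, w2, b2)
print(probs.shape)  # (1, 10); each row sums to ~1
```

The softmax output assigns a probability to each enrolled speaker, so the predicted speaker is simply the argmax of `probs`.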
Table 2 shows the selected hyperparameter settings. The results of the experiments are shown in
Table 3, which shows that the proposed system outperforms both of the baseline systems, with accuracy gains of 3.6% and 5.8%, respectively. The results suggest the best feature set for the proposed system is MFCCs with VAD. We find that, on the application of VAD, the spectrograms and MFCC feature vectors become of variable size. This is because VAD filters noisy frames out of the speech signals, retaining only the speech frames. To handle the variable length of the feature vectors, i.e., to make them fixed length, we perform zero padding using Algorithm 1.
Algorithm 1 Zero padding of the feature vectors
1: Input: Set of feature vectors
2: Output: Zero-padded feature vectors
3: Calculate the size i of the largest vector
4: for each feature vector do
5:  Calculate pad size p as i minus the size of the current vector
6:  Pad p zeros to the vector
7: end for
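The steps of Algorithm 1 can be sketched in a few lines of NumPy; this is a minimal illustration of trailing zero padding to the largest vector's length, not the authors' implementation:

```python
import numpy as np

def zero_pad(vectors):
    # Algorithm 1: pad every feature vector with trailing zeros
    # to the size of the largest vector, yielding fixed-length vectors
    max_len = max(len(v) for v in vectors)      # step 3: size of the largest vector
    padded = []
    for v in vectors:                           # step 4: for each feature vector
        pad = max_len - len(v)                  # step 5: pad size
        padded.append(np.pad(v, (0, pad)))      # step 6: append `pad` zeros
    return np.stack(padded)                     # step 7: stack into a fixed-size matrix

feats = [np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0])]
print(zero_pad(feats))
# [[1. 2. 3.]
#  [4. 5. 0.]]
```

The padded matrix can then be passed to the classifier, since every row now has the same length.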
One of the effects of using VAD before feature calculation is that it reduces the dimensionality of the features (16,281 vs. 11,457 spectrogram features and 3240 vs. 2280 MFCC features, without and with VAD, respectively), which, in turn, decreases the step size, as shown in
Table 4. The model loss during training with different features is presented in
Figure 9. It can be observed that, when using spectrograms, the model converges more slowly than with MFCC features. We also notice that the MLP model converges earlier in terms of training loss than the LSTM and BiLSTM models. The loss declines sharply after just a few epochs, which supports the idea that the MLP model rapidly learns suitable weights for the input data. LSTM and BiLSTM take longer to converge due to their complexity and the need to model the dependencies in sequential input data. Additionally, the figure shows that, with the application of VAD, the noise interference in the MLP model is minimized, resulting in a smoother loss curve.
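The frame-dropping behavior of VAD described above can be illustrated with a simple energy-based detector. This is a hypothetical sketch (the paper does not specify which VAD method is used): frames whose short-term energy falls below a fraction of the maximum frame energy are treated as noise and removed, which is what makes the resulting feature vectors variable in length:

```python
import numpy as np

def energy_vad(signal, frame_len=400, threshold_ratio=0.1):
    # Split the signal into non-overlapping frames and keep only frames whose
    # short-term energy exceeds a fraction of the maximum frame energy
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    keep = energy > threshold_ratio * energy.max()
    return frames[keep]

# Synthetic example: two low-energy "noise" frames followed by two "speech" frames
rng = np.random.default_rng(1)
sig = np.concatenate([0.01 * rng.standard_normal(800),  # quiet noise
                      rng.standard_normal(800)])         # louder speech
voiced = energy_vad(sig)
print(voiced.shape)  # only the high-energy frames survive
```

Because the number of surviving frames varies per utterance, the downstream features must be zero padded as in Algorithm 1 before being fed to the MLP.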
The validation loss shown in
Figure 10 also indicates that MFCC features with VAD outperform both MFCC features without VAD and spectrogram features. The validation loss curves for the MLP, LSTM, and BiLSTM models are displayed. In this experiment, the MLP consistently outperforms both LSTM and BiLSTM across all epochs, maintaining a lower validation loss. This suggests that the MLP is more likely to generalize well to unseen datasets compared to LSTM and BiLSTM, as the latter models exhibit higher fluctuations in validation loss. This is particularly true when MFCC features are combined with VAD, as VAD helps stabilize the validation performance of the MLP model.
The model accuracy is shown in
Figure 11. The MLP model achieves the highest training accuracy, surpassing both LSTM and BiLSTM. The prominent features learned by the MLP model contribute to a steeper rise in accuracy. Although the accuracy of LSTM and BiLSTM improves over the epochs, it remains slightly below that of the MLP model, indicating that these models require more epochs to capture sequential information but still do not reach the accuracy level of MLP. This figure reinforces the idea that using VAD with MFCCs improves model performance, particularly in noisy environments.
The validation accuracy is shown in
Figure 12. As depicted, while the LSTM and BiLSTM models have the lowest validation accuracy, the MLP model performs significantly better. The consistency of the validation accuracy curve for the MLP suggests that it can predict speakers in the test data more reliably. Although LSTM and BiLSTM show steady improvements in accuracy, they exhibit greater fluctuations from iteration to iteration than the MLP, and their rate of accuracy improvement is relatively slower. This further highlights the effectiveness of the MLP model, particularly when using MFCCs in combination with VAD, which leads to superior validation performance.
In both
Figure 11 and
Figure 12, it can be observed that MFCCs with VAD outperform both MFCCs without VAD and spectrogram features with and without VAD. These results also suggest that spectrograms perform poorly for speaker identification compared to MFCC features. After identifying the best feature set with VAD, we compare our proposed model, i.e., the MLP with softmax, against LSTM and BiLSTM classifiers. The accuracy results of the MLP, LSTM, and BiLSTM classifiers are shown in
Table 5.
Note that the MLP, LSTM, and BiLSTM models are all compared under the same settings.
It can be seen that the proposed system outperforms the LSTM and BiLSTM classifiers.
Figure 13 compares the model accuracy and validation accuracy of the MLP, LSTM, and BiLSTM models. The error rate is consistently lower for the MLP than for the LSTM and BiLSTM models, which demonstrates more accurate speaker identification. The sharp decline in the MSE for the MLP confirms its ability to reduce prediction errors in less time than the recurrent models. The MSE of the LSTM and BiLSTM models remains slightly higher throughout, suggesting that their sequential nature may introduce more variability in predictions, especially when noise affects the data.
Figure 14 compares the model loss and validation loss of these models, showing that the validation MSE for the MLP continues to be lower than that of LSTM and BiLSTM, indicating that the MLP generalizes better to unseen data. The relatively stable and lower MSE curve for the MLP supports the hypothesis that MFCC features combined with VAD enhance noise robustness. The MSE for both LSTM and BiLSTM, which remains slightly higher than that of the MLP, highlights their sensitivity to noise and variable-length data.
To evaluate the performance of our proposed system, we compare it with the following baselines.
Baseline 1: We use a multilayer perceptron without a softmax output layer and with VAD as our first baseline.
Baseline 2: We use a multilayer perceptron with a softmax output layer and without VAD as our second baseline.
We compare the proposed system against these two baseline models, as well as the two following recurrent neural network (RNN) architectures: Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM).
The results clearly show that our method outperforms the two baseline models. The first baseline, based on an MLP with VAD but without the softmax output layer (Baseline 1), provides lower accuracy. This indicates that including softmax allows the model to extract speaker information more effectively and distinguish it from noise, which is crucial in noisy environments. Baseline 2, which lacks VAD, performs the worst of all systems (and significantly worse compared to the proposed system), underscoring the importance of VAD in reducing noise and improving overall accuracy.
The proposed system also outperforms both LSTM and BiLSTM, as shown in
Figure 15 and
Figure 16. This result is somewhat surprising, given that RNN architectures like LSTM and BiLSTM are expected to perform well on sequential data. However, the integration of MFCCs with VAD and softmax appears to provide a more powerful feature set for the MLP, allowing it to outperform these sequential models. Additionally, the oscillations observed in the MLP's loss may be caused by the learning rate or the complex nature of the dataset, which includes noisy environments. These oscillations are not a major concern, as the general trend shows convergence, and the model continues to improve in accuracy over time. Furthermore, the validation loss remains stable, indicating that the model has not overfitted and is capable of generalizing to unseen inputs.
Moreover, the proposed system converges faster than the LSTM and BiLSTM models, as shown in
Figure 17 and
Figure 18. A key advantage is this faster convergence: the model requires less computing time and fewer resources to optimize performance. This makes the simpler MLP architecture, combined with VAD and softmax, more efficient for speaker identification tasks while matching or exceeding the performance of the more complex LSTM and BiLSTM models. Our proposed system, leveraging an MLP with VAD and softmax, converges faster than the RNN architectures and achieves higher accuracy, making it a strong choice for speaker identification, especially in noisy environments. The results highlight the effectiveness of integrating VAD for noise reduction and softmax for enhancing speaker identification, providing a robust solution for real-world applications.