Voiceprint Recognition under Cross-Scenario Conditions Using Perceptual Wavelet Packet Entropy-Guided Efficient-Channel-Attention–Res2Net–Time-Delay-Neural-Network Model
Abstract
1. Introduction
- Speech feature extraction algorithms are one of the key technologies in voiceprint recognition: they extract speech features from the raw speech signal. Currently, the most widely used feature extraction methods are MFCC [14] and Fbank [15]. However, these traditional methods have limited expressiveness for non-stationary signals and are sensitive to noise [16]. Some researchers have therefore proposed extracting speech features with the wavelet transform (WT). Wavelet-based algorithms, such as wavelet cepstral coefficients, eliminate the need for Mel filtering and compress the spectral information of speech by exploiting the perceptual characteristics of the wavelet transform, which simplifies the feature extraction process [17,18]. In addition, building on the good stability, discriminability, and interference resistance of Shannon entropy, researchers have proposed entropy features based on the wavelet packet transform (WPT). Such a feature vector consists of the Shannon entropies of the signal’s wavelet sub-band power spectra; it can represent abrupt changes in the signal and has a low dimensionality, making it well suited to non-stationary signals [19,20] (an illustrative sketch of this computation is given at the end of this introduction). To enhance the analysis capability of WPT for speech signals and reduce its computational complexity, researchers have further introduced the perceptual wavelet packet entropy (PWPE) feature extraction algorithm, which analyzes speech information accurately, suppresses acoustic noise, reduces the number of parameters, and shortens the feature extraction time [21,22,23].
- The objective of research on speaker modeling is to develop speaker models capable of extracting speaker identity information from speech features. Popular speaker recognition models, such as ECAPA-TDNN, built on time-delay neural networks, and the r-vector model, built on deep residual networks, have demonstrated outstanding performance in text-independent speaker recognition tasks [24,25]. ECAPA-TDNN, proposed by Desplanques et al. of Ghent University in Belgium in 2020, introduced squeeze-excitation (SE) blocks and a channel attention mechanism into the TDNN-based speaker verification architecture [26,27,28], and this approach took first place in an international speaker recognition competition. However, ECAPA-TDNN remains susceptible to noise interference, leaving room for improvement. Firstly, noise reduction techniques can be integrated to pre-process the speech features during feature extraction. Secondly, feature enhancement techniques can be employed to strengthen the speech characteristics. Moreover, increasing the network depth and improving the attention pooling of ECAPA-TDNN can further improve its robustness. These enhancements aim to mitigate the impact of noise on the model’s performance.
- The ECAPA-TDNN model is improved by using denoised PWPE features as its input. This effectively reduces noise interference in the model’s input features and strengthens its noise robustness.
- The ECAPA-TDNN model structure is improved by increasing the network depth, learning channel weights, incorporating attention-based statistical pooling, and optimizing the loss function. These modifications yield a model with better robustness. This paper conducts speaker recognition experiments in different scenarios using the proposed method and model, and the experimental results demonstrate that the improved model outperforms the baseline ECAPA-TDNN model based on MFCC features in terms of robustness in cross-scene recognition.
- In order to investigate the recognition performance of the current mainstream models in different scenarios, a set of experiments is designed for both single- and multi-scene conditions. Based on the experimental results, improvements are made to the ECAPA-TDNN model, which shows a better cross-scene recognition performance. The ECAPA-TDNN model is used as the baseline for comparison with the designed model.
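As a rough illustration of the wavelet-packet entropy features discussed above, the sketch below computes the Shannon entropy of each wavelet-packet sub-band of a single speech frame using PyWavelets. The wavelet family, decomposition level, and frame length are illustrative assumptions, and the perceptual band grouping and threshold denoising stages of the full PWPE pipeline are omitted.

```python
import numpy as np
import pywt

def wavelet_packet_entropy(frame, wavelet="db4", level=4):
    """Shannon entropy of each wavelet-packet sub-band of one speech frame."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    entropies = []
    for node in wp.get_level(level, order="freq"):       # 2**level sub-bands
        coeffs = np.asarray(node.data)
        power = coeffs ** 2
        p = power / (power.sum() + 1e-12)                 # normalized sub-band power spectrum
        entropies.append(-np.sum(p * np.log(p + 1e-12)))  # Shannon entropy of the sub-band
    return np.asarray(entropies)                          # feature vector of length 2**level

# Example: one 25 ms frame of 16 kHz speech (random data for illustration)
frame = np.random.randn(400)
print(wavelet_packet_entropy(frame).shape)  # (16,)
```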
2. Methods
2.1. Model Architecture
2.2. The Feature Extraction Method for PWPE
2.3. Speech Enhancement Based on Wavelet Thresholding
- Hard thresholding: wavelet packet coefficients whose magnitudes fall below the threshold are set to zero, while the remaining coefficients are kept unchanged.
- Soft thresholding: coefficients below the threshold are set to zero, and the remaining coefficients are shrunk toward zero by the threshold value.
- Compromise thresholding: a rule between the hard and soft rules that shrinks the retained coefficients by only a fraction of the threshold; the three denoising rules are summarized below.
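The equations referenced above are not reproduced here, so the standard forms of the three rules are sketched below, where $w_{j,k}$ denotes a wavelet packet coefficient, $\hat{w}_{j,k}$ its thresholded value, $\lambda$ the threshold, and $0 \le \alpha \le 1$ the compromise factor; the exact parameterization used in the paper, particularly for the compromise rule, may differ.

```latex
% Hard thresholding: keep or kill
\hat{w}_{j,k} =
\begin{cases}
  w_{j,k}, & |w_{j,k}| \ge \lambda \\
  0,       & |w_{j,k}| < \lambda
\end{cases}

% Soft thresholding: shrink retained coefficients by the full threshold
\hat{w}_{j,k} =
\begin{cases}
  \operatorname{sgn}(w_{j,k})\bigl(|w_{j,k}| - \lambda\bigr), & |w_{j,k}| \ge \lambda \\
  0, & |w_{j,k}| < \lambda
\end{cases}

% Compromise thresholding: shrink retained coefficients by a fraction of the threshold
\hat{w}_{j,k} =
\begin{cases}
  \operatorname{sgn}(w_{j,k})\bigl(|w_{j,k}| - \alpha\lambda\bigr), & |w_{j,k}| \ge \lambda \\
  0, & |w_{j,k}| < \lambda
\end{cases}
```

Hard thresholding preserves the magnitudes of retained coefficients but can introduce discontinuities, soft thresholding is smooth but biases large coefficients, and the compromise rule trades off between the two.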
2.4. Speaker Recognition Model ECA-Res2Net-TDNN
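As a rough illustration of the channel-attention idea behind the model name, the following PyTorch-style sketch shows an efficient-channel-attention (ECA) layer applied to a (channels × time) feature map. The class name, kernel size, and one-dimensional adaptation are illustrative assumptions rather than the authors’ exact implementation; in contrast to the SE block used in ECAPA-TDNN, ECA learns channel weights with a single one-dimensional convolution across channels and avoids the SE block’s dimensionality-reducing fully connected layers.

```python
import torch
import torch.nn as nn

class ECALayer1d(nn.Module):
    """Efficient channel attention for (batch, channels, time) feature maps."""
    def __init__(self, kernel_size: int = 5):
        super().__init__()
        # One shared 1-D convolution across the channel dimension replaces
        # the SE block's two fully connected layers.
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=2)                    # global average pooling over time -> (B, C)
        s = self.conv(s.unsqueeze(1))        # local cross-channel interaction -> (B, 1, C)
        w = self.sigmoid(s).transpose(1, 2)  # channel weights -> (B, C, 1)
        return x * w                         # re-weight the input channels

# Example: re-weight a batch of 1024-channel, 200-frame feature maps
features = torch.randn(8, 1024, 200)
print(ECALayer1d()(features).shape)  # torch.Size([8, 1024, 200])
```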
3. Results and Discussion
3.1. Dataset Construction and Analysis
3.2. Dataset Partitioning and Recognition Performance of Different Scenes
- Quiet environments, such as normal indoor conversations and silent libraries, are typically around 40–60 decibels; speech can be heard clearly at this level.
- Normal conversations, TV volume, office noise, city traffic noise, etc., are generally around 60–70 decibels; speech can still be heard normally at this level.
- High-volume music, vehicle noise, construction site noise, etc., are typically around 70–100 decibels; speech is difficult to hear clearly at this level.
3.3. Ablation Experiments on Different Feature Extraction Methods and Models
3.4. Model Complexity and Recognition Time of Different Models
3.5. Ablation Experiments on Different Modules of PWPE/ECA-Res2Net-TDNN
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gui, S.; Zhou, C.; Wang, H.; Gao, T. Application of Voiceprint Recognition Technology Based on Channel Confrontation Training in the Field of Information Security. Electronics 2023, 12, 3309. [Google Scholar] [CrossRef]
- Li, S.-A.; Liu, Y.-Y.; Chen, Y.-C.; Feng, H.-M.; Shen, P.-K.; Wu, Y.-C. Voice Interaction Recognition Design in Real-Life Scenario Mobile Robot Applications. Appl. Sci. 2023, 13, 3359. [Google Scholar] [CrossRef]
- Cheng, S.; Shen, Y.; Wang, D. Target Speaker Extraction by Fusing Voiceprint Features. Appl. Sci. 2022, 12, 8152. [Google Scholar] [CrossRef]
- Ye, F.; Yang, J. A Deep Neural Network Model for Speaker Identification. Appl. Sci. 2021, 11, 3603. [Google Scholar] [CrossRef]
- Yao, W.; Xu, Y.; Qian, Y.; Sheng, G.; Jiang, X. A Classification System for Insulation Defect Identification of Gas-Insulated Switchgear (GIS), Based on Voiceprint Recognition Technology. Appl. Sci. 2020, 10, 3995. [Google Scholar] [CrossRef]
- Shi, Y.; Zhou, J.; Long, Y.; Li, Y.; Mao, H. Addressing Text-Dependent Speaker Verification Using Singing Speech. Appl. Sci. 2019, 9, 2636. [Google Scholar] [CrossRef]
- Uyulan, Ç.; Mayor, D.; Steffert, T.; Watson, T.; Banks, D. Classification of the Central Effects of Transcutaneous Electroacupuncture Stimulation (TEAS) at Different Frequencies: A Deep Learning Approach Using Wavelet Packet Decomposition with an Entropy Estimator. Appl. Sci. 2023, 13, 2703. [Google Scholar] [CrossRef]
- Sun, T.; Wang, X.; Zhang, K.; Jiang, D.; Lin, D.; Jv, X.; Ding, B.; Zhu, W. Medical Image Authentication Method Based on the Wavelet Packet and Energy Entropy. Entropy 2022, 24, 798. [Google Scholar] [CrossRef]
- Zhang, Y.; Xie, X.; Li, H.; Zhou, B. An Unsupervised Tunnel Damage Identification Method Based on Convolutional Variational Auto-Encoder and Wavelet Packet Analysis. Sensors 2022, 22, 2412. [Google Scholar] [CrossRef]
- Lei, L.; She, K. Identity Vector Extraction by Perceptual Wavelet Packet Entropy and Convolutional Neural Network for Voice Authentication. Entropy 2018, 20, 600. [Google Scholar] [CrossRef]
- Lei, L.; She, K. Speaker Recognition Using Wavelet Cepstral Coefficient, I-Vector, and Cosine Distance Scoring and Its Application for Forensics. J. Electr. Comput. Eng. 2016, 2016, 462–472. [Google Scholar] [CrossRef]
- Daqrouq, K.; Sweidan, H.; Balamesh, A.; Ajour, M.N. Off-Line Handwritten Signature Recognition by Wavelet Entropy and Neural Network. Entropy 2017, 19, 252. [Google Scholar] [CrossRef]
- Dawalatabad, N.; Ravanelli, M.; Grondin, F.; Thienpondt, J.; Desplanques, B.; Na, H. ECAPA-TDNN Embeddings for Speaker Diarization. arXiv 2021, arXiv:2104.01466. [Google Scholar]
- Jung, S.-Y.; Liao, C.-H.; Wu, Y.-S.; Yuan, S.-M.; Sun, C.-T. Efficiently Classifying Lung Sounds through Depthwise Separable CNN Models with Fused STFT and MFCC Features. Diagnostics 2021, 11, 732. [Google Scholar] [CrossRef]
- Joy, N.M.; Oglic, D.; Cvetkovic, Z.; Bell, P.; Renals, S. Deep Scattering Power Spectrum Features for Robust Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 1673–1677. [Google Scholar]
- Gao, Z.; Song, Y.; McLoughlin, I.; Li, P.; Jiang, Y.; Dai, L.-R. Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 361–365. [Google Scholar]
- Bousquet, P.-M.; Rouvier, M.; Bonastre, J.-F. Reliability criterion based on learning-phase entropy for speaker recognition with neural network. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 281–285. [Google Scholar]
- Sang, M.; Hansen, J.H.L. Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 321–325. [Google Scholar]
- Stafylakis, T.; Mosner, L.; Plchot, O.; Rohdin, J.; Silnova, A.; Burget, L.; Černocký, J. Training speaker embedding extractors using multi-speaker audio with unknown speaker boundaries. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 605–609. [Google Scholar]
- Luu, C.; Renals, S.; Bell, P. Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations. In Proceedings of the Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; pp. 610–614. [Google Scholar]
- Zhu, H.; Lee, K.A.; Li, H. Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. In Proceedings of the Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 106–110. [Google Scholar]
- Li, G.; Liang, S.; Nie, S.; Liu, W.; Yang, Z.; Xiao, L. Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-Channel Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 51–55. [Google Scholar]
- Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2010, 19, 788–798. [Google Scholar] [CrossRef]
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust dnn embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
- Liu, Y.; He, L.; Liu, W.; Liu, J. Exploring a unified attention based pooling framework for speaker verification. In Proceedings of the International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, 26–29 November 2018; pp. 200–204. [Google Scholar]
- Cai, D.; Wang, W.; Li, M. An iterative framework for selfsupervised deep speaker representation learning. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6728–6732. [Google Scholar]
- Yang, J.; Jiang, J. Dilated-CBAM: An Efficient Attention Network with Dilated Convolution. In Proceedings of the IEEE International Conference on Unmanned Systems (ICUS), Beijing, China, 15–17 October 2021; pp. 11–15. [Google Scholar]
- Liu, R.; Cai, W.; Li, G.; Ning, X.; Jiang, Y. Hybrid Dilated Convolution Guided Feature Filtering and Enhancement Strategy for Hyperspectral Image Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 5508105. [Google Scholar] [CrossRef]
- Yang, L.; Chen, W.; Wang, H.; Chen, Y. Deep learning seismic random noise attenuation via improved residual convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7968–7981. [Google Scholar] [CrossRef]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar]
- Haitao, C.; Yu, L.; Yun, Y. Research on voiceprint recognition system based on ECAPA-TDNN-GRU architecture. In Proceedings of the International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 24–26 February 2023; pp. 1508–1513. [Google Scholar]
- Li, J.; Xu, Q.; Kadoch, M. A Study Of Voiceprint Recognition Technology Based on Deep Learning. In Proceedings of the International Wireless Communications and Mobile Computing (IWCMC), Dubrovnik, Croatia, 30 May–3 June 2022; pp. 24–27. [Google Scholar]
- Dong, X.; Song, J. Application of Voiceprint Recognition Based on Improved ECAPA-TDNN. In Proceedings of the International Academic Exchange Conference on Science and Technology Innovation (IAECST), Guangzhou, China, 9–11 December 2022; pp. 1196–1199. [Google Scholar]
- Bayerl, S.P.; Wagner, D.; Baumann, I.; Bocklet, T.; Riedhammer, K. Detecting Vocal Fatigue with Neural Embeddings. J. Voice 2023, 1, 3428–3439. [Google Scholar] [CrossRef]
- Zhu, H.; Lee, K.A.; Li, H. Discriminative speaker embedding with serialized multi-layer multi-head attention. Speech Commun. 2022, 144, 89–100. [Google Scholar] [CrossRef]
- Strake, M.; Defraene, B.; Fluyt, K.; Tirry, W.; Fingscheidt, T. INTERSPEECH 2020 Deep Noise Suppression Challenge: A Fully Convolutional Recurrent Network (FCRN) for Joint Dereverberation and Denoising. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 2467–2471. [Google Scholar]
- Li, Y.; Zhang, X.; Zhang, X.; Li, H.; Zhang, W. Unconstrained vocal pattern recognition algorithm based on attention mechanism. Digit. Signal Process. 2023, 136, 103973. [Google Scholar] [CrossRef]
- Lin, S.; Zhang, M.; Cheng, X.; Zhou, K.; Zhao, S.; Wang, H. Hyperspectral anomaly detection via sparse representation and collaborative representation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 946–961. [Google Scholar] [CrossRef]
- Lin, S.; Zhang, M.; Cheng, X.; Wang, L.; Xu, M.; Wang, H. Hyperspectral Anomaly Detection via Dual Dictionaries Construction Guided by Two-Stage Complementary Decision. Remote Sens. 2022, 14, 1784. [Google Scholar] [CrossRef]
- Zi, Y.; Xiong, S. Joint filter combination-based central difference feature extraction and attention-enhanced Dense-Res2Block network for short-utterance speaker recognition. Expert Syst. Appl. 2023, 233, 1–12. [Google Scholar] [CrossRef]
- Hanifa, R.M.; Isa, K.; Mohamad, S. A review on speaker recognition: Technology and challenges. Comput. Electr. Eng. 2021, 90, 1–14. [Google Scholar]
- Lin, S.; Zhang, M.; Cheng, X.; Zhou, K.; Zhao, S.; Wang, H. Dual Collaborative Constraints Regularized Low-Rank and Sparse Representation via Robust Dictionaries Construction for Hyperspectral Anomaly Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 2009–2024. [Google Scholar] [CrossRef]
- Lin, S.; Zhang, M.; Cheng, X.; Zhao, S.; Shi, L.; Wang, H. Hyperspectral Anomaly Detection Using Spatial–Spectral-Based Union Dictionary and Improved Saliency Weight. Remote Sens. 2023, 15, 3609. [Google Scholar] [CrossRef]
- Tsao, Y.; Lin, T.-H.; Chen, F.; Chang, Y.-F.; Cheng, C.-H.; Tsai, K.-H. Robust S1 and S2 heart sound recognition based on spectral restoration and multi-style training. Biomed. Signal Process. Control 2019, 49, 173–180. [Google Scholar] [CrossRef]
- Lee, J.; Nam, J. Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging. IEEE Signal Process. Lett. 2017, 24, 1208–1212. [Google Scholar] [CrossRef]
- Le, X.; Lei, T.; Chen, K.; Lu, J. Inference Skipping for More Efficient Real-Time Speech Enhancement With Parallel RNNs. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2411–2421. [Google Scholar] [CrossRef]
Genres | Speakers | Utterances | ECAPA-TDNN EER (%) | x-Vector EER (%) |
---|---|---|---|---|
Advertisement | 75 | 781 | 8.16 | 9.37 |
Drama | 377 | 4521 | 10.22 | 11.70 |
Entertainment | 1020 | 18,931 | 6.75 | 7.31 |
Interview | 1253 | 41,586 | 6.12 | 6.98 |
Live broadcast | 496 | 154,249 | 4.87 | 5.42 |
Movie | 165 | 1495 | 10.35 | 11.47 |
Play | 170 | 5476 | 10.61 | 11.56 |
Recitation | 259 | 58,839 | 15.66 | 16.55 |
Singing | 683 | 32,279 | 19.12 | 20.86 |
Speech | 331 | 39,792 | 3.10 | 3.21 |
Vlog | 524 | 120,812 | 4.63 | 5.31 |
Overall | 3000 | 485,361 | 26.78 | 27.43 |
Scenes | Noise Level | Decibel Interval |
---|---|---|
Speech, Vlog, Live broadcast, Interview | Low | 40–60 dB |
Entertainment, Advertisement, Drama, Movie, Play | Medium | 60–70 dB |
Recitation, Singing | High | >70 dB |
Enroll Scene | Test Scene | MFCC/ECAPA-TDNN EER (%) | PWPE/ECAPA-TDNN EER (%) | PWPE/ECA-Res2Net-TDNN EER (%) |
---|---|---|---|---|
Speech | Advertisement | 28.36 | 26.78 | 25.50 |
Speech | Singing | 23.43 | 22.69 | 21.52 |
Speech | Speech | 3.10 | 3.02 | 2.99 |
Advertisement | Advertisement | 8.16 | 8.06 | 7.98 |
Advertisement | Singing | 23.66 | 21.27 | 20.45 |
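The recognition results above are reported as equal error rates (EER). As a generic illustration of how an EER can be estimated from genuine and impostor trial scores (a simple threshold sweep; not the authors’ evaluation code, and the function and variable names are assumptions):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate EER: the point where false-acceptance and false-rejection rates meet."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor >= t)  # false acceptance rate at threshold t
        frr = np.mean(genuine < t)    # false rejection rate at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example with synthetic cosine-similarity scores
rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.1, 1_000)     # target (same-speaker) trials
impostor = rng.normal(0.3, 0.1, 10_000)   # non-target (different-speaker) trials
print(f"EER = {100 * equal_error_rate(genuine, impostor):.2f}%")
```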
System | CAM | Attention | Loss | EER (%) |
---|---|---|---|---|
A | ECA | Multi-head | Sub-center ArcFace | 21.52 |
A1 | SE | Multi-head | Sub-center ArcFace | 21.90 |
A2 | ECA | Attentive statistics | Sub-center ArcFace | 21.63 |
A3 | ECA | Multi-head | AAM softmax | 21.72 |
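System A2 in the ablation above replaces the multi-head attention pooling with attentive statistics pooling. As a rough PyTorch-style sketch of what attentive statistics pooling computes (layer names and the bottleneck size are assumptions, not the authors’ implementation), frame-level features are weighted by learned attention scores and then summarized by their weighted mean and standard deviation:

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attention-weighted mean and standard deviation over the time axis."""
    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        alpha = torch.softmax(self.attention(x), dim=2)    # attention weights over time
        mean = torch.sum(alpha * x, dim=2)                 # weighted mean
        var = torch.sum(alpha * x ** 2, dim=2) - mean ** 2
        std = torch.sqrt(var.clamp(min=1e-9))              # weighted standard deviation
        return torch.cat([mean, std], dim=1)               # (batch, 2 * channels) utterance vector

# Example: pool a batch of 1536-channel frame sequences into utterance-level vectors
frames = torch.randn(8, 1536, 200)
print(AttentiveStatsPooling(1536)(frames).shape)  # torch.Size([8, 3072])
```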