Audio–Visual Fusion Based on Interactive Attention for Person Verification
Abstract
1. Introduction
- We propose a gate-based multimodal feature fusion model that provides a flexible and effective way to control the flow of information between modalities in audio–visual fusion. With this method, speech and facial features are fused more effectively, which improves the performance of the verification system (a minimal sketch of the gating idea follows this list).
- We propose a multimodal feature fusion model based on interactive attention. When computing attention scores, this model allows greater interaction between the two modalities than traditional attention mechanisms. After processing the unimodal and concatenated features with interactive attention, we add the unimodal feature vectors, transformed by a fully connected layer, to the feature vectors produced by interactive attention. The advantage of this approach is that it better preserves the original information and avoids losing important features during fusion.
- The proposed models are easy to implement and quick to optimize, since all operations are performed in the feature-vector space. In this study, for example, we train the fusion network on precomputed feature vectors, which greatly reduces experiment time.
- The proposed fusion models are decoupled from the front-end extractors, so they generalize to feature vectors produced by different pretrained models. Because of this decoupling, the method can be applied to other modalities and pretrained models: as long as the extracted feature vectors match the data format expected by the fusion stage, the same fusion model can be trained on other modality combinations, or stronger pretrained extractors can be substituted to further improve performance.
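The following is a minimal sketch of the gate-based fusion idea in the first contribution, in the spirit of gated multimodal units (Arevalo et al.). The embedding dimensions, layer names, and the module name are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate-based audio-visual feature fusion (illustrative sketch)."""
    def __init__(self, audio_dim=192, visual_dim=512, fused_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, fused_dim)    # project speech embedding
        self.visual_proj = nn.Linear(visual_dim, fused_dim)  # project face embedding
        # the gate sees both modalities and decides, per dimension, how much
        # information flows from each one into the fused representation
        self.gate = nn.Linear(audio_dim + visual_dim, fused_dim)

    def forward(self, audio_emb, visual_emb):
        h_a = torch.tanh(self.audio_proj(audio_emb))
        h_v = torch.tanh(self.visual_proj(visual_emb))
        z = torch.sigmoid(self.gate(torch.cat([audio_emb, visual_emb], dim=-1)))
        return z * h_a + (1 - z) * h_v

# usage: fuse a batch of speaker and face embeddings
fusion = GatedFusion()
fused = fusion(torch.randn(8, 192), torch.randn(8, 512))  # -> (8, 256)
```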
2. Related Work
2.1. Face Verification
2.2. Speaker Verification
2.3. Audio–Visual Person Verification
3. System
3.1. Feature Extractor
3.1.1. Face Feature Extractor
3.1.2. Speaker Feature Extractor
- SE-Res2Block: Res2Net Block + SE block;
- Multi-layer feature aggregation and summation;
- Attentive statistics pooling (a minimal sketch of this pooling layer follows this list).
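The sketch below illustrates attentive statistics pooling as described by Desplanques et al. for ECAPA-TDNN: frame-level features are weighted by learned attention scores, and an attention-weighted mean and standard deviation are concatenated into a fixed-length utterance embedding. The hidden size and the simplified (non channel- and context-dependent) attention are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Attentive statistics pooling over the time axis (illustrative sketch)."""
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(hidden, in_dim, kernel_size=1),
        )

    def forward(self, x):                                # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)  # per-frame attention weights
        mean = torch.sum(alpha * x, dim=2)               # attention-weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean * mean
        std = torch.sqrt(var.clamp(min=1e-9))            # attention-weighted std
        return torch.cat([mean, std], dim=1)             # (batch, 2 * channels)

pool = AttentiveStatsPool(in_dim=1536)
utterance_emb = pool(torch.randn(8, 1536, 200))          # -> (8, 3072)
```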
3.1.3. Extractor Parameters
3.2. Fusion Method
3.2.1. Concat Feature Fusion
3.2.2. Attention Feature Fusion
3.2.3. Gated Feature Fusion
3.2.4. Inter–Attention Feature Fusion
1. The attention scores in interactive attention are computed jointly from both modalities, so they capture more cross-modal interaction than a simple attention mechanism;
2. After processing the unimodal and multimodal information, the unimodal information transformed by the FC layer is added back to the multimodal information; this prevents the loss of critical information (a hedged sketch of this fusion step follows this list).
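The sketch below is a hedged illustration of the inter-attention fusion step described above: attention weights are computed jointly from both modalities, and FC-transformed unimodal embeddings are added back to the attended features so that the original information is preserved. The common embedding dimension, layer sizes, and exact wiring are assumptions for illustration, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class InterAttentionFusion(nn.Module):
    """Interactive-attention fusion with unimodal residual paths (illustrative sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(2 * dim, 2)   # joint attention scores for the two modalities
        self.fc_a = nn.Linear(dim, dim)      # FC residual path for the audio embedding
        self.fc_v = nn.Linear(dim, dim)      # FC residual path for the visual embedding

    def forward(self, a, v):                 # a, v: (batch, dim) unimodal embeddings
        joint = torch.cat([a, v], dim=-1)
        w = torch.softmax(self.score(joint), dim=-1)    # weights depend on both modalities
        attended = w[:, :1] * a + w[:, 1:] * v          # interactively weighted fusion
        return attended + self.fc_a(a) + self.fc_v(v)   # re-add FC-transformed unimodal info

fusion = InterAttentionFusion(dim=256)
person_emb = fusion(torch.randn(8, 256), torch.randn(8, 256))   # -> (8, 256)
```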
4. Experimental Setup
4.1. Dataset
Experimental Parameters
4.2. Experimental Results and Analysis
4.2.1. Analysis of Unimodal Experimental Results
- Language Discrepancy between Datasets: Our model was trained on the development set of VoxCeleb1, which consists primarily of English audio, whereas the CNC-AV dataset consists mainly of Chinese audio. This linguistic mismatch can cause performance variations, since language characteristics play a significant role in speech-based recognition.
- Limitations in Model Generalization: Another potential factor is limited generalization of the chosen model across datasets. If the model generalizes poorly across languages and data types, it may underperform on specific datasets, especially when those datasets differ markedly from the training data.
4.2.2. Analysis of Experimental Results of Audio–Visual Fusion
Performance Analysis under Different Models
Performance Comparisons with Other Algorithms
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the ICASSP 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Waibel, A.; Hanazawa, T.; Hinton, G.E.; Shikano, K.; Lang, K.J. Phoneme recognition using time-delay neural networks. Readings Speech Recognit. 1990, 1, 393–404. [Google Scholar]
- Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020. [Google Scholar]
- Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694. [Google Scholar] [CrossRef]
- Zhang, C.; Koishida, K. End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
- Wang, F.; Cheng, J.; Liu, W.; Liu, H. Additive Margin Softmax for Face Verification. IEEE Signal Process. Lett. 2018, 25, 926–930. [Google Scholar] [CrossRef]
- Shon, S.; Oh, T.H.; Glass, J. Noise-tolerant audio-visual online person verification using an attention-based neural network fusion. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 3995–3999. [Google Scholar]
- Rao, Y.; Lin, J.; Lu, J.; Zhou, J. Learning Discriminative Aggregation Network for Video-Based Face Recognition. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
- Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [Google Scholar] [CrossRef]
- Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments. In Proceedings of the Workshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition, Marseille, France, 17–20 October 2008. [Google Scholar]
- Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2892–2900. [Google Scholar] [CrossRef]
- Sun, Y.; Wang, X.; Tang, X. Deep Learning Face Representation from Predicting 10,000 Classes. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1891–1898. [Google Scholar] [CrossRef]
- Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the BMVC, Swansea, UK, 7–10 September 2015. [Google Scholar]
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar] [CrossRef]
- Gutman, Y. Speaker Verification using Phoneme-Adapted Gaussian Mixture Models. In Proceedings of the European Signal Processing Conference, Nice, France, 31 August–4 September 2015. [Google Scholar]
- Dehak, N.; Kenny, P.J.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
- Variani, E.; Xin, L.; Mcdermott, E.; Moreno, I.L.; Gonzalez-Dominguez, J. Deep neural networks for small footprint text-dependent speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 7–13 May 2014. [Google Scholar]
- Snyder, D.; Garcia-Romero, D.; Povey, D.; Khudanpur, S. Deep Neural Network Embeddings for Text-Independent Speaker Verification. In Proceedings of the Interspeech 2017, Stockholm, Sweden, 20–24 August 2017. [Google Scholar]
- Sell, G.; Duh, K.; Snyder, D.; Etter, D.; Garcia-Romero, D. Audio-Visual Person Recognition in Multimedia Data From the Iarpa Janus Program. In Proceedings of the ICASSP 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
- Alam, J.; Boulianne, G.; Burget, L.; Dahmane, M.; Zeinali, H. Analysis of ABC Submission to NIST SRE 2019 CMN and VAST Challenge. In Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 2–5 November 2020. [Google Scholar]
- Luque, J.; Morros, R.; Garde, A.; Anguita, J.; Hernando, J. Audio, Video and Multimodal Person Identification in a Smart Room. In Proceedings of the First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, Southampton, UK, 6–7 April 2006; pp. 258–269. [Google Scholar]
- Hormann, S.; Moiz, A.; Knoche, M.; Rigoll, G. Attention Fusion for Audio-Visual Person Verification Using Multi-Scale Features. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020. [Google Scholar]
- Qian, Y.; Chen, Z.; Wang, S. Audio-Visual Deep Neural Network for Robust Person Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1079–1092. [Google Scholar] [CrossRef]
- Abdrakhmanova, M.; Abushakimova, S.; Khassanov, Y.; Varol, H.A. A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data. arXiv 2021, arXiv:2110.12136. [Google Scholar]
- Saeed, M.S.; Nawaz, S.; Khan, M.H.; Javed, S.; Yousaf, M.H.; Bue, A.D. Learning Branched Fusion and Orthogonal Projection for Face-Voice Association. arXiv 2022, arXiv:2208.10238. [Google Scholar]
- Sun, P.; Zhang, S.; Liu, Z.; Yuan, Y.; Zhang, T.; Zhang, H.; Hu, P. Learning Audio-Visual embedding for Person Verification in the Wild. arXiv 2022, arXiv:2209.04093. [Google Scholar]
- Mamieva, D.; Abdusalomov, A.B.; Kutlimuratov, A.; Muminov, B.; Whangbo, T.K. Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. Sensors 2023, 23, 5475. [Google Scholar] [CrossRef]
- Atmaja, B.T.; Sasou, A. Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations. Sensors 2022, 22, 6369. [Google Scholar] [CrossRef]
- Rajasekar, G.P.; de Melo, W.C.; Ullah, N.; Aslam, H.; Zeeshan, O.; Denorme, T.; Pedersoli, M.; Koerich, A.; Bacon, S.; Cardinal, P.; et al. A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition. arXiv 2022, arXiv:2203.14779. [Google Scholar]
- Jeon, S.; Kim, M.S. Noise-Robust Multimodal Audio-Visual Speech Recognition System for Speech-Based Interaction Applications. Sensors 2022, 22, 7738. [Google Scholar] [CrossRef] [PubMed]
- Ma, P.; Haliassos, A.; Fernandez-Lopez, A.; Chen, H.; Petridis, S.; Pantic, M. Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023. [Google Scholar] [CrossRef]
- Lin, J.; Cai, X.; Dinkel, H.; Chen, J.; Yan, Z.; Wang, Y.; Zhang, J.; Wu, Z.; Wang, Y.; Meng, H. AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction. arXiv 2023, arXiv:2306.14170. [Google Scholar]
- Liu, M.; Lee, K.A.; Wang, L.; Zhang, H.; Zeng, C.; Dang, J. Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification. arXiv 2023, arXiv:2302.11254. [Google Scholar]
- Moufidi, A.; Rousseau, D.; Rasti, P. Attention-Based Fusion of Ultrashort Voice Utterances and Depth Videos for Multimodal Person Identification. Sensors 2023, 23, 5890. [Google Scholar] [CrossRef]
- Qin, Z.; Zhao, P.; Zhuang, T.; Deng, F.; Ding, Y.; Chen, D. A survey of identity recognition via data fusion and feature learning. Inf. Fusion 2023, 91, 694–712. [Google Scholar] [CrossRef]
- John, V.; Kawanishi, Y. Audio-Visual Sensor Fusion Framework using Person Attributes Robust to Missing Visual Modality for Person Recognition. In Proceedings of the International Conference on Multimedia Modeling, Bergen, Norway, 9–12 January 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 523–535. [Google Scholar]
- Tracey, J.; Strassel, S.M. VAST: A Corpus of Video Annotation for Speech Technologies. In Proceedings of the Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A large-scale speaker identification dataset. arXiv 2017, arXiv:1706.08612. [Google Scholar]
- Chung, J.S.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep Speaker Recognition. arXiv 2018, arXiv:1806.05622. [Google Scholar]
- Abdrakhmanova, M.; Kuzdeuov, A.; Jarju, S.; Khassanov, Y.; Lewis, M.; Varol, H.A. SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams. Sensors 2021, 21, 3465. [Google Scholar] [CrossRef]
- Li, L.; Li, X.; Jiang, H.; Chen, C.; Hou, R.; Wang, D. CN-Celeb-AV: A Multi-Genre Audio-Visual Dataset for Person Recognition. In Proceedings of the INTERSPEECH 2023, Dublin, Ireland, 20–24 August 2023; pp. 2118–2122. [Google Scholar] [CrossRef]
- Sadjadi, O.; Greenberg, C.; Singer, E.; Reynolds, D.; Hernandez-Cordero, J. The 2019 NIST Audio-Visual Speaker Recognition Evaluation. In Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 2–5 November 2020. [Google Scholar]
- Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882. [Google Scholar]
- Yi, D.; Lei, Z.; Liao, S.; Li, S.Z. Learning Face Representation from Scratch. arXiv 2014, arXiv:1411.7923. [Google Scholar]
- Corbetta, M.; Shulman, G.L. Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci. 2002, 3, 201. [Google Scholar] [CrossRef] [PubMed]
- Arevalo, J.; Solorio, T.; Montes-y Gómez, M.; González, F.A. Gated Multimodal Units for Information Fusion. arXiv 2017, arXiv:1702.01992. [Google Scholar]
- Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Grother, P. IARPA Janus Benchmark-B Face Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Bishop, C. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A Discriminative Feature Learning Approach for Deep Face Recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
Methods | Advantages | Disadvantages |
---|---|---|
Early Fusion | (1) Simple, intuitive, and easy to implement. (2) Considers global information across the audio and visual modalities. | (1) Ignores localized information within each modality. (2) Must handle dimensionality and representation inconsistencies between the modalities. |
Late Fusion | (1) Captures localized information within each modality. (2) Different network structures can be used for different modalities. | (1) Cannot fully exploit the correlation between modalities. (2) May require more parameters and computational resources. |
Mid-level Fusion | (1) Combines the advantages of early and late fusion, taking both global and local information into account. (2) The fusion level can be chosen flexibly. | Places higher demands on the design and adaptation of the network structure. |
Attention-based Fusion | Adaptively captures the key information of each modality. | (1) Training and inference may be more complex. (2) Requires additional computational resources. |
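To make the contrast in the table concrete, the following sketch shows feature-level (early) fusion versus score-level (late) fusion for a single verification trial. Cosine scoring and the equal score weights are illustrative assumptions, not the schemes evaluated in this paper.

```python
import torch
import torch.nn.functional as F

def early_fusion_score(a_enroll, v_enroll, a_test, v_test):
    """Early fusion: concatenate modality embeddings first, then score once."""
    enroll = torch.cat([a_enroll, v_enroll], dim=-1)
    test = torch.cat([a_test, v_test], dim=-1)
    return F.cosine_similarity(enroll, test, dim=-1)

def late_fusion_score(a_enroll, v_enroll, a_test, v_test, w_audio=0.5):
    """Late fusion: score each modality separately, then combine the scores."""
    s_audio = F.cosine_similarity(a_enroll, a_test, dim=-1)
    s_visual = F.cosine_similarity(v_enroll, v_test, dim=-1)
    return w_audio * s_audio + (1 - w_audio) * s_visual

a_e, a_t = torch.randn(192), torch.randn(192)   # speaker embeddings (enroll / test)
v_e, v_t = torch.randn(512), torch.randn(512)   # face embeddings (enroll / test)
print(early_fusion_score(a_e, v_e, a_t, v_t), late_fusion_score(a_e, v_e, a_t, v_t))
```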
VoxCeleb1 | Dev | Test |
---|---|---|
# of speakers | 1211 | 40
# of videos | 21,819 | 677
# of utterances | 128,642 | 4708
# of images | 1,167,721 | 39,085
Condition | Split | #Enroll Videos | #Test Videos | #Target | #Nontarget |
---|---|---|---|---|---|
SRE19 | DEV | 102 | 319 | 244 | 32,294 |
SRE19 | EVAL | 258 | 914 | 681 | 235,131
CN-Celeb-AV | CNC-AV-Dev-F | CNC-AV-Eval-F | CNC-AV-Eval-P |
---|---|---|---|
# of Genres | 11 | 11 | 11
# of Persons | 689 | 197 | 250
# of Segments | 93,973 | 17,717 | 307,973
# of Hours | 199.70 | 41.96 | 427.74
Datasets | Modality | System | EER (%) | minDCF |
---|---|---|---|---|
VoxCeleb1 | Audio | ECAPA-TDNN | 0.98 | 0.068
VoxCeleb1 | Visual | FaceNet | 3.96 | 0.263
VoxCeleb1 | Visual | ResNet50 | 5.26 | 0.276
NIST SRE19 | Audio | ECAPA-TDNN | 7.93 | 0.484
NIST SRE19 | Visual | FaceNet | 9.28 | 0.25
NIST SRE19 | Visual | ResNet50 | 13.85 | 0.358
CNC-AV | Audio | ECAPA-TDNN | 17.04 | 0.764
CNC-AV | Visual | FaceNet | 27.49 | 0.743
CNC-AV | Visual | ResNet50 | 29.89 | 0.776
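As a reminder of how the EER values in these tables can be obtained, the sketch below computes an equal error rate from a set of verification trial scores. The toy labels and scores and the use of scikit-learn's roc_curve are illustrative; this is not the paper's evaluation code, and minDCF additionally requires the NIST-defined cost parameters.

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the operating point where false-acceptance and false-rejection rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = target trial, 0 = nontarget
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])             # toy trial labels
scores = np.array([0.9, 0.7, 0.4, 0.6, 0.8, 0.2])  # toy similarity scores
print(f"EER = {compute_eer(labels, scores) * 100:.2f}%")
```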