A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions
Abstract
1. Introduction

The main contributions of this work can be summarized as follows:
- A review of state-of-the-art AD detection methods that target imitated and synthetically generated voices;
- A brief description of current AD datasets;
- A comparative analysis of existing methods and datasets that highlights the strengths and weaknesses of each AD detection family;
- A quantitative comparison of recent state-of-the-art AD detection methods; and
- A discussion of the challenges and potential future research directions in this area.
2. Types of Audio Deepfake Attacks
3. Fake Audio Detection Methods
4. Fake Audio Detection Datasets
5. Discussion
6. Challenges and Future Research Directions
6.1. Limited AD Detection Methods with Respect to Non-English Languages
6.2. Lack of Accent Assessment in Existing AD Detection Methods
6.3. Excessive Preprocessing to Build Deepfake Detection Models
6.4. Limited Assessment of Noisy Audio in Existing AD Detection Methods
6.5. Limited AD Imitation-Based Detection Methods
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Lyu, S. Deepfake detection: Current challenges and next steps. In Proceedings of the IEEE International Conference on Multimedia & Expo Workshops (ICMEW), London, UK, 6–10 July 2020; pp. 1–6.
2. Diakopoulos, N.; Johnson, D. Anticipating and addressing the ethical implications of deepfakes in the context of elections. New Media Soc. 2021, 23, 2072–2098.
3. Rodríguez-Ortega, Y.; Ballesteros, D.M.; Renza, D. A machine learning model to detect fake voice. In Applied Informatics; Florez, H., Misra, S., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 3–13.
4. Chen, T.; Kumar, A.; Nagarsheth, P.; Sivaraman, G.; Khoury, E. Generalization of audio deepfake detection. In Proceedings of Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 132–137.
5. Ballesteros, D.M.; Rodriguez-Ortega, Y.; Renza, D.; Arce, G. Deep4SNet: Deep learning for fake speech classification. Expert Syst. Appl. 2021, 184, 115465.
6. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Trans. Graph. 2017, 36, 1–13.
7. Stupp, C. Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case. The Wall Street Journal. Available online: https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402 (accessed on 29 January 2022).
8. Chadha, A.; Kumar, V.; Kashyap, S.; Gupta, M. Deepfake: An overview. In Proceedings of Second International Conference on Computing, Communications, and Cyber-Security; Singh, P.K., Wierzchoń, S.T., Tanwar, S., Ganzha, M., Rodrigues, J.J.P.C., Eds.; Springer: Singapore, 2021; pp. 557–566.
9. Tan, X.; Qin, T.; Soong, F.; Liu, T.-Y. A survey on neural speech synthesis. arXiv 2021, arXiv:2106.15561.
10. Ning, Y.; He, S.; Wu, Z.; Xing, C.; Zhang, L.-J. A review of deep learning based speech synthesis. Appl. Sci. 2019, 9, 4050.
11. Ren, Y.; Hu, C.; Tan, X.; Qin, T.; Zhao, S.; Zhao, Z.; Liu, T.-Y. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv 2020, arXiv:2006.04558.
12. Shen, J.; Pang, R.; Weiss, R.J.; Schuster, M.; Jaitly, N.; Yang, Z.; Chen, Z.; Zhang, Y.; Wang, Y.; Skerry-Ryan, R. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4779–4783.
13. Ping, W.; Peng, K.; Gibiansky, A.; Arik, S.O.; Kannan, A.; Narang, S.; Raiman, J.; Miller, J. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv 2017, arXiv:1710.07654.
14. Khanjani, Z.; Watson, G.; Janeja, V.P. How deep are the fakes? Focusing on audio deepfake: A survey. arXiv 2021, arXiv:2111.14203.
15. Pradhan, S.; Sun, W.; Baig, G.; Qiu, L. Combating replay attacks against voice assistants. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2019, 3, 1–26.
16. Ballesteros, D.M.; Rodriguez, Y.; Renza, D. A dataset of histograms of original and fake voice recordings (H-Voice). Data Brief 2020, 29, 105331.
17. Singh, A.K.; Singh, P. Detection of AI-synthesized speech using cepstral & bispectral statistics. In Proceedings of the 2021 IEEE 4th International Conference on Multimedia Information Processing and Retrieval (MIPR), Tokyo, Japan, 8–10 September 2021; pp. 412–417.
18. Borrelli, C.; Bestagini, P.; Antonacci, F.; Sarti, A.; Tubaro, S. Synthetic speech detection through short-term and long-term prediction traces. EURASIP J. Inf. Secur. 2021, 2021, 2.
19. Todisco, M.; Wang, X.; Vestman, V.; Sahidullah, M.; Delgado, H.; Nautsch, A.; Yamagishi, J.; Evans, N.; Kinnunen, T.; Lee, K.A. ASVspoof 2019: Future horizons in spoofed and fake audio detection. arXiv 2019, arXiv:1904.05441.
20. Liu, T.; Yan, D.; Wang, R.; Yan, N.; Chen, G. Identification of fake stereo audio using SVM and CNN. Information 2021, 12, 263.
21. Subramani, N.; Rao, D. Learning efficient representations for fake speech detection. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, 7–12 February 2020; pp. 5859–5866.
22. Bartusiak, E.R.; Delp, E.J. Frequency domain-based detection of generated audio. In Proceedings of the Electronic Imaging Symposium; Society for Imaging Science and Technology: New York, NY, USA, 11–15 January 2021; Volume 2021, pp. 273–281.
23. Lataifeh, M.; Elnagar, A.; Shahin, I.; Nassif, A.B. Arabic audio clips: Identification and discrimination of authentic cantillations from imitations. Neurocomputing 2020, 418, 162–177.
24. Lataifeh, M.; Elnagar, A. Ar-DAD: Arabic diversified audio dataset. Data Brief 2020, 33, 106503.
25. Lei, Z.; Yang, Y.; Liu, C.; Ye, J. Siamese convolutional neural network using Gaussian probability feature for spoofing speech detection. In Proceedings of INTERSPEECH 2020, Shanghai, China, 25–29 October 2020; pp. 1116–1120.
26. Hofbauer, H.; Uhl, A. Calculating a boundary for the significance from the equal-error rate. In Proceedings of the 2016 International Conference on Biometrics (ICB), Halmstad, Sweden, 13–16 June 2016; pp. 1–4.
27. Camacho, S.; Ballesteros, D.M.; Renza, D. Fake speech recognition using deep learning. In Applied Computer Sciences in Engineering; Figueroa-García, J.C., Díaz-Gutierrez, Y., Gaona-García, E.E., Orjuela-Cañón, A.D., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 38–48.
28. Reimao, R.; Tzerpos, V. FoR: A dataset for synthetic speech detection. In Proceedings of the 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), Timisoara, Romania, 10–12 October 2019; pp. 1–10.
29. Yu, H.; Tan, Z.-H.; Ma, Z.; Martin, R.; Guo, J. Spoofing detection in automatic speaker verification systems using DNN classifiers and dynamic acoustic features. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 4633–4644.
30. Wu, Z.; Kinnunen, T.; Evans, N.; Yamagishi, J.; Hanilçi, C.; Sahidullah, M.; Sizov, A. ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge. In Proceedings of INTERSPEECH 2015, Dresden, Germany, 6–10 September 2015; p. 5.
31. Wang, R.; Juefei-Xu, F.; Huang, Y.; Guo, Q.; Xie, X.; Ma, L.; Liu, Y. DeepSonar: Towards effective and robust detection of AI-synthesized fake voices. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1207–1216.
32. Wijethunga, R.L.M.A.P.C.; Matheesha, D.M.K.; Al Noman, A.; De Silva, K.H.V.T.A.; Tissera, M.; Rupasinghe, L. Deepfake audio detection: A deep learning based solution for group conversations. In Proceedings of the 2020 2nd International Conference on Advancements in Computing (ICAC), Malabe, Sri Lanka, 10–11 December 2020; Volume 1, pp. 192–197.
33. Chintha, A.; Thai, B.; Sohrawardi, S.J.; Bhatt, K.M.; Hickerson, A.; Wright, M.; Ptucha, R. Recurrent convolutional structures for audio spoof and video deepfake detection. IEEE J. Sel. Top. Signal Process. 2020, 14, 1024–1037.
34. Kinnunen, T.; Lee, K.A.; Delgado, H.; Evans, N.; Todisco, M.; Sahidullah, M.; Yamagishi, J.; Reynolds, D.A. t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification. arXiv 2018, arXiv:1804.09618.
35. Shan, M.; Tsai, T. A cross-verification approach for protecting world leaders from fake and tampered audio. arXiv 2020, arXiv:2010.12173.
36. Aravind, P.R.; Nechiyil, U.; Paramparambath, N. Audio spoofing verification using deep convolutional neural networks by transfer learning. arXiv 2020, arXiv:2008.03464.
37. Khochare, J.; Joshi, C.; Yenarkar, B.; Suratkar, S.; Kazi, F. A deep learning framework for audio deepfake detection. Arab. J. Sci. Eng. 2021, 47, 3447–3458.
38. Khalid, H.; Kim, M.; Tariq, S.; Woo, S.S. Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In Proceedings of the 1st Workshop on Synthetic Multimedia; Association for Computing Machinery: New York, NY, USA, 20 October 2021; pp. 7–15.
39. Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021) Track on Datasets and Benchmarks, Virtual, 6–14 December 2021; p. 14.
40. Alzantot, M.; Wang, Z.; Srivastava, M.B. Deep residual neural networks for audio spoofing detection. arXiv 2019, arXiv:1907.00501.
41. Arif, T.; Javed, A.; Alhameed, M.; Jeribi, F.; Tahir, A. Voice spoofing countermeasure for logical access attacks detection. IEEE Access 2021, 9, 162857–162868.
42. Lai, C.-I.; Chen, N.; Villalba, J.; Dehak, N. ASSERT: Anti-spoofing with squeeze-excitation and residual networks. arXiv 2019, arXiv:1904.01120.
43. Jiang, Z.; Zhu, H.; Peng, L.; Ding, W.; Ren, Y. Self-supervised spoofing audio detection scheme. In Proceedings of INTERSPEECH 2020, Shanghai, China, 25–29 October 2020; pp. 4223–4227.
44. Solak, I. The M-AILABS Speech Dataset. Available online: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ (accessed on 10 March 2022).
45. Arik, S.O.; Chen, J.; Peng, K.; Ping, W.; Zhou, Y. Neural voice cloning with a few samples. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 2–8 December 2018; p. 11.
46. Yi, J.; Fu, R.; Tao, J.; Nie, S.; Ma, H.; Wang, C.; Wang, T.; Tian, Z.; Bai, Y.; Fan, C. ADD 2022: The first audio deep synthesis detection challenge. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; p. 5.
47. Kinnunen, T.; Sahidullah, M.; Delgado, H.; Todisco, M.; Evans, N.; Yamagishi, J.; Lee, K.A. The 2nd Automatic Speaker Verification Spoofing and Countermeasures Challenge (ASVspoof 2017) Database, Version 2. Available online: https://datashare.ed.ac.uk/handle/10283/3055 (accessed on 5 November 2021).
48. United Nations. Official Languages. Available online: https://www.un.org/en/our-work/official-languages (accessed on 5 March 2022).
49. Almeman, K.; Lee, M. A comparison of Arabic speech recognition for multi-dialect vs. specific dialects. In Proceedings of the Seventh International Conference on Speech Technology and Human-Computer Dialogue (SpeD 2013), Cluj-Napoca, Romania, 16–19 October 2013.
50. Elgibreen, H.; Faisal, M.; Al Sulaiman, M.; Abdou, S.; Mekhtiche, M.A.; Moussa, A.M.; Alohali, Y.A.; Abdul, W.; Muhammad, G.; Rashwan, M.; et al. An incremental approach to corpus design and construction: Application to a large contemporary Saudi corpus. IEEE Access 2021, 9, 88405–88428.
51. Asif, A.; Mukhtar, H.; Alqadheeb, F.; Ahmad, H.F.; Alhumam, A. An approach for pronunciation classification of Classical Arabic phonemes using deep learning. Appl. Sci. 2022, 12, 238.
52. Ibrahim, A.B.; Seddiq, Y.M.; Meftah, A.H.; Alghamdi, M.; Selouani, S.-A.; Qamhan, M.A.; Alotaibi, Y.A.; Alshebeili, S.A. Optimizing Arabic speech distinctive phonetic features and phoneme recognition using genetic algorithm. IEEE Access 2020, 8, 200395–200411.
53. Maw, M.; Balakrishnan, V.; Rana, O.; Ravana, S.D. Trends and patterns of text classification techniques: A systematic mapping study. Malays. J. Comput. Sci. 2020, 33, 102–117.
54. Rizwan, M.; Odelowo, B.O.; Anderson, D.V. Word based dialect classification using extreme learning machines. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2625–2629.
55. Najafian, M. Modeling accents for automatic speech recognition. In Proceedings of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015.
56. Liu, X.; Zhang, F.; Hou, Z.; Mian, L.; Wang, Z.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng. 2021.
57. Jain, D.; Beniwal, D.P. Review paper on noise cancellation using adaptive filters. Int. J. Eng. Res. Technol. 2022, 11, 241–244.
| Year | Ref. | Speech Language | Fakeness Type | Technique | Audio Features Used | Dataset | Drawbacks |
|---|---|---|---|---|---|---|---|
| 2018 | Yu et al. [29] | English | Synthetic | DNN-HLL | MFCC, LFCC, CQCC | ASVspoof 2015 [30] | The error rate is zero, indicating that the proposed DNN overfits. |
| | | | | GMM-LLR | IMFCC, GFCC, IGFCC | | Does not carry much artifact information from the feature-representation perspective. |
| 2019 | Alzantot et al. [40] | English | Synthetic | Residual CNN | MFCC, CQCC, STFT | ASVspoof 2019 [19] | The model strongly overfits the synthetic data and cannot be generalized to unknown attacks. |
| 2019 | Lai et al. [42] | English | Synthetic | ASSERT (SENet + ResNet) | Logspec, CQCC | ASVspoof 2019 [19] | The model strongly overfits the synthetic data. |
| 2020 | Aravind et al. [36] | English | Synthetic | ResNet-34 | Spectrogram | ASVspoof 2019 [19] | Requires transforming the input into a 2-D feature map before detection, which increases training time and affects speed. |
| 2020 | Lataifeh et al. [23] | Classical Arabic | Imitation | Classical classifiers (SVM-Linear, SVM-RBF, LR, DT, RF, XGBoost) | - | Arabic Diversified Audio (AR-DAD) [24] | Failed to capture spurious correlations; features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| | | | | DL classifiers (CNN, BiLSTM) | MFCC spectrogram | | DL accuracy was not as good as that of the classical methods, and the image-based approach requires a special transformation of the data. |
| 2020 | Rodríguez-Ortega et al. [3] | Spanish, English, Portuguese, French, and Tagalog | Imitation | LR | Time-domain waveform | H-Voice [16] | Failed to capture spurious correlations; features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| 2020 | Wang et al. [31] | English, Chinese | Synthetic | DeepSonar | High-dimensional visualizations of MFCC, raw neurons, and activated neurons | FoR dataset [28] | Highly affected by real-world noise. |
| 2020 | Subramani and Rao [21] | English | Synthetic | EfficientCNN and RES-EfficientCNN | Spectrogram | ASVspoof 2019 [19] | Image-based approach that requires a special transformation of audio files into images. |
| 2020 | Shan and Tsai [35] | English | Synthetic | Bidirectional LSTM | MFCC | -- | The method did not perform well on long (5 s) edits. |
| 2020 | Wijethunga et al. [32] | English | Synthetic | DNN | MFCC, Mel-spectrogram, STFT | UrbanSound8K, Conversational, AMI Corpus, and FoR | The proposed model does not carry much artifact information from the feature-representation perspective. |
| 2020 | Jiang et al. [43] | English | Synthetic | SSAD | LPS, LFCC, CQCC | ASVspoof 2019 [19] | Needs extensive computation, since it uses a temporal convolutional network (TCN) to capture context features plus three regression workers and one binary worker to predict the target features. |
| 2020 | Chintha et al. [33] | English | Synthetic | CRNN-Spoof | CQCC | ASVspoof 2019 [19] | The proposed model is complex and contains many layers and convolutional networks, so it needs extensive computation. |
| | | | | WIRE-Net-Spoof | MFCC | | Did not perform well compared to CRNN-Spoof. |
| 2020 | Singh and Singh [17] | English | Synthetic | Q-SVM | MFCC, Mel-spectrogram | -- | Features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| 2020 | Lei et al. [25] | English | Synthetic | CNN and Siamese CNN | CQCC, LFCC | ASVspoof 2019 [19] | The models are not robust to different features and work best with LFCC only. |
| 2021 | Ballesteros et al. [5] | Spanish, English, Portuguese, French, and Tagalog | Synthetic, Imitation | Deep4SNet | Histogram, spectrogram, time-domain waveform | H-Voice [16] | The model is not scalable and is affected by the data-transformation process. |
| 2021 | Bartusiak and Delp [22] | English | Synthetic | CNN | Spectrogram | ASVspoof 2019 [19] | Image-based approach requiring a special transformation of the data; the model also failed to correctly classify new audio signals, indicating that it is not general enough. |
| 2021 | Borrelli et al. [18] | English | Synthetic | RF, SVM | STLT | ASVspoof 2019 [19] | Features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| 2021 | Khalid et al. [38] | English | Synthetic | MesoInception-4, Meso-4, Xception, EfficientNet-B0, VGG16 | Three-channel image of MFCC | FakeAVCeleb [39] | Meso-4 overfits the real class and MesoInception-4 overfits the fake class; none of the methods performed satisfactorily, indicating that they are not suitable for fake-audio detection. |
| 2021 | Khochare et al. [37] | English | Synthetic | Feature-based (SVM, RF, KNN, XGBoost, and LGBM) | Vector of 37 audio features | FoR dataset [28] | Features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| | | | | Image-based (CNN, TCN, STN) | Mel-spectrogram | | Image-based approach that could not work with inputs converted to STFT or MFCC features. |
| 2021 | Liu et al. [20] | Chinese | Synthetic | SVM | MFCC | -- | Features are extracted manually, so the approach is not scalable and needs extensive manual labor to prepare the data. |
| | | | | CNN | -- | | The error rate is zero, indicating that the proposed CNN overfits. |
| 2021 | Camacho et al. [27] | English | Synthetic | CNN | Scatter plots | FoR dataset [28] | Did not perform as well as traditional DL methods, and the model needed more training. |
| 2021 | Arif et al. [41] | English | Synthetic, Imitation | DBiLSTM | ELTP-LFCC | ASVspoof 2019 [19] | Does not perform well on imitation-based data. |
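Most of the methods summarized above operate not on raw waveforms but on hand-crafted time-frequency front ends such as MFCC, log spectrograms, or constant-Q-based features, which is also where much of the preprocessing burden discussed in Section 6.3 originates. The snippet below is a minimal feature-extraction sketch, assuming the librosa library and a hypothetical input file `sample.wav`; it illustrates the front ends named in the table, not the pipeline of any particular study.

```python
# Minimal sketch of the hand-crafted front ends used by the surveyed detectors.
# Assumes librosa is installed; "sample.wav" is a hypothetical input file.
import numpy as np
import librosa

audio, sr = librosa.load("sample.wav", sr=16000)  # mono, resampled to 16 kHz

# MFCC: the most common hand-crafted feature in the table above.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)

# Log power spectrogram: the "Spectrogram"/"Logspec" input of image-based models.
stft = librosa.stft(audio, n_fft=512, hop_length=160)
logspec = librosa.power_to_db(np.abs(stft) ** 2)

# Constant-Q transform: the representation from which CQCC-style features derive.
cqt = np.abs(librosa.cqt(audio, sr=sr))

print(mfcc.shape, logspec.shape, cqt.shape)  # each is (bins, frames)
```

Each output is a 2-D (bins × frames) array, which is why so many drawbacks in the table concern image-style transformations: these detectors consume the arrays much as a vision CNN consumes an image.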
| Year | Dataset | Total Size | Real Samples | Fake Samples | Sample Length (s) | Fakeness Type | Format | Speech Language | Accessibility | Dataset URL |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | The M-AILABS Speech [44] | 18.7 h | 9265 | 806 | 1–20 | Synthetic | WAV | German | Public | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ (accessed on 3 March 2022) |
| 2018 | Baidu Silicon Valley AI Lab cloned audio [45] | 6 h | 10 | 120 | 2 | Synthetic | MP3 | English | Public | https://audiodemos.github.io/ (accessed on 3 March 2022) |
| 2019 | Fake or Real (FoR) [28] | 198,000 files | 111,000 | 87,000 | 2 | Synthetic | MP3, WAV | English | Public | https://bil.eecs.yorku.ca/datasets/ (accessed on 20 November 2021) |
| 2020 | AR-DAD: Arabic Diversified Audio [24] | 16,209 files | 15,810 | 397 | 10 | Imitation | WAV | Classical Arabic | Public | https://data.mendeley.com/datasets/3kndp5vs6b/3 (accessed on 20 November 2021) |
| 2020 | H-Voice [16] | 6672 files | Imitation: 3332; Synthetic: 4 | Imitation: 3264; Synthetic: 72 | 2–10 | Imitation, Synthetic | PNG | Spanish, English, Portuguese, French, and Tagalog | Public | https://data.mendeley.com/datasets/k47yd3m28w/4 (accessed on 20 November 2021) |
| 2021 | ASVspoof 2021 Challenge | - | - | - | 2 | Synthetic | MP3 | English | Only older versions available thus far | https://datashare.ed.ac.uk/handle/10283/3336 (accessed on 20 November 2021) |
| 2021 | FakeAVCeleb [39] | 20,490 files | 490 | 20,000 | 7 | Synthetic | MP3 | English | Restricted | https://sites.google.com/view/fakeavcelebdash-lab/ (accessed on 20 November 2021) |
| 2022 | ADD [46] | 85 h | LF: 300; PF: 0 | LF: 700; PF: 1052 | 2–10 | Synthetic | WAV | Chinese | Public | http://addchallenge.cn/ (accessed on 3 May 2022) |
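For detection purposes, each of these corpora reduces to a collection of clips with binary bona fide/fake labels, so an experiment typically begins by indexing files against labels. The sketch below assumes a hypothetical layout with `real/` and `fake/` subdirectories; actual layouts differ per corpus (e.g., FoR and AR-DAD ship WAV audio, whereas H-Voice ships PNG histograms), so the directory names and extensions would need adjusting.

```python
# Sketch: index a deepfake-audio corpus laid out as real/ and fake/ folders.
# The layout and directory names are hypothetical and vary per dataset.
from pathlib import Path

def index_corpus(root: str, extensions=(".wav", ".mp3")):
    """Return (path, label) pairs, with label 1 = bona fide and 0 = fake."""
    samples = []
    for subdir, label in (("real", 1), ("fake", 0)):
        for path in sorted(Path(root, subdir).rglob("*")):
            if path.suffix.lower() in extensions:
                samples.append((path, label))
    return samples

samples = index_corpus("for_dataset")  # hypothetical FoR-style download
print(len(samples), "clips indexed")
```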
| Measure | Dataset | Detection Method | Result (approximate, from the evaluation test published in each study) |
|---|---|---|---|
| EER | ASVspoof 2015 challenge | DNN-HLL [29] | 12.24% |
| | | GMM-LLR [29] | 42.50% |
| | ASVspoof 2019 challenge | Residual CNN [40] | 6.02% |
| | | SENet-34 [42] | 6.70% |
| | | CRNN-Spoof [33] | 4.27% |
| | | ResNet-34 [36] | 5.32% |
| | | Siamese CNN [25] | 8.75% |
| | | CNN [25] | 9.61% |
| | | DBiLSTM [41] (synthetic audio) | 0.74% |
| | | DBiLSTM [41] (imitation-based) | 33.30% |
| | | SSAD [43] | 5.31% |
| | - | Bidirectional LSTM [35] | 0.43% |
| | FoR | CNN [27] | 11.00% |
| | | DeepSonar [31] | 2.10% |
| t-DCF | ASVspoof 2019 challenge | Residual CNN [40] | 0.1569 |
| | | SENet-34 [42] | 0.155 |
| | | CRNN-Spoof [33] | 0.132 |
| | | ResNet-34 [36] | 0.1514 |
| | | Siamese CNN [25] | 0.211 |
| | | CNN [25] | 0.217 |
| | | DBiLSTM [41] (synthetic audio) | 0.008 |
| | | DBiLSTM [41] (imitation-based) | 0.39 |
| Accuracy | ASVspoof 2019 challenge | CNN [22] | 85.99% |
| | | SVM [18] | 71.00% |
| | AR-DAD | CNN [23] | 94.33% |
| | | BiLSTM [23] | 91.00% |
| | | SVM [23] | 99.00% |
| | | DT [23] | 73.33% |
| | | RF [23] | 93.67% |
| | | LR [23] | 98.00% |
| | | XGBoost [23] | 97.67% |
| | | SVM-RBF [23] | 99.00% |
| | | SVM-Linear [23] | 99.00% |
| | FoR | DNN [32] | 94.00% |
| | | DeepSonar [31] | 98.10% |
| | | STN [37] | 80.00% |
| | | TCN [37] | 92.00% |
| | | SVM [37] | 67.00% |
| | | RF [37] | 62.00% |
| | | KNN [37] | 62.00% |
| | | XGBoost [37] | 59.00% |
| | | LGBM [37] | 60.00% |
| | | CNN [27] | 88.00% |
| | FakeAVCeleb | EfficientNet-B0 [38] | 50.00% |
| | | Xception [38] | 76.00% |
| | | MesoInception-4 [38] | 53.96% |
| | | Meso-4 [38] | 50.36% |
| | | VGG16 [38] | 67.14% |
| | H-Voice | LR [3] | 98.00% |
| | | Deep4SNet [5] | 98.50% |
| | - | Q-SVM [17] | 97.56% |
| | - | CNN [20] | 99.00% |
| | - | SVM [20] | 99.00% |
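EER is the operating point at which the false-acceptance rate (spoofed audio accepted) equals the false-rejection rate (bona fide audio rejected), so lower is better, while t-DCF [34] additionally weights countermeasure errors by their cost to a tandem speaker-verification system. As a worked illustration of the EER figures above, the sketch below computes EER from detector scores; the scores here are synthetic placeholders, not outputs of any surveyed model.

```python
# Sketch: equal-error rate (EER) from detector scores (higher = more bona fide).
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Find the threshold where false-rejection and false-acceptance rates meet."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])  # bona fide rejected
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])   # spoofs accepted
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2  # average the two rates at the crossing

# Placeholder scores: bona fide centred at 2, spoofs at 0.
rng = np.random.default_rng(0)
print(f"EER: {compute_eer(rng.normal(2, 1, 1000), rng.normal(0, 1, 1000)):.2%}")
```

Because the two rates rarely intersect exactly at a sampled threshold, averaging FRR and FAR at the nearest crossing is a common convention for reporting a single EER value.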
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).