Audio Deep Fake Detection with Sonic Sleuth Model
Abstract
1. Introduction
2. Background
2.1. Deepfake Generation
2.2. Time Domain and Frequency Domain
2.3. Literature Review
Research Gap and Main Contributions
- Feature learning: Unlike traditional techniques, DL models can autonomously learn features from raw audio data, improving their adaptability to new and unseen variations in audio signals.
- Scalability: Deep learning models demonstrate strong scalability, maintaining or even enhancing performance as they are trained on increasingly large datasets.
3. Approach
3.1. Datasets
- ASVspoof2019: ASVspoof (Automatic Speaker Verification Spoofing and Countermeasures) is an international challenge focusing on spoofing detection in automatic speaker verification systems [14]. The ASVspoof2019 dataset contains three sub-datasets: Logical Access, Physical Access, and Speech Deepfake. Each was created using techniques suited to its task, such as text-to-speech (TTS) and voice conversion (VC) algorithms [14]. We utilize the train subset of the Logical Access dataset in our experiment.
- The Logical Access (LA) subset, which we used, includes synthetic speech created using state-of-the-art TTS technologies. This dataset ensures that our model is exposed to sophisticated spoofing attacks.
- ASVspoof2019 provides 2580 real audio files and 22,800 fake audio files, with an average length of 3 s per file.
- In-the-Wild is a dataset consisting of real and deepfake audio sourced from publicly available recordings of 58 politicians and celebrities. This dataset spans a total of 20.8 h of real audio and 17.5 h of fake audio, ensuring a broad representation of different speaking styles, tones, and environments [10].
- Each speaker has an average of 23 min of real audio and 18 min of fake audio, with an average length of 4.3 s per clip. The audio was sourced from diverse environments, including media interviews, public speeches, and social media clips, which introduce realistic variances such as background noise, variable quality, and different accents.
- The fake audio in this dataset, collected from social media, was generated using a variety of techniques and mimics public figures in both scripted and spontaneous settings. The dataset simulates real-world challenges by providing audio recorded in uncontrolled conditions, thus testing the model’s ability to generalize across diverse and noisy environments.
- FakeAVCeleb contains both deepfake and real audio and video of celebrities. The fake audio was created using a text-to-speech service followed by manipulation with a voice cloning tool to mimic the celebrity’s voice [15]. We extracted the audio as WAV files from each video.
- The dataset includes 10,209 real audio files and 11,357 fake audio files, extracted from videos with an average length of 5 s per file.
- FakeAVCeleb mimics the challenges of detecting AI-generated content from popular sources, such as social media, where fake media can spread rapidly. This dataset contributes to testing the model’s robustness in detecting manipulated audio in entertainment and digital media contexts.
- The Fake-or-Real (FoR) dataset comprises 111,000 files of real speech and 87,000 files of fake speech, in both MP3 and WAV formats, and is offered in four distinct versions to suit various needs. The ‘for-original’ files are unmodified; ‘for-norm’ has been normalized; ‘for-2sec’ truncates clips to 2 s; and ‘for-rerec’ simulates re-recorded data, depicting a deepfake delivered over a phone call [16].
- The dataset offers four distinct versions, including a normalized version and a re-recorded version, which simulates a phone call scenario. The diversity in data formats and audio lengths makes this dataset ideal for testing model performance in a variety of real-world applications.
- The FoR data closely simulate common deepfake usage, such as telephone fraud or manipulated voice recordings in conversational settings, adding another layer of complexity to the evaluation process.
- Diversity in generation techniques: The datasets include audio generated using various deepfake technologies such as TTS, VC, and cloning techniques. This diversity ensures the model is exposed to the range of methods used to generate synthetic audio, making it more effective in real-world applications.
- Generalization across environments: Datasets like In-the-Wild simulate real-world conditions, including background noise, different recording devices, and varied speaking environments, which are common in public audio recordings. This enhances the model’s ability to detect deepfakes in uncontrolled settings, where real-world challenges such as poor audio quality or overlapping voices exist.
- Realistic data representation: The inclusion of datasets such as ASVspoof2019 and Fake-or-Real ensures that our model is exposed to both high-quality and noisy audio, from both controlled experiments and re-recorded settings, thus improving its robustness. This combination allows the model to detect deepfakes in situations like social media voice messages, fraudulent phone calls, and doctored recordings.
- Variety of speakers and content: With a wide range of speakers, accents, languages, and contexts (e.g., political speeches, interviews), the datasets ensure the model is not biased towards a specific type of speaker or content, but can generalize across different contexts, which is essential for real-world detection.
3.2. Data Preprocessing
- First, any silence in the audio was trimmed to enable the model to focus on speech parts of the audio.
- Second, all audio clips were standardized to a length of 4 s: longer clips were trimmed, and shorter clips were padded by repeating the audio rather than inserting silence, which could bias the model’s learning.
- Third, the audio data were downsampled to a rate of 16 kHz. Downsampling reduces the data size and computational load without significantly compromising the audio signal’s quality or essential characteristics.
- Lastly, the audio was converted to a mono channel using channel averaging. Stereo audio, which uses two channels (left and right), is averaged into a single channel, resulting in a mono audio signal.
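The four steps above can be sketched in Python with NumPy and SciPy. This is a minimal illustration, not the paper’s implementation: the silence threshold (0.01 amplitude) and the step ordering (mono first, so a single threshold applies) are assumptions.

```python
import numpy as np
from math import gcd
from scipy.signal import resample_poly

TARGET_SR = 16_000          # downsampling target (16 kHz)
TARGET_LEN = 4 * TARGET_SR  # standardized 4 s clip length

def preprocess(y, sr, silence_thresh=0.01):
    """Trim silence, standardize to 4 s, downsample to 16 kHz, mix to mono.
    The 0.01 amplitude threshold is an illustrative assumption."""
    y = np.asarray(y, dtype=np.float32)
    # Mono channel via channel averaging ([channels, samples] -> [samples]).
    if y.ndim > 1:
        y = y.mean(axis=0)
    # Trim leading/trailing silence below the amplitude threshold.
    voiced = np.flatnonzero(np.abs(y) > silence_thresh)
    if voiced.size:
        y = y[voiced[0]:voiced[-1] + 1]
    # Downsample with a polyphase resampler.
    if sr != TARGET_SR:
        g = gcd(TARGET_SR, sr)
        y = resample_poly(y, TARGET_SR // g, sr // g)
    # Standardize length: repeat short clips instead of zero-padding.
    reps = int(np.ceil(TARGET_LEN / len(y)))
    return np.tile(y, reps)[:TARGET_LEN]
```

Repeating the clip (rather than zero-padding) keeps the padded region statistically similar to speech, which matches the rationale given above for avoiding silence.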
3.3. Feature Extraction
3.3.1. Short-Time Power Spectrum
3.3.2. Constant-Q Transform
3.4. Sonic Sleuth Structure and Training Details
- Early stopping: An early stopping technique was implemented with a patience of 10 epochs to prevent overfitting. This means the training process would be stopped if the validation loss did not improve for 10 consecutive epochs.
- Optimizer: The Adam optimizer was used for updating the model’s weights during training.
- Loss function: Binary cross-entropy loss was used as the loss function, which is suitable for binary classification problems.
- Class weights: Class weights were calculated and applied during the training process to handle the class imbalance in the dataset. Class weights adjust the importance of each class during the loss calculation, helping the model learn from imbalanced datasets more effectively while avoiding bias.
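The three training controls above can be sketched framework-agnostically in plain NumPy. The inverse-frequency (“balanced”) heuristic for the class weights is an assumption about how the weights were computed; the patience of 10 matches the text.

```python
import numpy as np

def balanced_class_weights(labels):
    """Inverse-frequency weights: w_c = n_samples / (n_classes * n_c)."""
    labels = np.asarray(labels)
    return {c: len(labels) / (2 * np.sum(labels == c)) for c in (0, 1)}

def weighted_bce(y_true, y_pred, weights, eps=1e-7):
    """Binary cross-entropy with each sample's term scaled by its class weight."""
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y_true = np.asarray(y_true, dtype=float)
    w = np.where(y_true == 1, weights[1], weights[0])
    return float(np.mean(-w * (y_true * np.log(y_pred)
                               + (1 - y_true) * np.log(1 - y_pred))))

class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> halt training
```

Scaling each sample’s loss term by its class weight makes the minority class (here, real audio in ASVspoof2019) contribute as much to the gradient as the majority class.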
3.5. Evaluation Metrics
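The headline metric in the results tables is the equal error rate (EER). A minimal NumPy sketch computes the point on the discrete ROC curve where the false positive and false negative rates are closest:

```python
import numpy as np

def equal_error_rate(labels, scores):
    """EER: the operating point where the false positive rate equals the
    false negative rate. labels: 1 = fake, 0 = real; higher score = fake."""
    labels = np.asarray(labels)[np.argsort(scores)[::-1]]  # sweep high -> low
    tpr = np.cumsum(labels) / labels.sum()                 # true positive rate
    fpr = np.cumsum(1 - labels) / (1 - labels).sum()       # false alarm rate
    fnr = 1 - tpr                                          # miss rate
    i = np.argmin(np.abs(fnr - fpr))
    return float((fnr[i] + fpr[i]) / 2)
```

Read this way, an EER of 0.0160 means that at the equal-error threshold, roughly 1.6% of real clips are flagged as fake and 1.6% of fakes pass as real.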
4. Results and Discussion
4.1. Test on External Dataset
4.2. Ensemble Approach
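Combining the LFCC and CQT models can be sketched as score-level fusion. Whether the paper fuses scores or hard decisions is not stated in this section, so simple probability averaging is an assumption:

```python
import numpy as np

def ensemble_scores(model_scores, weights=None):
    """Average per-model fake-probabilities into one fused score per clip.
    model_scores: array-like of shape [n_models, n_samples]."""
    scores = np.asarray(model_scores, dtype=float)
    return np.average(scores, axis=0, weights=weights)

def ensemble_predict(model_scores, threshold=0.5):
    """Fused binary decision: 1 = fake when the averaged score crosses threshold."""
    return (ensemble_scores(model_scores) >= threshold).astype(int)
```

Averaging scores before thresholding lets one model’s confident correct score outvote another’s marginal error, which is consistent with the fused EER landing below the weaker single-feature model’s EER.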
4.3. Model Performance Compared with Recent Deepfake Detection Methods
- Xception model performance: The Xception model demonstrated the best performance on the FakeAVCeleb dataset, achieving an EER of 0.2472 and an accuracy of 73.06%. This model’s strong performance indicates its robustness in detecting audio deepfakes. However, further refinement is needed for broader applicability across datasets.
- MesoInception insights: For the In-the-Wild dataset, the MesoInception model showed an EER of 0.37414 when evaluated with log-spectrogram features. The results indicate that longer audio inputs enhance detection capabilities. However, the model faced generalization challenges, suggesting that performance may vary in real-world scenarios.
- CQCC-ResNet analysis: The CQCC-ResNet model, tested on the ASVspoof2019 dataset, achieved an EER of 0.0769 on unknown attacks. This highlights the model’s effectiveness in recognizing familiar patterns but also its struggle with novel spoofing techniques, suggesting the need for better generalization capabilities.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AI | Artificial intelligence |
ASV | Automatic speaker verification |
CQT | Constant-Q transform |
CNN | Convolutional neural network |
DL | Deep learning |
DNN | Deep neural network |
EER | Equal error rate |
FNR | False negative rate |
FPR | False positive rate |
GAN | Generative adversarial network |
LFCCs | Linear frequency cepstral coefficients |
MFCCs | Mel-frequency cepstral coefficients |
ROC | Receiver operating characteristic |
TTS | Text-to-speech |
VC | Voice conversion |
References
- Oh, S.; Kang, M.; Moon, H.; Choi, K.; Chon, B.S. A demand-driven perspective on generative audio AI. arXiv 2023, arXiv:2307.04292. [Google Scholar]
- Deepfake (a portmanteau of “deep learning” and “fake”): images, videos, or audio edited or generated using artificial intelligence tools. Wikipedia, Synthetic Media. 2023. Available online: https://en.wikipedia.org/wiki/Deepfake (accessed on 4 May 2020).
- Gu, Y.; Chen, Q.; Liu, K.; Xie, L.; Kang, C. GAN-based Model for Residential Load Generation Considering Typical Consumption Patterns. In Proceedings of the ISGT 2019, Washington, DC, USA, 18–21 February 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar] [CrossRef]
- Camastra, F.; Vinciarelli, A. Machine Learning for Audio, Image and Video Analysis: Theory and Applications; Springer: London, UK, 2015. [Google Scholar]
- Tenoudji, F.C. Analog and Digital Signal Analysis: From Basics to Applications; Springer International Publishing: Cham, Switzerland, 2018. [Google Scholar]
- Natsiou, A.; O’Leary, S. Audio Representations for Deep Learning in Sound Synthesis: A Review. arXiv 2022, arXiv:2201.02490. [Google Scholar]
- Marcus, G. The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence. arXiv 2020, arXiv:2002.06177. [Google Scholar]
- Frank, J.; Schönherr, L. WaveFake: A Data Set to Facilitate Audio Deepfake Detection. arXiv 2021, arXiv:2111.02813. [Google Scholar]
- Kawa, P.; Plata, M.; Syga, P. Attack Agnostic Dataset: Towards Generalization and Stabilization of Audio DeepFake Detection. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022. [Google Scholar] [CrossRef]
- Müller, N.M.; Czempin, P.; Dieckmann, F.; Froghyar, A.; Böttinger, K. Does audio deepfake detection generalize? arXiv 2024, arXiv:2203.16263. [Google Scholar]
- Almutairi, Z.; Elgibreen, H. A Review of Modern Audio Deepfake Detection Methods: Challenges and Future Directions. Algorithms 2022, 15, 155. [Google Scholar] [CrossRef]
- Sun, C.; Jia, S.; Hou, S.; AlBadawy, E.; Lyu, S. Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts. arXiv 2023, arXiv:2302.09198. [Google Scholar]
- Zhang, C.; Zhang, C.; Zheng, S.; Zhang, M.; Qamar, M.; Bae, S.-H.; Kweon, I.S. A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI. arXiv 2023, arXiv:2303.13336. [Google Scholar]
- Wang, X.; Yamagishi, J.; Todisco, M.; Delgado, H.; Nautsch, A.; Evans, N.; Sahidullah, M.; Vestman, V.; Kinnunen, T.; Lee, K.A.; et al. ASVspoof 2019: A Large-Scale Public Database of Synthesized, Converted and Replayed Speech. arXiv 2020, arXiv:1911.01601. [Google Scholar] [CrossRef]
- Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A Novel Audio-Video Multimodal Deepfake Dataset. In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021. Available online: https://openreview.net/forum?id=TAXFsg6ZaOl (accessed on 29 September 2023).
- Abdeldayem, M. The Fake-or-Real Dataset. Kaggle Dataset. 2022. Available online: https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset (accessed on 28 May 2024).
- Sahidullah, M.; Kinnunen, T.; Hanilçi, C. A Comparison of Features for Synthetic Speech Detection. Interspeech 2015, 2015, 2087–2091. [Google Scholar] [CrossRef]
- Zheng, F.; Zhang, G. Integrating the energy information into MFCC. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP 2000), Beijing, China, 16–20 October 2000; Volume 1, pp. 389–392. [Google Scholar] [CrossRef]
- Todisco, M.; Delgado, H.; Evans, N. Constant Q Cepstral Coefficients: A Spoofing Countermeasure for Automatic Speaker Verification. Comput. Speech Lang. 2017, 45, 516–535. [Google Scholar] [CrossRef]
- Khalid, H.; Kim, M.; Tariq, S.; Woo, S.S. Evaluation of an audio-video multimodal deepfake dataset using unimodal and multimodal detectors. In Proceedings of the 1st Workshop on Synthetic Multimedia-Audiovisual Deepfake Generation and Detection, Virtual Event, 24 October 2021; pp. 7–15. [Google Scholar]
- Alzantot, M.; Wang, Z.; Srivastava, M.B. Deep residual neural networks for audio spoofing detection. arXiv 2019, arXiv:1907.00501. [Google Scholar]
Paper Title | Main Idea | Model | Language | Fakeness Type | Dataset |
---|---|---|---|---|---|
Arabic Audio Clips: Identification and Discrimination of Authentic Cantillations from Imitations | 1—Compare classifiers’ performance. 2—Compare recognition with human experts. 3—Authentic reciter identification. | 1—Classic 2—Deep learning | Arabic | Imitation | Arabic Diversified Audio |
A Compressed Synthetic Speech Detection Method with Compression Feature Embedding | 1—Detect synthetic speech in compressed formats. 2—Multi-branch residual network. 3—Evaluate method on ASVspoof. 4—Compare with state of the art. | Deep learning (DNN) | Not specified | Synthetic | ASVspoof |
Learning Efficient Representations for Fake Speech Detection | 1—Develop efficient fake speech detection models. 2—Ensure accuracy. 3—Adapt to new forms and sources of fake speech. | 1—Machine learning (CNN, RNN, SVM, KNN) 2—Deep learning (CNNs, RNNs) | English | Imitation, Synthetic | ASVSpoof, RTVCSpoof |
WaveFake: A Dataset for Audio Deepfake Detection | 1—Create a dataset of fake audio. 2—Provide baseline models for detection. 3—Test generalization on data from different techniques. 4—Test on real-life scenarios. | Fake audio synthesizing: MelGAN, Parallel WaveGAN, Multi-band MelGAN, Full-band MelGAN, HiFi-GAN, WaveGlow Baseline detection models: Gaussian Mixture Model (GMM), RawNet2 | English, Japanese | Synthetic | WaveFake |
Attack Agnostic Dataset: Generalization and Stabilization of Audio Deepfake Detection | 1—Analyze and compare model generalization. 2—Combine multiple datasets. 3—Train on inaudible features. 4—Monitor training for stability. | LCNN, XceptionNet, MesoInception, RawNet2, and GMM | English | Synthetic, Imitation | WaveFake, FakeAVCeleb, ASVspoof |
Exposing AI-Synthesized Human Voices Using Neural Vocoder Artifacts | 1—Utilize the identification of vocoders’ artifacts in audio to detect deepfakes. 2—Create a dataset of fake audio using six vocoders. | Detection model: RawNet2 Dataset construction: WaveNet, WaveRNN, Mel-GAN, Parallel WaveGAN, WaveGrad, and DiffWave | English | Synthetic | LibriVoc, WaveFake, and ASVspoof2019 |
Name | Real | Fake | Length | Sample Rate | File Format | URL
---|---|---|---|---|---|---
ASVspoof2019 | 2580 files | 22,800 files | Avg. 3 s | - | flac | https://datashare.ed.ac.uk/handle/10283/3336 (accessed on 29 September 2023) |
In-the-Wild | 19,963 files | 11,816 files | Avg. 4.3 s | 16 kHz | WAV | https://deepfake-demo.aisec.fraunhofer.de/in_the_wild (accessed on 29 September 2023)
FakeAVCeleb | 10,209 files | 11,357 files | Avg. 5 s | - | MP4 | https://sites.google.com/view/fakeavcelebdash-lab/home?authuser=0 (accessed on 29 September 2023) |
Fake-or-Real | 111,000 files | 87,000 files | 1–20 s | - | WAV/MP3 | https://www.kaggle.com/datasets/mohammedabdeldayem/the-fake-or-real-dataset (accessed on 29 September 2023) |
Feature | EER | Accuracy | F1 |
---|---|---|---|
CQT | 0.0757 | 94.15% | 95.76% |
LFCC | 0.0160 | 98.27% | 98.65% |
MFCC | 0.0185 | 98.04% | 98.45% |
Feature | EER | Accuracy | F1 |
---|---|---|---|
CQT | 0.0942 | 82.51% | 83.19% |
LFCC | 0.1718 | 75.28% | 72.63% |
MFCC | 0.3165 | 61.24% | 54.94% |
Feature | EER | Accuracy | F1 |
---|---|---|---|
CQT and LFCC | 0.0851 | 84.92% | 84.73% |
Dataset | Model | Features | EER | Accuracy |
---|---|---|---|---|
FakeAVCeleb | Xception [20] | MFCC | 0.2472 | 73.06% |
In-the-Wild | MesoInception [10] | Log-spectrogram | 0.37414 | - |
ASVspoof2019 | CQCC-ResNet [21] | CQCC | 0.0769 | - |
Combined | Sonic Sleuth | CQT | 0.0942 | 82.51%
Combined | Sonic Sleuth | LFCC | 0.1718 | 75.28%
Combined | Sonic Sleuth | MFCC | 0.3165 | 61.24%
Feature Extraction | Dataset | EER | Accuracy |
---|---|---|---|
LFCC | Training | 0.0160 | 98.27% |
CQT | External | 0.0942 | 82.51% |
LFCC + CQT (Ensemble) | External | 0.0851 | 84.92% |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Alshehri, A.; Almalki, D.; Alharbi, E.; Albaradei, S. Audio Deep Fake Detection with Sonic Sleuth Model. Computers 2024, 13, 256. https://doi.org/10.3390/computers13100256