CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models
Abstract
1. Introduction
- We propose a novel CycleDiffusion model that can learn both the reconstruction and conversion paths.
- We provide an efficient training algorithm so that the proposed CycleDiffusion model can be reliably trained.
- The proposed CycleDiffusion model is applied to voice conversion without parallel training data.
- We demonstrate the usefulness of the proposed method using VCTK data.
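The cycle-consistency idea behind these contributions (convert speech toward a target speaker and back, then penalize mismatch with the original input) can be sketched with a toy example. The linear "converters" below are hypothetical stand-ins chosen so the cycle is exact; the paper's actual conversion and reconstruction paths are realized with diffusion models, not fixed functions:

```python
import numpy as np

# Toy illustration of a cycle-consistency loss. The two "converters"
# are hypothetical placeholders, not the paper's score-based networks.
rng = np.random.default_rng(0)

def convert_ab(x):
    return 2.0 * x   # stand-in for conversion: speaker A -> speaker B

def convert_ba(x):
    return 0.5 * x   # stand-in for conversion: speaker B -> speaker A

x = rng.standard_normal(80)              # fake speech feature vector
cycle = convert_ba(convert_ab(x))        # A -> B -> A reconstruction path
cycle_loss = np.mean(np.abs(cycle - x))  # L1 cycle-consistency loss
print(f"cycle loss: {cycle_loss:.6f}")
```

Training drives this loss toward zero so that the conversion path preserves everything (e.g., linguistic content) except the attributes being converted.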
2. Related Works
2.1. Diffusion Model-Based Voice Conversion
2.2. CycleGAN-Based Voice Conversion
3. Proposed Method
3.1. Cycle-Consistent Diffusion (CycleDiffusion)
3.2. Training Algorithm
3.3. Comparison to Similar Works
4. Experiments
4.1. Objective Evaluation
4.1.1. Speaker Similarity Test
4.1.2. Linguistic Information Preservation Test
4.1.3. Mel-Cepstral Distance Test
4.2. Subjective Evaluation
4.3. Spectrograms
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Nomenclature
Symbol | Definition |
---|---|
 | Input speech feature vector |
 | Forward- and reverse-time stochastic processes at time t |
 | Forward- and reverse-time Wiener processes |
 | Speaker indices |
 | Neural network approximating a score function |
 | Result of a forward diffusion process |
 | Solution of a reverse SDE |
 | GAN generator |
 | Learning rate |
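The SDE-related symbols above follow the score-based generative modeling framework of Song et al. (cited in the references). For reference, the standard forward- and reverse-time SDE pair in that framework is (the paper's exact drift and diffusion choices may differ):

```latex
% Forward-time SDE (data -> noise), with drift f, diffusion g, Wiener process W_t:
dX_t = f(X_t, t)\,dt + g(t)\,dW_t
% Reverse-time SDE (noise -> data), using the score \nabla_x \log p_t(x)
% and a reverse-time Wiener process \bar{W}_t:
dX_t = \left[ f(X_t, t) - g(t)^2 \,\nabla_x \log p_t(X_t) \right] dt + g(t)\,d\bar{W}_t
```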
References
- Mohammadi, S.H.; Kain, A. An overview of voice conversion systems. Speech Commun. 2017, 88, 65–82. [Google Scholar] [CrossRef]
- Sisman, B.; Yamagishi, J.; King, S.; Li, H. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157. [Google Scholar] [CrossRef]
- Kingma, D.; Welling, M. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Hsu, C.; Hwang, H.; Wu, Y.; Tsao, Y.; Wang, H. Voice conversion from non-parallel corpora using variational autoencoder. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Siem Reap, Cambodia, 9–12 December 2016; pp. 1–6. [Google Scholar]
- Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1432–1443. [Google Scholar] [CrossRef]
- Yook, D.; Leem, S.; Lee, K.; Yoo, I. Many-to-many voice conversion using cycle consistent variational autoencoder with multiple decoders. In Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 215–221. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
- Zhu, J.; Park, T.; Isola, P.; Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
- Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the European Signal Processing Conference, Rome, Italy, 3–7 September 2018; pp. 2100–2104. [Google Scholar] [CrossRef]
- Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6820–6824. [Google Scholar] [CrossRef]
- Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2017–2021. [Google Scholar] [CrossRef]
- Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. MaskCycleGAN-VC: Learning non-parallel voice conversion with filling in frames. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 5919–5923. [Google Scholar] [CrossRef]
- Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. In Proceedings of the IEEE Spoken Language Technology Workshop, Athens, Greece, 18–21 December 2018; pp. 266–273. [Google Scholar] [CrossRef]
- Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 679–683. [Google Scholar] [CrossRef]
- Lee, S.; Ko, B.; Lee, K.; Yoo, I.; Yook, D. Many-to-many voice conversion using conditional cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6279–6283. [Google Scholar] [CrossRef]
- Jeong, C.; Chang, H.; Yoo, I.; Yook, D. wav2wav: Wave-to-wave voice conversion. Appl. Sci. 2024, 14, 4251. [Google Scholar] [CrossRef]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
- Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
- Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 2011, 23, 1661–1674. [Google Scholar] [CrossRef] [PubMed]
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Song, Y.; Sohl-Dickstein, J.; Kingma, D.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, New Orleans, LA, USA, 21–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
- Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
- Popov, V.; Vovk, I.; Gogoryan, V.; Sadekova, T.; Kudinov, M.; Wei, J. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Zhang, X.; Wang, J.; Cheng, N.; Xiao, J. Voice conversion with denoising diffusion probabilistic GAN models. In Proceedings of the International Conference on Advanced Data Mining and Applications, Shenyang, China, 21–23 August 2023; pp. 154–167. [Google Scholar] [CrossRef]
- Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N.; Seki, S. VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2213–2226. [Google Scholar] [CrossRef]
- Zhang, J.; Rimchala, J.; Mouatadid, L.; Das, K.; Kumar, S. DECDM: Document enhancement using cycle-consistent diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 8036–8045. [Google Scholar] [CrossRef]
- Xu, S.; Ma, Z.; Huang, Y.; Lee, H.; Chai, J. CycleNet: Rethinking cycle consistency in text-guided diffusion for image manipulation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 10359–10384. [Google Scholar]
- CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit. Available online: https://datashare.ed.ac.uk/handle/10283/3443 (accessed on 2 October 2024).
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17022–17033. [Google Scholar]
- Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
- Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
- Kaldi. Available online: https://kaldi-asr.org (accessed on 2 October 2024).
- Radford, A.; Kim, J.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
- Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 125–128. [Google Scholar] [CrossRef]
- Lo, C.-C.; Fu, S.-W.; Huang, W.-C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.-M. MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 1541–1545. [Google Scholar] [CrossRef]
Method | DiffVC | CycleDiffusion |
---|---|---|
i-vector | 0.4850 ± 0.0049 | 0.5376 ± 0.0047 |
x-vector | 0.8909 ± 0.0033 | 0.9070 ± 0.0020 |
Average | 0.6880 ± 0.0041 | 0.7223 ± 0.0034 |
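Speaker similarity between i-vector or x-vector embeddings is conventionally scored with cosine similarity; higher values mean the converted speech is closer to the target speaker. A minimal sketch (the three-dimensional embeddings are made-up placeholders, and whether the paper uses plain cosine scoring is an assumption):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings of a converted utterance and the target speaker.
converted = np.array([0.3, 0.1, 0.8])
target = np.array([0.3, 0.1, 0.8])
print(cosine_similarity(converted, target))  # identical vectors -> close to 1.0
```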
Method | DiffVC | CycleDiffusion |
---|---|---|
ASR | 71.3 ± 1.4 | 74.4 ± 1.4 |
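ASR-based preservation scores like those above are typically word accuracies: the recognizer transcribes the converted speech, and the transcript is aligned against the reference text via edit distance. A sketch under that assumption (the exact scoring protocol used with the ASR system in the paper is not specified here):

```python
def word_accuracy(ref, hyp):
    """Word accuracy (%) from Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming edit distance between the two word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * (1 - d[-1][-1] / len(r))

print(word_accuracy("please call stella", "please call stella"))  # 100.0
```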
Conversion Direction | | DiffVC | CycleDiffusion |
---|---|---|---|
Intra-gender | F1 → F2 | 5.10 ± 0.24 | 4.18 ± 0.13 |
 | F2 → F1 | 7.18 ± 0.46 | 6.28 ± 0.38 |
 | M1 → M2 | 5.25 ± 0.20 | 4.52 ± 0.17 |
 | M2 → M1 | 6.28 ± 0.28 | 5.41 ± 0.17 |
 | Average | 5.95 ± 0.29 | 5.10 ± 0.22 |
Inter-gender | M1 | 6.12 ± 0.21 | 4.98 ± 0.15 |
 | M2 | 5.16 ± 0.20 | 4.32 ± 0.13 |
 | M1 | 5.87 ± 0.26 | 4.85 ± 0.17 |
 | M2 | 4.88 ± 0.20 | 4.33 ± 0.18 |
 | F1 | 7.10 ± 0.38 | 6.49 ± 0.42 |
 | F2 | 5.07 ± 0.20 | 4.36 ± 0.13 |
 | F1 | 7.16 ± 0.38 | 6.58 ± 0.37 |
 | F2 | 5.65 ± 0.26 | 4.80 ± 0.17 |
 | Average | 5.88 ± 0.26 | 5.09 ± 0.21 |
Average | | 5.90 ± 0.27 | 5.09 ± 0.22 |
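The mel-cepstral distance (MCD) values above, in dB, measure spectral distortion between converted and target speech (lower is better). A per-frame sketch following the standard formulation from the cited Kubichek (1993); in practice frames are first time-aligned (e.g., with DTW) and the 0th coefficient is often excluded, details omitted here:

```python
import math

def mel_cepstral_distance(mc_ref, mc_conv):
    """Mel-cepstral distance (dB) between two mel-cepstral coefficient
    frames: MCD = (10 / ln 10) * sqrt(2 * sum_d (ref_d - conv_d)^2)."""
    sq = sum((r - c) ** 2 for r, c in zip(mc_ref, mc_conv))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

print(mel_cepstral_distance([1.0, 0.5], [1.0, 0.5]))  # identical frames -> 0.0
```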
Conversion Direction | | DiffVC | CycleDiffusion |
---|---|---|---|
Intra-gender | F1 → F2 | 3.90 ± 0.07 | 4.16 ± 0.03 |
 | F2 → F1 | 3.26 ± 0.06 | 3.19 ± 0.03 |
 | M1 → M2 | 3.75 ± 0.07 | 4.22 ± 0.03 |
 | M2 → M1 | 3.28 ± 0.05 | 3.41 ± 0.07 |
 | Average | 3.55 ± 0.06 | 3.74 ± 0.04 |
Inter-gender | M1 | 3.13 ± 0.03 | 3.22 ± 0.04 |
 | M2 | 3.76 ± 0.07 | 4.25 ± 0.03 |
 | M1 | 3.09 ± 0.03 | 3.21 ± 0.04 |
 | M2 | 3.71 ± 0.08 | 4.18 ± 0.04 |
 | F1 | 3.18 ± 0.04 | 3.13 ± 0.03 |
 | F2 | 3.91 ± 0.07 | 4.10 ± 0.04 |
 | F1 | 3.13 ± 0.04 | 3.12 ± 0.04 |
 | F2 | 3.93 ± 0.08 | 4.20 ± 0.04 |
 | Average | 3.48 ± 0.06 | 3.68 ± 0.04 |
Average | | 3.50 ± 0.06 | 3.70 ± 0.04 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yook, D.; Han, G.; Chang, H.-P.; Yoo, I.-C. CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models. Appl. Sci. 2024, 14, 9595. https://doi.org/10.3390/app14209595