Any-to-One Non-Parallel Voice Conversion System Using an Autoregressive Conversion Model and LPCNet Vocoder
Abstract
1. Introduction
- We propose a VC framework that uses an autoregressive conversion model to predict acoustic features with higher precision, yielding smoother feature trajectories and reducing speech errors in the converted output (a minimal code sketch of this model follows this list).
- We use a high-fidelity LPCNet-based vocoder, which makes speech synthesis more efficient and enables real-time generation.
- We leverage speaker-independent phonetic posteriorgrams (SI-PPGs), which allow us to exclude the attention-based duration conversion module. Additionally, we incorporate speaker embeddings obtained from a speaker encoder network as auxiliary features, which improves overall training stability and minimizes pronunciation artifacts.
- We evaluate the effectiveness of our system by performing any-to-one voice conversion on speaker pairs from the widely used CMU-ARCTIC database of American English.
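To make the data flow concrete, the following is a minimal, hypothetical PyTorch sketch of such a conversion model: frame-level SI-PPGs are concatenated with a fixed speaker embedding, and a recurrent decoder predicts each acoustic frame conditioned on its own previous output (the autoregressive loop). All names and dimensions here (`ppg_dim`, `spk_dim`, the 20-dimensional output matching typical LPCNet features, the single GRU cell) are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ARConversionModel(nn.Module):
    """Hypothetical sketch: SI-PPG frames plus a fixed speaker embedding
    are decoded autoregressively into target-speaker acoustic frames."""

    def __init__(self, ppg_dim=144, spk_dim=256, out_dim=20, hidden=512):
        super().__init__()
        # Frame-level conditioning: PPG concatenated with the speaker embedding.
        self.prenet = nn.Sequential(nn.Linear(ppg_dim + spk_dim, hidden), nn.ReLU())
        # The decoder cell also sees the previous output frame -- this
        # feedback is what makes the model autoregressive.
        self.cell = nn.GRUCell(hidden + out_dim, hidden)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, ppgs, spk_emb):
        # ppgs: (batch, T, ppg_dim); spk_emb: (batch, spk_dim)
        batch, T, _ = ppgs.shape
        cond = self.prenet(
            torch.cat([ppgs, spk_emb.unsqueeze(1).expand(-1, T, -1)], dim=-1))
        h = ppgs.new_zeros(batch, self.cell.hidden_size)
        prev = ppgs.new_zeros(batch, self.proj.out_features)
        frames = []
        for t in range(T):
            h = self.cell(torch.cat([cond[:, t], prev], dim=-1), h)
            prev = self.proj(h)            # predicted acoustic frame at time t
            frames.append(prev)
        return torch.stack(frames, dim=1)  # (batch, T, out_dim)

# Toy usage: 200 PPG frames from any source speaker -> target acoustic frames.
model = ARConversionModel()
out = model(torch.randn(2, 200, 144), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 200, 20])
```

During training one would typically feed the ground-truth previous frame instead of the prediction (teacher forcing); at conversion time, the predicted frames would condition the LPCNet vocoder for waveform generation.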
2. Related Work
3. Method
3.1. Linguistic Features Extraction
3.2. Autoregressive Conversion Model
3.3. LPCNet Synthesizer
3.4. Model Training
4. Experiments
4.1. Database
4.2. Implementation Details
4.3. Compared Methods
- Baseline system 1 (S1): Refers to the parallel VC system of [11], which applies an ASR- and text-to-speech (TTS)-oriented pretraining strategy to Transformer models for sequence-to-sequence VC. This method yields a significant improvement in intelligibility and speech quality, even when training data are limited.
- Baseline system 2 (S2): Refers to the parallel VC system of [10], a sequence-to-sequence mapping model with attention that achieves better naturalness and speaker similarity than conventional methods.
- Baseline system 3 (S3): Refers to the non-parallel VC system of [49], based on a GAN variant called StarGAN. This system generates converted speech at high speed, allowing real-time applications, and requires only a few minutes of training to produce realistic speech.
- Baseline system 4 (S4): Refers to the non-parallel VC system of [41], which jointly trains the conversion model and a WaveNet vocoder using mel-spectrograms and phonetic posteriorgrams (PPGs).
5. Results and Discussion
5.1. Objective Evaluations
5.2. Subjective Evaluations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sisman, B.; Yamagishi, J.; King, S.; Li, H. An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157.
- Walczyna, T.; Piotrowski, Z. Overview of Voice Conversion Methods Based on Deep Learning. Appl. Sci. 2023, 13, 3100.
- Toda, T.; Black, A.W.; Tokuda, K. Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2222–2235.
- Helander, E.; Virtanen, T.; Nurminen, J.; Gabbouj, M. Voice conversion using partial least squares regression. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 912–921.
- Erro, D.; Alonso, A.; Serrano, L.; Navas, E.; Hernáez, I. Towards physically interpretable parametric voice conversion functions. In Proceedings of the 6th International Conference on Advances in Nonlinear Speech Processing, Mons, Belgium, 19–21 June 2013; pp. 75–82.
- Tian, X.; Wu, Z.; Lee, S.W.; Hy, N.Q.; Chng, E.S.; Dong, M. Sparse representation for frequency warping based voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 4235–4239.
- Nguyen, H.Q.; Lee, S.W.; Tian, X.; Dong, M.; Chng, E.S. High-quality voice conversion using prosodic and high-resolution spectral features. Multimed. Tools Appl. 2016, 75, 5265–5285.
- Zhao, Y.; Huang, W.-C.; Tian, X.; Yamagishi, J.; Das, R.K.; Kinnunen, T.; Ling, Z.; Toda, T. Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion. arXiv 2020, arXiv:2008.12527.
- Liu, R.; Chen, X.; Wen, X. Voice conversion with transformer network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; p. 7759.
- Zhang, J.X.; Ling, Z.H.; Liu, L.J.; Jiang, Y.; Dai, L.R. Sequence-to-sequence acoustic modeling for voice conversion. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 631–644.
- Huang, W.C.; Hayashi, T.; Wu, Y.C.; Kameoka, H.; Toda, T. Pretraining techniques for sequence-to-sequence voice conversion. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 745–755.
- Ezzine, K.; Frikha, M.; Di Martino, J. Non-Parallel Voice Conversion System Using An Auto-Regressive Model. In Proceedings of the 5th International Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet, Tunisia, 22–25 March 2022; pp. 500–504.
- Zhang, M.; Zhou, Y.; Zhao, L.; Li, H. Transfer learning from speech synthesis to voice conversion with non-parallel training data. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1290–1302.
- Lee, S.H.; Noh, H.R.; Nam, W.J.; Lee, S.W. Duration controllable voice conversion via phoneme-based information bottleneck. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1173–1183.
- Liu, S.; Cao, Y.; Wang, D.; Wu, X.; Liu, X.; Meng, H. Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 1717–1728.
- Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2114–2118.
- Chun, C.; Lee, Y.H.; Lee, G.W.; Jeon, M.; Kim, H.K. Non-Parallel Voice Conversion Using Cycle-Consistent Adversarial Networks with Self-Supervised Representations. In Proceedings of the 2023 IEEE 20th Consumer Communications & Networking Conference (CCNC), Las Vegas, NV, USA, 8–11 January 2023; pp. 931–932.
- Li, Y.; Qiu, X.; Cao, P.; Zhang, Y.; Bao, B. Non-parallel Voice Conversion Based on Perceptual Star Generative Adversarial Network. Circuits Syst. Signal Process. 2022, 41, 4632–4648.
- Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. StarGAN-VC: Non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 266–273.
- Saito, Y.; Ijima, Y.; Nishida, K.; Takamichi, S. Non-parallel voice conversion using variational autoencoders conditioned by Phonetic PosteriorGrams and d-vectors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5274–5278.
- Seki, S.; Kameoka, H.; Kaneko, T.; Tanaka, K. Non-parallel Whisper-to-Normal Speaking Style Conversion Using Auxiliary Classifier Variational Autoencoder. IEEE Access 2023, 11, 44590–44599.
- Alaa, Y.; Alfonse, M.; Aref, M.M. A survey on generative adversarial networks-based models for many-to-many non-parallel voice conversion. In Proceedings of the 2022 5th International Conference on Computing and Informatics (ICCI), New Cairo, Egypt, 9–10 March 2022; pp. 221–226.
- Zhou, Y.; Tian, X.; Xu, H.; Das, R.K.; Li, H. Cross-lingual voice conversion with bilingual phonetic posteriorgram and average modeling. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6790–6794.
- Liu, L.J.; Ling, Z.H.; Jiang, Y.; Zhou, M.; Dai, L.R. WaveNet Vocoder with Limited Training Data for Voice Conversion. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 1983–1987.
- Tian, X.; Chng, E.S.; Li, H. A Speaker-Dependent WaveNet for Voice Conversion with Non-Parallel Data. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 201–205.
- Guo, H.; Lu, H.; Hu, N.; Zhang, C.; Yang, S.; Xie, L.; Yu, D. Phonetic posteriorgrams based many-to-many singing voice conversion via adversarial training. arXiv 2020, arXiv:2012.01837.
- Ho, T.V.; Akagi, M. Cross-lingual voice conversion with controllable speaker individuality using variational autoencoder and star generative adversarial network. IEEE Access 2021, 9, 47503–47515.
- Zheng, W.Z.; Han, J.Y.; Lee, C.K.; Lin, Y.Y.; Chang, S.H.; Lai, Y.H. Phonetic posteriorgram-based voice conversion system to improve speech intelligibility of dysarthric patients. Comput. Methods Programs Biomed. 2022, 215, 106602.
- Kawahara, H.; Masuda-Katsuse, I.; de Cheveigné, A. Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds. Speech Commun. 1999, 27, 187–207.
- Kalchbrenner, N.; Elsen, E.; Simonyan, K.; Noury, S.; Casagrande, N.; Lockhart, E.; Kavukcuoglu, K. Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2410–2419.
- Valin, J.M.; Skoglund, J. LPCNet: Improving neural speech synthesis through linear prediction. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5891–5895.
- Kanagawa, H.; Ijima, Y. Lightweight LPCNet-Based Neural Vocoder with Tensor Decomposition. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 205–209.
- Vipperla, R.; Park, S.; Choo, K.; Ishtiaq, S.; Min, K.; Bhattacharya, S.; Mehrotra, A.; Ramos, A.G.; Lane, N.D. Bunched LPCNet: Vocoder for low-cost neural text-to-speech systems. arXiv 2020, arXiv:2008.04574.
- Popov, V.; Kudinov, M.; Sadekova, T. Gaussian LPCNet for multisample speech synthesis. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6204–6208.
- Kumar, K.; Kumar, R.; De Boissiere, T.; Gestin, L.; Teoh, W.Z.; Sotelo, J.; De Brebisson, A.; Bengio, Y.; Courville, A.C. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
- Yamamoto, R.; Song, E.; Kim, J.M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6199–6203.
- Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Adv. Neural Inf. Process. Syst. 2020, 33, 17022–17033.
- Morrison, M.; Kumar, R.; Kumar, K.; Seetharaman, P.; Courville, A.; Bengio, Y. Chunked autoregressive GAN for conditional waveform synthesis. arXiv 2021, arXiv:2110.10139.
- Sun, L.; Li, K.; Wang, H.; Kang, S.; Meng, H. Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6.
- Jia, Y.; Zhang, Y.; Weiss, R.; Wang, Q.; Shen, J.; Ren, F.; Nguyen, P.; Pang, R.; Lopez Moreno, I.; Wu, Y. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31.
- Liu, S.; Cao, Y.; Wu, X.; Sun, L.; Liu, X.; Meng, H. Jointly Trained Conversion Model and WaveNet Vocoder for Non-Parallel Voice Conversion Using Mel-Spectrograms and Phonetic Posteriorgrams. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 714–718.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Vesely, K. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA, 11–15 December 2011.
- Wang, Y.; Skerry-Ryan, R.J.; Stanton, D.; Wu, Y.; Weiss, R.J.; Jaitly, N.; Saurous, R.A. Tacotron: Towards end-to-end speech synthesis. arXiv 2017, arXiv:1703.10135.
- Srivastava, R.K.; Greff, K.; Schmidhuber, J. Highway networks. arXiv 2015, arXiv:1505.00387.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Moore, B.C. An Introduction to the Psychology of Hearing; Brill: Leiden, The Netherlands, 2012.
- Kominek, J.; Black, A.W. The CMU Arctic speech databases. In Proceedings of the Fifth ISCA Workshop on Speech Synthesis, Pittsburgh, PA, USA, 14–16 June 2004.
- Garofolo, J.S. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguist. Data Consort. 1993.
- Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. Nonparallel voice conversion with augmented classifier star generative adversarial networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2982–2995.
- Polityko, E. Word Error Rate. MATLAB Central File Exchange. Available online: https://ch.mathworks.com/matlabcentral/fileexchange/55825-word-error-rate (accessed on 25 June 2021).
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
- Morise, M.; Yokomori, F.; Ozawa, K. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. 2016, 99, 1877–1884.
Mel-cepstral distortion (MCD, dB) per conversion pair (target speaker: SLT); lower is better.

| Training Data | Method | BDL→SLT | CLB→SLT | RMS→SLT | Average |
|---|---|---|---|---|---|
| Parallel | S1 [11] | 7.08 | 6.63 | 6.88 | 6.86 |
| Parallel | S2 [10] | 7.22 | 6.64 | 7.34 | 7.06 |
| Non-parallel | S3 [49] | 6.57 | 6.47 | 6.40 | 6.48 |
| Non-parallel | S4 [41] | 7.17 | 7.31 | 7.11 | 7.19 |
| Non-parallel | Proposed | 6.53 | 6.49 | 6.37 | 6.46 |
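For reference, MCD measures the spectral distance between time-aligned mel-cepstral coefficient (MCC) sequences extracted from converted and target speech. Below is a minimal sketch of the standard computation, assuming the two sequences have already been aligned (e.g., by dynamic time warping) and the 0th (energy) coefficient has been dropped, as is common practice; the dimensions in the toy example are illustrative.

```python
import numpy as np

def mel_cepstral_distortion(mcc_conv, mcc_ref):
    """Average MCD (dB) between two time-aligned mel-cepstral sequences
    of shape (T, D); the 0th (energy) coefficient is assumed removed.
    Standard formula: MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c_d')^2),
    averaged over the T frames."""
    diff = np.asarray(mcc_conv) - np.asarray(mcc_ref)
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * per_frame.mean()

# Toy usage with 100 aligned frames of 24 coefficients each.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(100, 24)), rng.normal(size=(100, 24))
print(f"MCD = {mel_cepstral_distortion(a, b):.2f} dB")
```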
Word error rate (WER, %); lower is better ("Source" refers to the original, unconverted recordings).

| Method | BDL (Male) | CLB (Female) | Average |
|---|---|---|---|
| Source | 8.56 | 7.46 | 8.01 |
| Proposed | 28.89 | 27.69 | 28.29 |
| S1 | 37.39 | 34.19 | 35.79 |
| S2 | 32.67 | 29.87 | 31.54 |
| S3 | 41.33 | 43.03 | 42.18 |
| S4 | 50.60 | 48.76 | 49.68 |
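WER is the Levenshtein edit distance [83] between the word sequence recognized from the (converted) speech and its reference transcript, normalized by the reference length (cf. the MATLAB implementation cited in [82]). A minimal equivalent sketch in Python; the example sentence is illustrative only:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in % = (substitutions + deletions + insertions) / reference
    length, computed via Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.1f}%")
# 1 substitution + 1 deletion over 6 reference words -> 33.3%
```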