Article

CycleDiffusion: Voice Conversion Using Cycle-Consistent Diffusion Models

Artificial Intelligence Laboratory, Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(20), 9595; https://doi.org/10.3390/app14209595
Submission received: 3 October 2024 / Revised: 15 October 2024 / Accepted: 18 October 2024 / Published: 21 October 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Voice conversion (VC) refers to the technique of modifying one speaker’s voice to mimic another’s while retaining the original linguistic content. This technology finds its applications in fields such as speech synthesis, accent modification, medicine, security, privacy, and entertainment. Among the various deep generative models used for voice conversion, including variational autoencoders (VAEs) and generative adversarial networks (GANs), diffusion models (DMs) have recently gained attention as promising methods due to their training stability and strong performance in data generation. Nevertheless, traditional DMs focus mainly on learning reconstruction paths like VAEs, rather than conversion paths as GANs do, thereby restricting the quality of the converted speech. To overcome this limitation and enhance voice conversion performance, we propose a cycle-consistent diffusion (CycleDiffusion) model, which comprises two DMs: one for converting the source speaker’s voice to the target speaker’s voice and the other for converting it back to the source speaker’s voice. By employing two DMs and enforcing a cycle consistency loss, the CycleDiffusion model effectively learns both reconstruction and conversion paths, producing high-quality converted speech. The effectiveness of the proposed model in voice conversion is validated through experiments using the VCTK (Voice Cloning Toolkit) dataset.

1. Introduction

Voice conversion (VC) is the process of converting the speech of one speaker into the speech of another speaker, while retaining the linguistic information of the utterance [1,2]. Application areas for voice conversion include speech synthesis, accent conversion, medicine, security, privacy, and entertainment. VC typically consists of three steps: feature extraction, feature transformation, and waveform reconstruction. Since spectrograms are typically used as the extracted speech features and neural vocoders are utilized for waveform reconstruction, most recent research efforts focus on feature transformation through deep neural networks (DNNs), which convert the spectrogram of a source speaker’s utterance into that of a target speaker’s utterance.
The variational autoencoder (VAE) [3] is one of the first deep learning models to be applied to VC [4,5,6]. A VAE consists of an encoder and a decoder. The encoder of the VAE is trained to generate low-dimensional latent vectors that capture only the linguistic information from the input speech feature vectors. The decoder of the VAE is trained to reconstruct the original input by combining the latent vectors with the input speaker information. Voice conversion is accomplished by supplying the decoder with the desired target speaker's information instead of the input speaker's information. Although this enables many-to-many voice conversion using non-parallel training data, which does not require the same linguistic content spoken by different speakers, the quality of the converted speech is somewhat lower compared to methods based on generative adversarial networks (GANs) [7].
Since the cycle-consistent adversarial network (CycleGAN) [8] was introduced for image-to-image translation, GAN-based voice conversion has been extensively studied [9,10,11,12,13,14,15,16]. A CycleGAN comprises two GANs. One GAN is trained to transform the voice of the source speaker into that of the target speaker, while the other GAN is trained to revert the target speaker’s voice back to the original source speaker’s voice. This approach utilizes non-parallel training data and, unlike VAEs, it shows superior quality in converted speech by learning voice conversion paths explicitly, instead of just input reconstruction paths. Nonetheless, GANs suffer from training instability and mode collapse, and building many-to-many voice conversion models with CycleGANs involves significant computational complexity.
Recently, diffusion models (DMs) [17,18,19,20,21,22,23,24] have emerged as promising generative models. A DM is trained to reconstruct the original input data by progressively adding noise to the input data through a forward diffusion process, resulting in a noise distribution, and then progressively removing the noise through a reverse diffusion process. DMs are characterized by stability during training and demonstrate excellent performance in generating high-quality data via reverse diffusion. DMs have been applied not only to image generation but also to voice conversion [25,26,27]. However, DMs fundamentally learn only the reconstruction paths, similar to VAEs, rather than the conversion paths as GANs do, leaving much to be improved in the quality of the converted speech. To address this limitation of conventional DMs and enhance the quality of converted speech, we propose a cycle-consistent diffusion (CycleDiffusion) model, which consists of two DMs: one to convert the source speaker’s voice to the target speaker’s voice, and the other to revert the target speaker’s voice back to the source speaker’s voice. By employing two DMs and enforcing a cycle consistency loss, the proposed CycleDiffusion model can effectively learn not only the reconstruction paths but also the conversion paths, leading to high-quality converted speech. The effectiveness of the proposed model for voice conversion is experimentally validated using VCTK (Voice Cloning Toolkit) data.
The contributions of this paper are as follows:
  • We propose a novel CycleDiffusion model that can learn both the reconstruction and conversion paths.
  • We provide an efficient training algorithm so that the proposed CycleDiffusion model can be reliably trained.
  • The proposed CycleDiffusion model is applied to voice conversion without parallel training data.
  • We demonstrate the usefulness of the proposed method using VCTK data.
The rest of the paper is structured as follows. Section 2 reviews related previous studies. Section 3 introduces the proposed CycleDiffusion model. Section 4 analyzes the experimental results. Finally, Section 5 concludes the paper. The Nomenclature section summarizes the notation used throughout the paper.

2. Related Works

In this section, we review the DM-based voice conversion method [23] and the CycleGAN-based voice conversion method [12].

2.1. Diffusion Model-Based Voice Conversion

A DM involves two processes: forward diffusion and reverse diffusion. The forward diffusion process progressively introduces Gaussian noise into the input data, whereas the reverse diffusion process attempts to eliminate this noise. In the context of voice conversion, these forward and reverse diffusions can be represented by the following stochastic differential equations (SDEs):
$$dx_t = \tfrac{1}{2}\beta_t(\bar{x} - x_t)\,dt + \sqrt{\beta_t}\,dw_t, \quad \text{(forward diffusion)} \tag{1}$$
$$d\tilde{x}_t = \Big(\tfrac{1}{2}(\bar{x} - \tilde{x}_t) - \nabla\log p(\tilde{x}_t)\Big)\beta_t\,dt + \sqrt{\beta_t}\,d\tilde{w}_t, \quad \text{(reverse diffusion)} \tag{2}$$
where $x_t$ is a forward-time stochastic process at time $t$ with $x_0$ being an input speech feature vector, $\bar{x}$ is the linguistic information contained in $x_0$, $\tilde{x}_t$ is a reverse-time stochastic process at time $t$ with $\tilde{x}_0$ being a reconstructed speech feature vector, $w$ and $\tilde{w}$ are the forward- and reverse-time Wiener processes, respectively, $\beta_t$ is a noise schedule, and $t \in [0, T]$ is a continuous time variable ($T$ is generally set to 1). When $\tilde{x}_T$ is the linguistic information contained in $x_0$ without any speaker details, the reverse-time stochastic process $\tilde{x}_t$ gradually reconstructs the input speech feature vector, incorporating the input speaker characteristics. The reverse-time Wiener process $\tilde{w}$ mirrors the forward-time Wiener process, except that it operates with time flowing backward, from $T$ to 0.
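To make the continuous-time formulation concrete, the following sketch discretizes Equations (1) and (2) with a simple Euler–Maruyama scheme. It is not code from the paper: the helper names, the fixed step size, and the placeholder `score_fn` standing in for the learned score are assumptions.

```python
import numpy as np

def forward_diffusion(x0, x_bar, betas, dt, rng):
    """Euler-Maruyama discretization of Eq. (1): the drift pulls x_t toward the
    speaker-independent 'mean' x_bar while Gaussian noise is injected."""
    x = x0.copy()
    for beta in betas:                                  # betas[k] approximates beta at step k
        drift = 0.5 * beta * (x_bar - x) * dt
        noise = np.sqrt(beta * dt) * rng.standard_normal(x.shape)
        x = x + drift + noise
    return x                                            # close to N(x_bar, I) for a long schedule

def reverse_diffusion(xT, x_bar, betas, dt, score_fn, rng):
    """Euler-Maruyama discretization of Eq. (2), integrated backward from T to 0.
    score_fn(x, t) stands in for the learned score S_theta."""
    x = xT.copy()
    times = np.linspace(1.0, dt, len(betas))            # descending time stamps
    for beta, t in zip(betas[::-1], times):
        drift = (0.5 * (x_bar - x) - score_fn(x, t)) * beta * dt
        noise = np.sqrt(beta * dt) * rng.standard_normal(x.shape)
        x = x - drift + noise                           # minus sign: time flows from T to 0
    return x
```

Here `betas`, `dt`, and `rng` would come from the chosen noise schedule and a NumPy random generator such as `np.random.default_rng()`.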
A DM is trained to minimize the discrepancy between the trajectories of the forward and reverse diffusion processes, enforcing $\tilde{x}_0$ to be similar to $x_0$, as in an autoencoder. This can be accomplished as follows. First, note that the forward diffusion SDE, Equation (1), yields the following transition distribution:
$$p(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \alpha_t x_0 + (1-\alpha_t)\bar{x},\ (1-\alpha_t^2)I\big), \tag{3}$$
where $\alpha_t$ depends on the noise schedule. Consequently, $p(x_t \mid x_0)$ evolves into $\mathcal{N}(x_t; \bar{x}, I)$ for a noise schedule under which $\alpha_t$ approaches 0, meaning that the final latent vector $x_T$ contains only the linguistic information without any speaker information. Since $p(x_t \mid x_0)$ is a Gaussian distribution, the gradient of the log probability density with respect to the data (known as the score) can be easily computed. Then, $\nabla\log p(\tilde{x}_t)$ in the reverse diffusion SDE, Equation (2), is replaced with an estimate obtained by minimizing the following loss function:
$$\mathcal{L}_{\mathrm{diffusion}}(x_0^{\zeta}, \zeta) = \mathbb{E}_t\Big[(1-\alpha_t^2)\,\mathbb{E}_{x_0^{\zeta}}\,\mathbb{E}_{x_t^{\zeta}\mid x_0^{\zeta}}\big\|S_\theta(x_t^{\zeta}, \bar{x}, \zeta, t) - \nabla\log p(x_t^{\zeta}\mid x_0^{\zeta})\big\|_2^2\Big], \tag{4}$$
where $x_0^{\zeta}$ represents an input speech feature vector from speaker $\zeta$, $S_\theta$ indicates the estimated score function implemented using a DNN with parameters $\theta$, and $\|\cdot\|_2$ denotes the $\ell_2$ norm. An encoder is separately trained to extract the speaker-independent linguistic information $\bar{x}$ from the input speech feature vector $x_0^{\zeta}$.
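Because the transition density in Equation (3) is Gaussian, the target score in Equation (4) has a closed form. The following PyTorch-style sketch computes one Monte Carlo estimate of the loss; the names `score_model` and `alpha_fn` are placeholders, not the authors' implementation.

```python
import torch

def diffusion_loss(score_model, x0, x_bar, spk, alpha_fn):
    """One Monte Carlo sample of Eq. (4): draw t ~ U(0,1), sample x_t from the
    closed-form Gaussian of Eq. (3), and regress the model output onto the
    analytic score of p(x_t | x_0)."""
    batch = x0.size(0)
    t = torch.rand(batch, device=x0.device)                   # t ~ U(0, 1)
    alpha_t = alpha_fn(t).view(batch, *([1] * (x0.dim() - 1)))
    mean = alpha_t * x0 + (1.0 - alpha_t) * x_bar              # Eq. (3) mean
    var = 1.0 - alpha_t ** 2                                   # Eq. (3) variance
    x_t = mean + var.sqrt() * torch.randn_like(x0)
    true_score = -(x_t - mean) / var                           # gradient of the Gaussian log-density
    est_score = score_model(x_t, x_bar, spk, t)
    sq_err = (est_score - true_score) ** 2
    return (var * sq_err).sum(dim=tuple(range(1, x0.dim()))).mean()
```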
Once the score function $S_\theta$ is well trained, voice conversion from speaker $\zeta$ to speaker $\xi$, for an input speech feature vector $x_0^{\zeta}$ from speaker $\zeta$, is achieved by sampling $x_T^{\zeta}$ via Equation (3),
$$x_T^{\zeta} = F(x_0^{\zeta}) \sim p(x_T^{\zeta} \mid x_0^{\zeta}), \tag{5}$$
where $F(x_0^{\zeta})$ represents the result of the forward diffusion process at time $T$, and then solving the reverse diffusion SDE to generate the converted speech feature vector $\tilde{x}_0^{\zeta\to\xi}$ of speaker $\xi$ as follows:
$$\tilde{x}_0^{\zeta\to\xi} = R(x_T^{\zeta}, \xi), \tag{6}$$
where $R(x_T^{\zeta}, \xi)$ denotes the solution of Equation (2) at time $t = 0$, with the score function replaced by $S_\theta(\tilde{x}_t^{\xi}, \bar{x}, \xi, t)$ and $x_T^{\zeta}$ used instead of $x_T^{\xi}$. Note that the speaker embedding of the source speaker $\zeta$ is replaced with that of the target speaker $\xi$ during inference.
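In code, the conversion of Equations (5) and (6) amounts to sampling the terminal latent from Equation (3) at $t = T$ and then running the reverse SDE conditioned on the target speaker. The sketch below assumes the helpers from the previous snippets (`alpha_fn`, a reverse-SDE solver passed as `reverse_sde`) and is only illustrative.

```python
import torch

def convert(x0_src, x_bar, tgt_spk, score_model, alpha_fn, reverse_sde):
    """Sketch of Eqs. (5)-(6): map the source feature to the nearly
    speaker-agnostic latent x_T with the closed-form forward kernel (F),
    then solve the reverse SDE (R) conditioned on the *target* speaker."""
    batch = x0_src.size(0)
    alpha_T = alpha_fn(torch.ones(batch, device=x0_src.device))
    alpha_T = alpha_T.view(batch, *([1] * (x0_src.dim() - 1)))
    mean = alpha_T * x0_src + (1.0 - alpha_T) * x_bar                     # Eq. (3) at t = T
    x_T = mean + (1.0 - alpha_T ** 2).sqrt() * torch.randn_like(x0_src)   # F(x0)
    return reverse_sde(x_T, x_bar, tgt_spk, score_model)                  # R(x_T, target)
```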
One limitation of DM-based voice conversion is that the score function $S_\theta$ is trained to reconstruct the input, as shown in Equation (4), while it is utilized in Equation (6) to convert the input speech from a source speaker to a target speaker. Therefore, it is preferable to train $S_\theta$ for conversion rather than just reconstruction.

2.2. CycleGAN-Based Voice Conversion

The CycleGAN-based voice conversion technique employs two GANs, i.e., two generator and discriminator pairs. One generator transforms the speech feature vectors from the source speaker to those of the target speaker, while the other does the reverse. During training, the adversarial loss from the discriminators ensures the correct conversion of speaker characteristics in the feature vectors, while the cycle consistency loss between the input to the first generator and the output of the second generator ensures that the linguistic content of the speech feature vectors remains unchanged.
For instance, an input speech feature vector $x^{\zeta}$ (omitting the time index subscript hereafter for brevity) from speaker $\zeta$ is transformed into $x^{\zeta\to\xi}$ using the first generator $G_1$:
$$x^{\zeta\to\xi} = G_1(x^{\zeta}). \tag{7}$$
The first discriminator $D_1(x^{\zeta\to\xi})$ provides the adversarial loss for $x^{\zeta\to\xi}$ so that the speaker characteristic in the feature vector changes from speaker $\zeta$ to speaker $\xi$. The converted feature vector $x^{\zeta\to\xi}$ is then transformed back to $x^{\zeta\to\xi\to\zeta}$ by the second generator $G_2$:
$$x^{\zeta\to\xi\to\zeta} = G_2(x^{\zeta\to\xi}), \tag{8}$$
which reverts the speaker characteristic in the feature vector from speaker $\xi$ back to speaker $\zeta$. The second discriminator $D_2(x^{\zeta\to\xi\to\zeta})$ provides the adversarial loss for $x^{\zeta\to\xi\to\zeta}$. The following cycle consistency loss between the original input vector $x^{\zeta}$ and the reconstructed vector $x^{\zeta\to\xi\to\zeta}$ maintains the preservation of linguistic content:
$$\mathcal{L}_{\mathrm{cycle}}(x^{\zeta}, x^{\zeta\to\xi\to\zeta}) = \big\|x^{\zeta} - x^{\zeta\to\xi\to\zeta}\big\|_1, \tag{9}$$
where $\|\cdot\|_1$ denotes the $\ell_1$ norm.
Similarly, an input speech feature vector $x^{\xi}$ from speaker $\xi$ is converted to $x^{\xi\to\zeta}$ and then back to $x^{\xi\to\zeta\to\xi}$ by the generators $G_2$ and $G_1$, respectively, providing another cycle consistency loss. The discriminators $D_2$ and $D_1$ provide the adversarial losses for $x^{\xi\to\zeta}$ and $x^{\xi\to\zeta\to\xi}$, respectively. By using both the adversarial loss and the cycle consistency loss together, a CycleGAN can directly learn the conversion path without the need for parallel training data.
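For comparison with the diffusion counterpart introduced later, the generator-side CycleGAN objective described above can be sketched as follows. The least-squares adversarial form and the cycle weight of 10 are common choices rather than values from the cited papers, and the identity-mapping loss used by some CycleGAN-VC variants is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cyclegan_vc_generator_loss(x_src, x_tgt, G1, G2, D1, D2, cycle_weight=10.0):
    """Generator-side objective following the description above: D1 judges
    feature vectors that should sound like speaker xi, D2 those that should
    sound like speaker zeta; cycle consistency uses the l1 norm of Eq. (9)."""
    def adv(d_out):                                       # least-squares "real" target
        return F.mse_loss(d_out, torch.ones_like(d_out))
    # zeta -> xi -> zeta cycle, Eqs. (7)-(9)
    x_s2t = G1(x_src)
    x_s2t2s = G2(x_s2t)
    loss = adv(D1(x_s2t)) + adv(D2(x_s2t2s))
    loss = loss + cycle_weight * (x_src - x_s2t2s).abs().mean()
    # xi -> zeta -> xi cycle (the symmetric direction)
    x_t2s = G2(x_tgt)
    x_t2s2t = G1(x_t2s)
    loss = loss + adv(D2(x_t2s)) + adv(D1(x_t2s2t))
    loss = loss + cycle_weight * (x_tgt - x_t2s2t).abs().mean()
    return loss
```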
The recently proposed MaskCycleGAN-VC [12], which is a variant of CycleGAN-based voice conversion, and wav2wav [16], which utilizes cycle consistency for wave-to-wave voice conversion, have demonstrated high performance. However, GANs are prone to training instability and mode collapse, and developing many-to-many voice conversion models with CycleGANs entails substantial computational complexity.

3. Proposed Method

3.1. Cycle-Consistent Diffusion (CycleDiffusion)

As detailed in Section 2.1, conventional DM-based voice conversion learns only the reconstruction path as a VAE does and does not learn the conversion path, resulting in significant room for improvement in the quality of the converted speech. Therefore, we propose to utilize cycle consistency loss for training diffusion models, referred to as cycle-consistent diffusion (or CycleDiffusion for short). The proposed method aims to learn both the reconstruction path and the conversion path while enforcing that linguistic information remains unchanged.
Figure 1 depicts the proposed CycleDiffusion training for a two-speaker scenario. Initially, the DM is trained to reconstruct the input. The input $x^{\zeta}$ from speaker $\zeta$ is reconstructed to $\tilde{x}^{\zeta}$, and the input $x^{\xi}$ from speaker $\xi$ is reconstructed to $\tilde{x}^{\xi}$, which is typical DM training using $\mathcal{L}_{\mathrm{diffusion}}(x^{\zeta}, \zeta)$ and $\mathcal{L}_{\mathrm{diffusion}}(x^{\xi}, \xi)$, respectively. As the usual reconstruction path training stabilizes, the conversion path training commences as follows. The input $x^{\zeta}$ is converted to $x^{\zeta\to\xi}$ through Equations (5) and (6):
$$x^{\zeta\to\xi} = R(F(x^{\zeta}), \xi). \tag{10}$$
Subsequently, the converted speech vector $x^{\zeta\to\xi}$ is reverted to $x^{\zeta\to\xi\to\zeta}$, again via Equations (5) and (6):
$$x^{\zeta\to\xi\to\zeta} = R(F(x^{\zeta\to\xi}), \zeta). \tag{11}$$
Finally, the cycle consistency loss between $x^{\zeta}$ and $x^{\zeta\to\xi\to\zeta}$ is minimized using $\mathcal{L}_{\mathrm{cycle}}(x^{\zeta}, x^{\zeta\to\xi\to\zeta})$. Similarly, the cycle consistency loss between $x^{\xi}$ and $x^{\xi\to\zeta\to\xi}$ is minimized as well.
This cycle consistency loss minimization improves voice conversion performance in two respects. First, the conversion path from speaker $\xi$ to speaker $\zeta$ is explicitly trained by reducing the difference between $x^{\zeta}$, the original speech of speaker $\zeta$ (the target of this conversion), and the converted speech $x^{\zeta\to\xi\to\zeta}$, which is produced from $x^{\zeta\to\xi}$, an utterance carrying the characteristics of speaker $\xi$. That is, the input $x^{\zeta\to\xi}$ sounds like speaker $\xi$, while the output $x^{\zeta\to\xi\to\zeta}$ must sound like speaker $\zeta$. Similarly, the conversion path from speaker $\zeta$ to speaker $\xi$ is explicitly trained as well. Second, since both $x^{\zeta}$ and $x^{\zeta\to\xi\to\zeta}$ (as well as $x^{\xi}$ and $x^{\xi\to\zeta\to\xi}$) should represent the same utterance, minimizing their differences ensures that the linguistic information remains unchanged.
The overall loss function for CycleDiffusion training is as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{diffusion}}(x^{\zeta}, \zeta) + \mathcal{L}_{\mathrm{diffusion}}(x^{\xi}, \xi) + \lambda_i\big(\mathcal{L}_{\mathrm{cycle}}(x^{\zeta}, x^{\zeta\to\xi\to\zeta}) + \mathcal{L}_{\mathrm{cycle}}(x^{\xi}, x^{\xi\to\zeta\to\xi})\big), \tag{12}$$
where $\lambda_i$ is a weight schedule for the cycle consistency loss, starting at 0 and gradually increasing to $\lambda_{\max}$ as training progresses. This approach can be easily extended to accommodate multiple speakers.

3.2. Training Algorithm

Pseudo-code for CycleDiffusion training for many-to-many voice conversion is shown in Algorithm 1, where $\eta$ denotes the learning rate. Monitoring the decrease in the loss values can serve as a termination criterion for the algorithm.
Algorithm 1. CycleDiffusion training for many-to-many voice conversion (pseudo-code provided as a figure in the original article).
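Since Algorithm 1 appears only as a figure, the following Python-style sketch is a reconstruction from the description in Sections 3.1 and 3.2. The helper callables (`diffusion_loss`, `convert_fn`, `pick_target`, `content_encoder`) correspond to the quantities discussed above but are hypothetical names, and details such as batching are not taken from the paper.

```python
import torch

def train_cyclediffusion(model, content_encoder, loader, speakers, *,
                         epochs=270, lr=3e-5, warmup_epochs=50, lambda_max=1.0,
                         diffusion_loss=None, convert_fn=None, pick_target=None):
    """Hypothetical reconstruction of Algorithm 1. diffusion_loss(model, x, x_bar, spk)
    computes Eq. (4), convert_fn(x, x_bar, tgt) realizes Eqs. (5)-(6), and
    pick_target(spk, speakers) selects a different speaker; all are stand-ins."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)           # lr corresponds to eta
    for epoch in range(1, epochs + 1):
        lam = 0.0 if epoch <= warmup_epochs else lambda_max     # weight schedule lambda_i
        for x, spk in loader:                                   # mel features and speaker labels
            x_bar = content_encoder(x)                          # speaker-independent linguistic content
            loss = diffusion_loss(model, x, x_bar, spk)         # reconstruction path, L_diffusion
            if lam > 0.0:                                       # conversion path after warm-up
                tgt = pick_target(spk, speakers)
                x_conv = convert_fn(x, x_bar, tgt).detach()     # x^{zeta->xi}; dotted-line pass frozen
                x_back = convert_fn(x_conv, content_encoder(x_conv), spk)  # x^{zeta->xi->zeta}
                loss = loss + lam * (x - x_back).abs().mean()   # cycle consistency, L_cycle
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The `.detach()` on the first conversion pass mirrors the choice reported in Section 4 of freezing the parameters on the dotted lines of Figure 1 and updating only those on the dashed lines.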

3.3. Comparison to Similar Works

The concept of cycle consistency for DMs was presented in a document enhancement task [28]. Nevertheless, it did not employ the cycle consistency loss for DM training. Rather, two independently trained DMs—one in the noisy document domain and the other in the clean document domain—were linked at the latent space during inference, enabling cyclic conversion. The cycle consistency loss for DM training was initially explored in [29] and applied to image-to-image translation tasks. CycleDiffusion differs from [29] in that it uses score-based diffusion models, which are a continuous-time version of DMs, and is applied to voice conversion tasks; whereas [29] used denoising diffusion probabilistic models, which are a discrete-time version of DMs. To the best of the authors’ knowledge, CycleDiffusion is the first work employing cycle consistency loss for score-based DM training in voice conversion tasks.

4. Experiments

To evaluate the effectiveness of the proposed method, we used a subset of the VCTK dataset [30], which is commonly employed in voice conversion research. This subset comprised two female speakers (P236 and P239) and two male speakers (P259 and P263), who served as both source and target speakers. For simplicity, we refer to these speakers as F1, F2, M1, and M2, respectively, in the subsequent discussion. Each speaker contributed 461 training utterances and 10 test utterances. With 12 conversion directions and 9 target speaker embedding vectors per target speaker, the total number of test samples amounted to 1080 utterances.
We utilized 80-dimensional mel-spectrograms as the speech feature vectors, extracted from the speech waveforms downsampled to 22.05 kHz. These spectrograms were computed every 256 samples using a Hann window of 1024 samples in length. For the mel-spectrogram inversion to waveforms, we employed a pre-trained universal HiFi-GAN vocoder [31].
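As a point of reference, a typical extraction matching the stated parameters (80 mel bins, 22.05 kHz audio, 1024-sample Hann window, 256-sample hop) looks as follows; the exact normalization and log compression used by the authors are not specified, so those details are assumptions.

```python
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """80-bin log-mel extraction matching the stated frame parameters."""
    y, _ = librosa.load(path, sr=sr)                      # resample to 22.05 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
        win_length=n_fft, window="hann", n_mels=n_mels)
    return np.log(np.clip(mel, 1e-5, None))               # log compression; scaling is an assumption
```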
An Adam optimizer with a learning rate of 0.00003 was used for the stochastic gradient descent training in lines 6, 12, and 15 of Algorithm 1. The total number of training epochs was set to 270 after monitoring the reduction of the loss values. Following [25], a noise schedule of $\beta_t = \beta_0 + (\beta_T - \beta_0)t$ with $\beta_0 = 0.05$ and $\beta_T = 20.0$ was employed.
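The stated linear schedule, together with the commonly used definition $\alpha_t = \exp\!\big(-\tfrac{1}{2}\int_0^t \beta_s\,ds\big)$ (an assumption on our part, not a quotation from the paper), can be written as:

```python
import numpy as np

BETA_0, BETA_T = 0.05, 20.0

def beta(t):
    """Linear noise schedule used in the experiments: beta_t = beta_0 + (beta_T - beta_0) * t."""
    return BETA_0 + (BETA_T - BETA_0) * t

def alpha(t):
    """alpha_t = exp(-0.5 * integral of beta from 0 to t), i.e., the closed-form
    integral of the linear schedule; this definition is an assumption."""
    integral = BETA_0 * t + 0.5 * (BETA_T - BETA_0) * t ** 2
    return np.exp(-0.5 * integral)
```

With these constants, $\alpha_T \approx 0.007$ at $T = 1$, so the terminal distribution is close to $\mathcal{N}(\bar{x}, I)$, as required in Section 2.1.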
In a preliminary experiment, we compared two cycle consistency training schemes: the first scheme updated all parameters involved in the dotted and dashed lines of Figure 1, while the second scheme updated only the parameters involved in the dashed lines. We found that both schemes yielded almost identical performance, but the second scheme was faster. Therefore, during gradient descent on the cycle consistency loss, the parameters associated with the dotted lines are frozen and those involved in the dashed lines are adjusted, for faster training. To determine when the conversion path training should become effective in Algorithm 1, we conducted multiple experiments with varying starting points. We generally observed that initiating the conversion path training yields better results once the reconstruction path training has converged to a certain extent. Therefore, $\lambda_i$ was set to 0 for $i \in [1, 50]$ and to 1 for $i \in [51, 270]$.
DiffVC [23], recognized as the leading VC method using DMs, served as a baseline for comparison. To ensure fair comparisons, the same pretrained encoder was used for both DiffVC and CycleDiffusion.

4.1. Objective Evaluation

4.1.1. Speaker Similarity Test

Cosine similarity is one of the most widely used measures in the field of speaker recognition for quantifying the similarity between two speakers' voices. This metric is defined as the inner product of two vectors divided by the product of their magnitudes, i.e., the cosine of the angle between them. In speaker recognition, i-vectors [32] or x-vectors [33] are typically utilized to characterize a speaker. We used the Kaldi toolkit [34] to extract both i-vectors and x-vectors, and calculated the cosine similarities between the converted and target speeches. A higher cosine similarity value indicates better voice conversion performance.
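A minimal sketch of the metric (the embedding extraction itself is done with Kaldi and is not shown) is:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two speaker embeddings (i-vectors or x-vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In this setting, `a` would be the embedding of a converted utterance and `b` the embedding of genuine speech from the target speaker.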
Table 1 summarizes the average cosine similarity values along with the corresponding 95% confidence intervals for the converted speeches generated by DiffVC and CycleDiffusion.
The cosine similarity scores for the i-vectors demonstrated a modest enhancement of 10.8% (from 0.4850 to 0.5376). The cosine similarity using the x-vectors showed only marginal improvement. However, the confidence interval contracted by 39.4% (from 0.0033 to 0.0020), indicating a sharper distribution and therefore diminished variance across speakers. On average, the proposed method outperformed the baseline system in the speaker similarity tests using both i-vectors and x-vectors. This suggests that incorporating cycle consistency loss into the diffusion model-based voice conversion training process is an effective strategy for enhancing speaker similarity and achieving more consistent results.

4.1.2. Linguistic Information Preservation Test

Effective voice conversion requires not only a change in speaker characteristics but also the preservation of linguistic information. Automatic speech recognition (ASR) can be utilized to quantify how well the linguistic information is maintained in the converted speech by measuring the speech recognition accuracy of the converted speech. We used Whisper [35], a publicly available ASR system, to measure the speech recognition accuracies. To account for the imperfections of ASR technologies, the speech recognition accuracy of the converted speech is calculated by comparing it against the recognized text of the original input speech, which serves as the reference transcription. An accuracy of 100% signifies complete preservation of the linguistic information, whereas an accuracy of 0% indicates that all linguistic information has been lost during the voice conversion process. Table 2 summarizes the average speech recognition accuracies, along with 95% confidence intervals, for the converted speeches generated by DiffVC and CycleDiffusion.
The ASR system achieved a recognition rate of 71.3% on the speech transformed by DiffVC, while the speech converted through CycleDiffusion yielded a superior recognition rate of 74.4%, resulting in a statistically significant relative error reduction rate of 10.8%. The higher speech recognition accuracy achieved by CycleDiffusion can be credited to enhanced intelligibility and better preservation of linguistic information, facilitated by the integration of cycle consistency loss during the training process.
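A sketch of this accuracy computation is given below. The choice of Whisper model size and the WER-based definition of accuracy are assumptions; only the use of the recognized text of the original input as the reference follows the description above.

```python
import whisper
import jiwer

def recognition_accuracy(original_wav, converted_wav, model_size="base"):
    """Accuracy of the converted speech measured against the ASR transcript of
    the original input, which serves as the reference (Section 4.1.2)."""
    model = whisper.load_model(model_size)
    reference = model.transcribe(original_wav)["text"]
    hypothesis = model.transcribe(converted_wav)["text"]
    wer = jiwer.wer(reference.lower(), hypothesis.lower())
    return max(0.0, 1.0 - wer) * 100.0    # 100% = complete preservation of linguistic content
```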

4.1.3. Mel-Cepstral Distance Test

For the last objective evaluation metric, we used the mel-cepstral distance (MCD) [36] to quantify the dissimilarity between the converted speech and the target speech. To compensate for length discrepancies between the converted and target speeches, dynamic time warping (DTW) was applied before calculating the MCD. Whereas the cosine similarity measures speaker similarity and the speech recognition accuracy assesses the preservation of linguistic content, the MCD measures overall spectral similarity. A lower MCD value between the converted and target speeches indicates better voice conversion performance. Table 3 summarizes the average MCDs along with the corresponding 95% confidence intervals for the converted speeches generated by DiffVC and CycleDiffusion.
The experimental findings outlined in Table 3 reveal that the proposed CycleDiffusion model surpasses the baseline DiffVC in both intra-gender and inter-gender voice conversion tasks, yielding a 15.9% relative improvement in the average MCD. The CycleDiffusion model consistently achieves lower MCD values regardless of the conversion direction, indicating improved quality in the converted speech. This decrease in the MCD values highlights the essential role of incorporating cycle consistency loss into the diffusion-based voice conversion framework, significantly improving conversion performance compared to the baseline.
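For reference, the MCD computation described in this subsection can be sketched as follows; the mel-cepstrum extraction itself (and its order) is left out, and the plain DTW shown here is only one possible alignment choice.

```python
import numpy as np

MCD_CONST = 10.0 / np.log(10.0) * np.sqrt(2.0)   # Kubichek (1993) scaling, in dB

def mcd_with_dtw(mcep_conv, mcep_tgt):
    """Mel-cepstral distance between two mel-cepstral sequences
    (frames x coefficients, 0th coefficient removed), averaged along a
    DTW path computed on Euclidean frame distances."""
    n, m = len(mcep_conv), len(mcep_tgt)
    frame_dist = np.linalg.norm(
        mcep_conv[:, None, :] - mcep_tgt[None, :, :], axis=-1)   # (n, m) distance matrix
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):                                    # standard DTW recursion
        for j in range(1, m + 1):
            acc[i, j] = frame_dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack the warping path and average the per-frame distances along it
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    dists = [frame_dist[a, b] for a, b in path]
    return MCD_CONST * float(np.mean(dists))
```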

4.2. Subjective Evaluation

Subjective evaluations of sound quality and similarity for the baseline DiffVC and the proposed CycleDiffusion were carried out using the pretrained MOSNet [37], which predicts the human mean opinion score (MOS) rating. The MOS scale ranges from 0 to 5, with 5 indicating the most natural speech and 0 indicating the least natural speech. MOSNet is designed to simulate a human evaluator conducting MOS assessments on input speech samples. Table 4 summarizes the mean MOS values and the corresponding 95% confidence intervals produced by MOSNet for the converted speech outputs generated by DiffVC and CycleDiffusion.
The proposed method surpasses the baseline system in both intra-gender and inter-gender conversion scenarios. It is especially effective in enhancing the quality of voices that already perform well with the existing method (MOS of 3.5 or higher). For example, in the case of the F1 → M2 conversion, the MOS value increased by approximately 0.49 (from 3.76 to 4.25), while in the case of the M2 → F2 conversion, it improved by 0.27 (from 3.93 to 4.20). In some cases where the existing method struggled, that is, with MOS values of 3.2 or lower, such as the M1 → F1 conversion, it decreased slightly by 0.05 (from 3.18 to 3.13). Despite these few less favorable cases of minor decreases, the average MOS values showed a significant improvement of approximately 5.7% (from 3.50 to 3.70), demonstrating that the voices generated by the proposed method sound more natural. It is worth mentioning that, in addition to the improvement in the average MOS values, the confidence interval decreased by 32.1% on average (from 0.06 to 0.04), indicating less variance in voice conversion performance among different speakers.

4.3. Spectrograms

Figure 2 shows the sample spectrograms of the utterances converted by DiffVC and CycleDiffusion, respectively. As demonstrated in the figure, the spectrograms of the utterances processed by CycleDiffusion show more distinct and well-defined formant structures in comparison to those generated by DiffVC.

5. Conclusions

DMs have emerged as a promising approach in the field of voice conversion, leveraging their ability to generate high-quality, diverse data through a two-phase process of forward diffusion and reverse denoising. They ensure stability and quality in data generation compared to traditional methods such as VAEs and GANs. Although DMs for voice conversion have shown substantial potential, conventional DM-based voice conversion methods only learn the reconstruction path, similar to VAEs, and fail to learn the conversion path, leaving much room for improvement in the converted speech quality. To address this, we have proposed CycleDiffusion, which leverages cycle consistency loss for training DMs. The proposed approach aims to learn both the reconstruction and conversion paths, while ensuring that linguistic information remains unchanged.
Both objective and subjective evaluations confirmed that the proposed method significantly enhanced the quality of the converted speech. In the objective assessment of speaker similarity, CycleDiffusion achieved an improvement ranging from 10.8% to 38.7%. In the objective evaluation of linguistic information retention, it realized a 10.8% relative error reduction in speech recognition. In the MCD assessment, it attained a 15.9% improvement. Finally, in the subjective assessment conducted using MOSNet, the proposed method demonstrated a 5.7% increase in the average MOS value and a 32.1% reduction in the confidence interval.
The time complexity of training the proposed method grows quadratically with respect to the number of training speakers. To reduce the computational burden, we used a subset of each batch when training the conversion paths. In future research, we plan to analyze the trade-off between the amount of data used for conversion path training and the resulting voice conversion performance, aiming to develop an efficient conversion path training strategy. Additionally, another future direction could be integrating cycle consistency loss into a hybrid model that combines DM and GAN.

Author Contributions

Conceptualization, methodology, and writing—original draft preparation, D.Y.; experiment, G.H. (while he was a student at Artificial Intelligence Laboratory in Korea University), H.-P.C. and I.-C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Basic Science Research Program through the National Research Foundation (NRF) of Korea funded by the Ministry of Science, ICT and Future Planning (NRF-2017R1E1A1A01078157). Also, it was supported by the NRF under project BK21 FOUR.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Nomenclature

Symbol | Definition
$x_0$ | Input speech feature vector
$x_t$ and $\tilde{x}_t$ | Forward- and reverse-time stochastic processes at time $t$
$\bar{x}$ | Linguistic information contained in $x_0$
$w$ and $\tilde{w}$ | Forward- and reverse-time Wiener processes
$\beta_t$ | Noise schedule at time $t$
$\zeta$ and $\xi$ | Speaker indices
$S_\theta$ | DNN with parameters $\theta$ approximating the score function
$F$ | Result of the forward diffusion process
$R$ | Solution of the reverse SDE
$G$ | GAN generator
$\lambda_i$ | Weight of the cycle consistency loss at iteration $i$
$\eta$ | Learning rate

References

  1. Mohammadi, S.H.; Kain, A. An overview of voice conversion systems. Speech Commun. 2017, 88, 65–82. [Google Scholar] [CrossRef]
  2. Sisman, B.; Yamagishi, J.; King, S.; Li, H. An overview of voice conversion and its challenges: From statistical modeling to deep learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 29, 132–157. [Google Scholar] [CrossRef]
  3. Kingma, D.; Welling, M. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  4. Hsu, C.; Hwang, H.; Wu, Y.; Tsao, Y.; Wang, H. Voice conversion from non-parallel corpora using variational autoencoder. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Siem Reap, Cambodia, 9–12 December 2016; pp. 1–6. [Google Scholar]
  5. Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. ACVAE-VC: Non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1432–1443. [Google Scholar] [CrossRef]
  6. Yook, D.; Leem, S.; Lee, K.; Yoo, I. Many-to-many voice conversion using cycle consistent variational autoencoder with multiple decoders. In Proceedings of the Odyssey 2020 The Speaker and Language Recognition Workshop, Tokyo, Japan, 1–5 November 2020; pp. 215–221. [Google Scholar] [CrossRef]
  7. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  8. Zhu, J.; Park, T.; Isola, P.; Efros, A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
  9. Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks. In Proceedings of the European Signal Processing Conference, Rome, Italy, 3–7 September 2018; pp. 2100–2104. [Google Scholar] [CrossRef]
  10. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-based non-parallel voice conversion. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Brighton, UK, 12–17 May 2019; pp. 6820–6824. [Google Scholar] [CrossRef]
  11. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC3: Examining and improving CycleGAN-VCs for mel-spectrogram conversion. In Proceedings of the Interspeech, Shanghai, China, 25–29 October 2020; pp. 2017–2021. [Google Scholar] [CrossRef]
  12. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. MaskCycleGAN-VC: Learning non-parallel voice conversion with filling in frames. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toronto, ON, Canada, 6–11 June 2021; pp. 5919–5923. [Google Scholar] [CrossRef]
  13. Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N. StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks. In Proceedings of the IEEE Spoken Language Technology Workshop, Athens, Greece, 18–21 December 2018; pp. 266–273. [Google Scholar] [CrossRef]
  14. Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. StarGAN-VC2: Rethinking conditional methods for StarGAN-based voice conversion. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 679–683. [Google Scholar] [CrossRef]
  15. Lee, S.; Ko, B.; Lee, K.; Yoo, I.; Yook, D. Many-to-many voice conversion using conditional cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 6279–6283. [Google Scholar] [CrossRef]
  16. Jeong, C.; Chang, H.; Yoo, I.; Yook, D. wav2wav: Wave-to-wave voice conversion. Appl. Sci. 2024, 14, 4251. [Google Scholar] [CrossRef]
  17. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  18. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
  19. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  20. Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 2011, 23, 1661–1674. [Google Scholar] [CrossRef] [PubMed]
  21. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  22. Song, Y.; Sohl-Dickstein, J.; Kingma, D.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021. [Google Scholar]
  23. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Computer Vision and Pattern Recognition Conference, New Orleans, LA, USA, 21–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  24. Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Zhang, W.; Cui, B.; Yang, M. Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  25. Popov, V.; Vovk, I.; Gogoryan, V.; Sadekova, T.; Kudinov, M.; Wei, J. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  26. Zhang, X.; Wang, J.; Cheng, N.; Xiao, J. Voice conversion with denoising diffusion probabilistic GAN models. In Proceedings of the International Conference on Advanced Data Mining and Applications, Shenyang, China, 21–23 August 2023; pp. 154–167. [Google Scholar] [CrossRef]
  27. Kameoka, H.; Kaneko, T.; Tanaka, K.; Hojo, N.; Seki, S. VoiceGrad: Non-parallel any-to-many voice conversion with annealed Langevin dynamics. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 2213–2226. [Google Scholar] [CrossRef]
  28. Zhang, J.; Rimchala, J.; Mouatadid, L.; Das, K.; Kumar, S. DECDM: Document enhancement using cycle-consistent diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 8036–8045. [Google Scholar] [CrossRef]
  29. Xu, S.; Ma, Z.; Huang, Y.; Lee, H.; Chai, J. CycleNet: Rethinking cycle consistency in text-guided diffusion for image manipulation. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 10359–10384. [Google Scholar]
  30. CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit. Available online: https://datashare.ed.ac.uk/handle/10283/3443 (accessed on 2 October 2024).
  31. Kong, J.; Kim, J.; Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high-fidelity speech synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 17022–17033. [Google Scholar]
  32. Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
  33. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
  34. Kaldi. Available online: https://kaldi-asr.org (accessed on 2 October 2024).
  35. Radford, A.; Kim, J.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  36. Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Victoria, BC, Canada, 19–21 May 1993; Volume 1, pp. 125–128. [Google Scholar] [CrossRef]
  37. Lo, C.-C.; Fu, S.-W.; Huang, W.-C.; Wang, X.; Yamagishi, J.; Tsao, Y.; Wang, H.-M. MOSNet: Deep learning-based objective assessment for voice conversion. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 1541–1545. [Google Scholar] [CrossRef]
Figure 1. The proposed CycleDiffusion model for a two-speaker case. The solid lines depict the conventional reconstruction path training using $\mathcal{L}_{\mathrm{diffusion}}$, while the dotted and dashed lines represent the novel conversion path training using $\mathcal{L}_{\mathrm{cycle}}$. The proposed method leverages both $\mathcal{L}_{\mathrm{diffusion}}$ and $\mathcal{L}_{\mathrm{cycle}}$. The dotted lines indicate inference paths (i.e., voice conversion), which are also required for cycle consistency training. Specifically, the blue dotted lines indicate the voice conversion from speaker $\zeta$ to speaker $\xi$, whereas the blue dashed lines indicate the voice conversion from speaker $\xi$ to speaker $\zeta$. The red lines show the opposite direction of the blue lines. When the parameters are updated via gradient descent of the cycle consistency loss, the parameters associated with the dotted lines are frozen and those involved in the dashed lines are adjusted.
Figure 2. Sample spectrograms of the converted utterances using DiffVC and CycleDiffusion: (a) an example where the MOS score of the utterance processed by CycleDiffusion is higher than that of DiffVC; (b) an example where the MOS score of the utterance processed by CycleDiffusion is lower than that of DiffVC.
Table 1. Average cosine similarity values along with the 95% confidence intervals for the speech samples converted by DiffVC (baseline) and CycleDiffusion (proposed).
Method | DiffVC | CycleDiffusion
i-vector | 0.4850 ± 0.0049 | 0.5376 ± 0.0047
x-vector | 0.8909 ± 0.0033 | 0.9070 ± 0.0020
Average | 0.6880 ± 0.0041 | 0.7223 ± 0.0034
Table 2. Average speech recognition accuracies along with the 95% confidence intervals for the speech samples converted by DiffVC (baseline) and CycleDiffusion (proposed).
Method | DiffVC | CycleDiffusion
ASR accuracy (%) | 71.3 ± 1.4 | 74.4 ± 1.4
Table 3. Average MCDs along with the 95% confidence intervals for the speech samples converted by DiffVC (baseline) and CycleDiffusion (proposed).
Conversion Direction | DiffVC | CycleDiffusion
Intra-gender: F1 → F2 | 5.10 ± 0.24 | 4.18 ± 0.13
Intra-gender: F2 → F1 | 7.18 ± 0.46 | 6.28 ± 0.38
Intra-gender: M1 → M2 | 5.25 ± 0.20 | 4.52 ± 0.17
Intra-gender: M2 → M1 | 6.28 ± 0.28 | 5.41 ± 0.17
Intra-gender: Average | 5.95 ± 0.29 | 5.10 ± 0.22
Inter-gender: F1 → M1 | 6.12 ± 0.21 | 4.98 ± 0.15
Inter-gender: F1 → M2 | 5.16 ± 0.20 | 4.32 ± 0.13
Inter-gender: F2 → M1 | 5.87 ± 0.26 | 4.85 ± 0.17
Inter-gender: F2 → M2 | 4.88 ± 0.20 | 4.33 ± 0.18
Inter-gender: M1 → F1 | 7.10 ± 0.38 | 6.49 ± 0.42
Inter-gender: M1 → F2 | 5.07 ± 0.20 | 4.36 ± 0.13
Inter-gender: M2 → F1 | 7.16 ± 0.38 | 6.58 ± 0.37
Inter-gender: M2 → F2 | 5.65 ± 0.26 | 4.80 ± 0.17
Inter-gender: Average | 5.88 ± 0.26 | 5.09 ± 0.21
Overall average | 5.90 ± 0.27 | 5.09 ± 0.22
Table 4. Average MOS values, along with the 95% confidence intervals, produced by MOSNet for the speech samples converted by DiffVC (baseline) and CycleDiffusion (proposed).
Conversion Direction | DiffVC | CycleDiffusion
Intra-gender: F1 → F2 | 3.90 ± 0.07 | 4.16 ± 0.03
Intra-gender: F2 → F1 | 3.26 ± 0.06 | 3.19 ± 0.03
Intra-gender: M1 → M2 | 3.75 ± 0.07 | 4.22 ± 0.03
Intra-gender: M2 → M1 | 3.28 ± 0.05 | 3.41 ± 0.07
Intra-gender: Average | 3.55 ± 0.06 | 3.74 ± 0.04
Inter-gender: F1 → M1 | 3.13 ± 0.03 | 3.22 ± 0.04
Inter-gender: F1 → M2 | 3.76 ± 0.07 | 4.25 ± 0.03
Inter-gender: F2 → M1 | 3.09 ± 0.03 | 3.21 ± 0.04
Inter-gender: F2 → M2 | 3.71 ± 0.08 | 4.18 ± 0.04
Inter-gender: M1 → F1 | 3.18 ± 0.04 | 3.13 ± 0.03
Inter-gender: M1 → F2 | 3.91 ± 0.07 | 4.10 ± 0.04
Inter-gender: M2 → F1 | 3.13 ± 0.04 | 3.12 ± 0.04
Inter-gender: M2 → F2 | 3.93 ± 0.08 | 4.20 ± 0.04
Inter-gender: Average | 3.48 ± 0.06 | 3.68 ± 0.04
Overall average | 3.50 ± 0.06 | 3.70 ± 0.04
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
