DiT-Gesture: A Speech-Only Approach to Stylized Gesture Generation
Abstract
1. Introduction
- We pioneer the use of the large self-supervised pre-trained WavLM model to extract style features directly from raw speech audio, without any manual style annotation (see the WavLM feature-extraction sketch after this list).
- We extend our previous autoregressive Diffmotion model to a non-autoregressive variant, DiT-Gesture. The new model is a diffusion model built on a transformer architecture in which the causal mask is replaced by a dynamic mask attention network (DMAN), improving the adaptive modeling of local frames (see the attention-gating sketch after this list).
- Extensive subjective and objective evaluations show that our model outperforms current state-of-the-art approaches, demonstrating its ability to generate natural, speech-appropriate, and stylized gestures.
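To make the first contribution concrete, the following is a minimal sketch of extracting frame-level speech/style features from a raw waveform with a pre-trained WavLM model. It assumes the Hugging Face transformers interface and the microsoft/wavlm-large checkpoint; the paper's actual feature pipeline (layer selection, pooling, and alignment with the 60/120 fps motion data) is not specified here and may differ.

```python
# Minimal sketch (assumption: Hugging Face `transformers` + torchaudio;
# the paper's exact WavLM configuration and pooling are not given here).
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

waveform, sr = torchaudio.load("speech.wav")                               # (channels, samples)
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    out = wavlm(**inputs, output_hidden_states=True)

frame_feats = out.last_hidden_state      # (1, T_frames, 1024), roughly 50 frames/s
style_embed = frame_feats.mean(dim=1)    # crude utterance-level style summary
print(frame_feats.shape, style_embed.shape)
```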
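For the second contribution, the sketch below only illustrates the general idea of gating self-attention with a learned, locality-biased soft mask instead of a fixed causal mask. The exact DMAN parameterization follows Fan et al. [34] and is not reproduced here; the distance-based gate below is a simplified stand-in.

```python
# Illustrative sketch of locality-biased attention gating (a simplified stand-in
# for the DMAN of Fan et al. [34]; the actual parameterization differs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLocalSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, max_dist: int = 64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative distance; its sigmoid acts as a soft,
        # trainable mask that can emphasize nearby frames instead of hard masking.
        self.dist_bias = nn.Parameter(torch.zeros(max_dist + 1))
        self.max_dist = max_dist

    def forward(self, x):                                   # x: (B, T, dim)
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5     # (B, H, T, T)
        idx = torch.arange(T, device=x.device)
        dist = (idx[None, :] - idx[:, None]).abs().clamp(max=self.max_dist)
        gate = torch.sigmoid(self.dist_bias[dist])                  # (T, T) in (0, 1)
        attn = F.softmax(attn, dim=-1) * gate                       # soft dynamic mask
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.proj(out)
```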
2. Related Work
2.1. Data-Driven Generative Approaches
2.2. Condition Encoding Strategy
2.2.1. Audio Representation
2.2.2. Style Control
3. Proposed Approach
3.1. Problem Formulation
3.2. Model Architecture
3.2.1. Condition Encoder
3.2.2. Gesture Encoder and Decoder
3.3. Transformer with DMAN
3.4. Final Layer
3.5. Training and Inference with Denoising Diffusion Probabilistic Model
Algorithm 1: Training for the whole-sequence gestures.
Input: gesture sequence $x^{0} \sim q(x^{0})$ and the corresponding raw speech audio waveform $a$.
repeat
  Sample $n \sim \mathrm{Uniform}(\{1, \ldots, N\})$ and $\epsilon \sim \mathcal{N}(0, I)$.
  Take a gradient step on $\nabla_{\theta} \left\| \epsilon - \epsilon_{\theta}\!\left(\sqrt{\bar{\alpha}_{n}}\, x^{0} + \sqrt{1 - \bar{\alpha}_{n}}\, \epsilon,\; a,\; n\right) \right\|^{2}$.
until converged

Algorithm 2: Sampling via annealed Langevin dynamics.
Input: noise $x^{N} \sim \mathcal{N}(0, I)$ and the raw speech audio waveform $a$.
for $n = N$ to $1$ do
  if $n > 1$ then sample $z \sim \mathcal{N}(0, I)$ else $z = 0$ end if
  $x^{n-1} = \frac{1}{\sqrt{\alpha_{n}}}\left(x^{n} - \frac{\beta_{n}}{\sqrt{1 - \bar{\alpha}_{n}}}\, \epsilon_{\theta}(x^{n}, a, n)\right) + \sqrt{\tilde{\beta}_{n}}\, z$
end for
Return: $x^{0}$
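To make the two algorithms concrete, below is a minimal, self-contained sketch of DDPM training and ancestral sampling conditioned on an audio feature sequence, following Ho et al. [19]. The denoiser, pose dimensionality, and noise schedule are placeholders; the paper's actual network (condition encoder, gesture encoder/decoder, DMAN transformer) and hyperparameters are not reproduced here.

```python
# Minimal conditional DDPM training/sampling sketch (assumption: a stand-in
# denoiser and schedule; the paper's DiT/DMAN network may differ).
import torch
import torch.nn as nn

N = 1000                                        # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, N)           # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class Denoiser(nn.Module):
    """Placeholder eps_theta(x_n, audio, n); stands in for the DMAN transformer."""
    def __init__(self, pose_dim=75, audio_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )
    def forward(self, x, audio, n):             # x: (B, T, pose), audio: (B, T, audio)
        t = n.float()[:, None, None].expand(-1, x.size(1), 1) / N
        return self.net(torch.cat([x, audio, t], dim=-1))

def training_step(model, opt, x0, audio):       # Algorithm 1
    n = torch.randint(1, N + 1, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alpha_bar[n - 1][:, None, None]
    x_n = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    loss = ((eps - model(x_n, audio, n)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(model, audio):                       # Algorithm 2
    B, T, _ = audio.shape
    x = torch.randn(B, T, 75)
    for n in range(N, 0, -1):
        z = torch.randn_like(x) if n > 1 else torch.zeros_like(x)
        ab, a, b = alpha_bar[n - 1], alphas[n - 1], betas[n - 1]
        eps_hat = model(x, audio, torch.full((B,), n))
        x = (x - b / (1 - ab).sqrt() * eps_hat) / a.sqrt() + b.sqrt() * z
    return x
```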
4. Experiments
4.1. Dataset and Data Processing
4.1.1. Datasets
4.1.2. Speech Audio Data Process
4.1.3. Gesture Data Process
4.2. Model Settings
4.3. Visualization Results
4.4. Comparison
4.4.1. User Study
4.4.2. Objective Evaluation
4.4.3. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Brand, M. Voice Puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 8–13 August 1999; pp. 21–28. [Google Scholar]
- Zhang, F.; Ji, N.; Gao, F.; Li, Y. DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model. In Proceedings of the MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, 9–12 January 2023; Proceedings, Part I. Springer: Berlin/Heidelberg, Germany, 2023; pp. 231–242. [Google Scholar]
- Taylor, S.; Windle, J.; Greenwood, D.; Matthews, I. Speech-Driven Conversational Agents Using Conditional Flow-VAEs. In Proceedings of the European Conference on Visual Media Production, London, UK, 6–7 December 2021; pp. 1–9. [Google Scholar]
- Alexanderson, S.; Nagy, R.; Beskow, J.; Henter, G.E. Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. arXiv 2022, arXiv:2211.09707. [Google Scholar] [CrossRef]
- Alexanderson, S.; Henter, G.E.; Kucherenko, T.; Beskow, J. Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2020; Volume 39, pp. 487–496. [Google Scholar]
- Bhattacharya, U.; Childs, E.; Rewkowski, N.; Manocha, D. Speech2AffectiveGestures: Synthesizing Co-Speech Gestures with Generative Adversarial Affective Expression Learning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 2027–2036. [Google Scholar]
- Yang, S.; Wu, Z.; Li, M.; Zhang, Z.; Hao, L.; Bao, W.; Cheng, M.; Xiao, L. DiffuseStyleGesture: Stylized Audio-Driven Co-Speech Gesture Generation with Diffusion Models. arXiv 2023, arXiv:2305.04919. [Google Scholar]
- Yang, S.; Xue, H.; Zhang, Z.; Li, M.; Wu, Z.; Wu, X.; Xu, S.; Dai, Z. The DiffuseStyleGesture+ entry to the GENEA Challenge 2023. arXiv 2023, arXiv:2308.13879. [Google Scholar]
- Li, J.; Kang, D.; Pei, W.; Zhe, X.; Zhang, Y.; Bao, L.; He, Z. Audio2Gestures: Generating Diverse Gestures From Audio. IEEE Trans. Vis. Comput. Graph. 2023, 14, 1–15. [Google Scholar] [CrossRef] [PubMed]
- Ghorbani, S.; Ferstl, Y.; Holden, D.; Troje, N.F.; Carbonneau, M.A. ZeroEGGS: Zero-Shot Example-Based Gesture Generation from Speech. In Computer Graphics Forum; Wiley Online Library: Hoboken, NJ, USA, 2023; Volume 42, pp. 206–216. [Google Scholar]
- Wagner, P.; Malisz, Z.; Kopp, S. Gesture and speech in interaction: An overview. Speech Commun. 2014, 57, 209–232. [Google Scholar] [CrossRef]
- Ferstl, Y.; Neff, M.; McDonnell, R. Multi-Objective Adversarial Gesture Generation. In Proceedings of the Motion, Interaction and Games, Newcastle upon Tyne, UK, 28–30 October 2019; pp. 1–10. [Google Scholar]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
- Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. arXiv 2020, arXiv:2009.09761. [Google Scholar]
- Rezende, D.; Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 7–9 July 2015; pp. 1530–1538. [Google Scholar]
- Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-Linear Independent Components Estimation. arXiv 2014, arXiv:1410.8516. [Google Scholar]
- Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density Estimation Using Real NVP. arXiv 2016, arXiv:1605.08803. [Google Scholar]
- Li, J.; Kang, D.; Pei, W.; Zhe, X.; Zhang, Y.; He, Z.; Bao, L. Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11293–11302. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, 7–9 July 2015; pp. 2256–2265. [Google Scholar]
- Rasul, K.; Sheikh, A.S.; Schuster, I.; Bergmann, U.; Vollgraf, R. Multivariate probabilistic time series forecasting via conditioned normalizing flows. arXiv 2020, arXiv:2002.06103. [Google Scholar]
- Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. Adv. Neural Inf. Process. Syst. 2019, 32, 1–13. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y. Conformer: Convolution-Augmented Transformer for Speech Recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Ao, T.; Zhang, Z.; Liu, L. GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents. arXiv 2023, arXiv:2303.14613. [Google Scholar] [CrossRef]
- Windle, J.; Greenwood, D.; Taylor, S. UEA Digital Humans Entry to the GENEA Challenge 2022. In Proceedings of the GENEA: Generation and Evaluation of Non-Verbal Behaviour for Embodied Agents Challenge, Bengaluru, India, 7–11 November 2022. [Google Scholar]
- Cambria, E.; Livingstone, A.; Hussain, A. The hourglass of emotions. In Proceedings of the Cognitive Behavioural Systems: COST 2102 International Training School, Dresden, Germany, 21–26 February 2011; Revised Selected Papers. Springer: Berlin/Heidelberg, Germany, 2012; pp. 144–157. [Google Scholar]
- Russell, J. A Circumplex Model of Affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar] [CrossRef]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X. WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
- Chen, Y. Convolutional Neural Network for Sentence Classification. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2015. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Fan, Z.; Gong, Y.; Liu, D.; Wei, Z.; Wang, S.; Jiao, J.; Duan, N.; Zhang, R.; Huang, X. Mask Attention Networks: Rethinking and Strengthen Transformer. arXiv 2021, arXiv:2103.13597. [Google Scholar]
- Langevin, P. Sur la Théorie du Mouvement Brownien. C. R. Acad. Sci. 1908, 146, 530–533. [Google Scholar]
- Ferstl, Y.; McDonnell, R. Investigating the Use of Recurrent Motion Modelling for Speech Gesture Generation. In Proceedings of the 18th International Conference on Intelligent Virtual Agents, Sydney, NSW, Australia, 5–8 November 2018; pp. 93–98. [Google Scholar]
- Liu, H.; Zhu, Z.; Iwamoto, N.; Peng, Y.; Li, Z.; Zhou, Y.; Bozkurt, E.; Zheng, B. BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis. arXiv 2022, arXiv:2203.05297. [Google Scholar]
- Grassia, F.S. Practical Parameterization of Rotations Using the Exponential Map. J. Graph. Tools 1998, 3, 29–48. [Google Scholar] [CrossRef]
- Wennberg, U.; Henter, G.E. The Case for Translation-Invariant Self-Attention in Transformer-Based Language Models. arXiv 2021, arXiv:2106.01950. [Google Scholar]
- Wolfert, P.; Robinson, N.; Belpaeme, T. A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents. IEEE Trans. Hum.-Mach. Syst. 2022, 52, 379–389. [Google Scholar] [CrossRef]
- Kucherenko, T.; Wolfert, P.; Yoon, Y.; Viegas, C.; Nikolov, T.; Tsakov, M.; Henter, G.E. Evaluating Gesture-Generation in a Large-Scale Open Challenge: The GENEA Challenge 2022. arXiv 2023, arXiv:2303.08737. [Google Scholar] [CrossRef]
- Yoon, Y.; Cha, B.; Lee, J.-H.; Jang, M.; Lee, J.; Kim, J.; Lee, G. Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity. ACM Trans. Graph. 2020, 39, 1–16. [Google Scholar]
- Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music-Conditioned 3D Dance Generation with AIST++. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13401–13412. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 1–12. [Google Scholar]
Dataset | Total Time | FPS | Audio Sample Rate | Character | Content
---|---|---|---|---|---
Trinity | 244 min | 60 | 44 kHz | 1 male | Spontaneous speech on different topics |
ZEGGS | 135 min | 60 | 48 kHz | 1 female | 19 different motion styles |
BEAT | 35 h | 120 | 48 kHz | 30 speakers | Speech on diverse content |
Method | Condition Encoding Strategy | Style Control | Architecture |
---|---|---|---
SG [5] | MFCCs | Hand height, hand speed, and gesticulation radius | LSTM architecture combined with normalizing flows |
ZeroEGGS [10] | Log amplitude of spectrogram, mel-frequency scale, and energy of audio | Employs example motion clips to influence style in a zero-shot learning framework | Uses a variational framework to learn style embedding |
CAMN [37] | Raw audio and text | Emotion and Speaker ID label | LSTM-based structure |
DSG [7] | Raw audio with WavLM and linear component | Uses classifier-free guidance to adjust stylistic expression based on discrete style labels | Diffusion model with self-cross local attention |
DSG+ [8] | Similar to DSG but augmented with text semantic information | Similar to DSG; however, it employs categorical labels for the representation of distinct personality roles | Similar to DSG |
Diffmotion [2] | MFCCs | None | LSTM-based diffusion model
DiT-Gesture (ours) | Raw audio with WavLM and Conv component | Raw audio (speech-only) | Diffusion + DMAN transformer
Subjective evaluation metrics (Human Likeness, Appropriateness, Style Appropriateness; mean ± SD) and objective evaluation metrics (FGD, BeatAlign) for each dataset and model.

Dataset | Model | Human Likeness ↑ | Appropriateness ↑ | Style Appropriateness ↑ | FGD ↓ (Feature Space) | FGD ↓ (Raw Data Space) | BeatAlign ↑
---|---|---|---|---|---|---|---
Trinity | GT | 4.32 ± 0.35 | 4.53 ± 0.42 | / | / | / | 0.76
Trinity | SG [5] | 2.11 ± 1.52 | 2.71 ± 0.88 | 2.77 ± 1.15 | 187.32 | 21,568.25 | 0.43
Trinity | DMV1 [2] | 2.90 ± 0.67 | 2.37 ± 1.26 | 2.61 ± 1.21 | 179.52 | 21,356.86 | 0.50
Trinity | DG-W (ours) | 4.30 ± 0.26 | 4.31 ± 0.13 | 4.19 ± 0.82 | 43.52 | 3358.18 | 0.67
Trinity | DG-CM (ours) | 4.01 ± 0.70 | 4.12 ± 0.82 | 4.00 ± 0.25 | 46.45 | 3652.12 | 0.60
Trinity | DG-M (ours) | 4.22 ± 0.50 | 4.22 ± 1.18 | 4.02 ± 0.75 | 53.56 | 3925.66 | 0.61
ZEGGS | GT | 4.50 ± 0.50 | 4.51 ± 0.50 | / | / | / | 0.81
ZEGGS | ZeroEGGS [10] | 4.29 ± 0.77 | 4.26 ± 0.78 | 4.11 ± 0.23 | 32.05 | 2886.56 | 0.62
ZEGGS | DSG [7] | 4.18 ± 0.84 | 4.15 ± 0.92 | 4.02 ± 0.25 | 33.26 | 3011.22 | 0.63
ZEGGS | DG-W (ours) | 4.30 ± 0.72 | 4.27 ± 0.81 | 4.82 ± 0.32 | 31.96 | 2864.70 | 0.68
ZEGGS | DG-CM (ours) | 3.00 ± 1.42 | 2.95 ± 1.41 | 4.11 ± 1.22 | 36.15 | 3021.53 | 0.62
ZEGGS | DG-M (ours) | 2.96 ± 1.40 | 2.95 ± 1.41 | 3.02 ± 1.28 | 47.24 | 3681.95 | 0.61
BEAT | GT | 4.51 ± 0.50 | 4.50 ± 0.50 | / | / | / | 0.83
BEAT | CaMN [37] | 3.49 ± 1.13 | 3.48 ± 1.12 | 3.48 ± 1.12 | 123.63 | 16,873.89 | 0.63
BEAT | DSG+ [8] | 4.25 ± 0.75 | 4.24 ± 0.80 | 4.32 ± 0.73 | 18.04 | 1495.65 | 0.59
BEAT | DG-W (ours) | 4.31 ± 0.73 | 4.30 ± 0.76 | 4.37 ± 0.70 | 18.04 | 1490.70 | 0.66
BEAT | DG-CM (ours) | 4.24 ± 0.43 | 4.16 ± 0.47 | 4.00 ± 0.71 | 38.69 | 2597.23 | 0.61
BEAT | DG-M (ours) | 4.23 ± 0.42 | 4.02 ± 0.70 | 3.99 ± 0.70 | 38.78 | 2619.85 | 0.62
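As background for the objective metrics above, FGD is a Fréchet distance between Gaussian fits of generated and ground-truth gesture feature distributions, analogous to FID [44]. Below is a minimal sketch of that computation; the feature extractor, feature dimensionality, and the paper's exact FGD implementation are assumptions and not taken from the source.

```python
# Minimal Fréchet Gesture Distance sketch (assumption: features are precomputed
# (N, D) arrays; the paper's feature extractor and exact FGD code may differ).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)   # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                             # drop numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Usage (hypothetical arrays of pooled gesture features):
# fgd = frechet_distance(real_features, generated_features)
```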