Recent advances in machine learning, notably large language models (LLMs), have significantly enhanced intelligent virtual assistant technologies, enabling them to engage users in open-ended conversations with unprecedented success. Most of these technologies are embodied and deployed on a variety of platforms such as smartphones, computers, and freestanding screens, as well as in virtual and mixed reality environments. Such embodied agents are set to become increasingly integrated into our daily lives, assisting us as receptionists in public services, personal trainers in virtual/mixed reality systems, and coaches in physical and mental health activities.
Nonverbal behavior allows us to reinforce verbal content and displayed understanding through descriptive and iconic gestures. It also allows us to build a connection based on empathy and trust between conversation partners, which further increases the effectiveness of communication. For example, in the domain of education and coaching, several studies show that maintaining appropriate nonverbal behavior increases the effectiveness of the learning process [2,3,4]. In education, as well as in other contexts such as tailored care-giving, public reception, and information dissemination, increasingly widespread and integrated extended reality (XR) and immersive applications [4] allow us to exploit the vast knowledge accessible through IT systems in an increasingly engaging and personalized way for individual users via specific topics, idioms, and nonverbal behavior.
Motivated by this, the main purpose of this paper is to increase the capabilities of the virtual agent while it is listening. To achieve this goal, we aimed to mimic natural human behaviors such as nodding while the interlocutor speaks, mirroring their gestures, and other backchanneling behaviors. We achieved this by exploiting the contextual information in the dyadic conversation to better capture the lexical meaning of the dialogue and to extract insights from the conversation as a whole. Unlike the existing literature, we incorporated multimodal inputs taken from both participants of the conversation and introduced a new architecture, TAG2G, which combines text, audio, and gestures to generate new gestures.
1.1. Related Work
To be perceived as appropriate in dyadic conversations, virtual agents need to master both verbal and nonverbal communication. When humans interact in pairs, they naturally take advantage of both communication channels to properly deliver information, make assumptions, and share ideas, while building a strong link with their interlocutors based on trust and shared beliefs [3]. Recent works [6,7] have achieved excellent results by applying data-driven models to verbal communication, leveraging generative pre-trained transformer (GPT) architectures to handle natural language processing (NLP) tasks and thus accurately generate responses and take an active part in conversations. Conversely, despite its importance, the application of data-driven approaches to nonverbal communication is still in its infancy.
Nonverbal behavior is often divided into two different research areas, namely facial expression and body gesture generation. Facial expressions are commonly associated with the mouth and facial muscle movements required to properly pronounce words and articulate letters [1]. Therefore, a strong link exists between speech and facial expressions. Compared to facial expressions, body motion is a multifaceted and complex problem that concerns a large number of gesture movements that do not necessarily exhibit a strong correlation with the ongoing dialogue. Moreover, there is no mechanical connection between the verbal channel and the body gestures that a human employs when interacting with their counterpart. Following this classification, in the remainder of this paper we limit our efforts to body gesture generation only.
In the last decade, many works [1] have explored the significance of gestures in exploiting the nonverbal channel during a conversation. At first, rule-based approaches [8,9,10] were introduced to deploy human-like motions on virtual agents. Such methods rely on a set of predetermined, hard-coded motion features that link the conversation to body motion through a specified set of rules. However, since these control patterns must be hard-coded, they are limited in the diversity of the actions they can produce and, in the long run, they become repetitive and usually lack contextual significance. For this reason, data-driven methods were introduced to expand the variety and quality of motions while, at the same time, gaining the ability to generalize during the learning procedure. This has highlighted the demand for more sophisticated approaches that tackle the gesture generation problem via data-driven methods to match the quality of motion expected when interacting with a human interlocutor.
Data-driven methods for speech-driven gesture generation are enabled by the availability of multiple public datasets, such as BEAT [11] and Talking With Hands 16.2M (TWH) [12]. These datasets are commonly used as benchmarks in competitions such as the “Generation and Evaluation of Nonverbal Behavior for Embodied Agents” (GENEA) challenge [5,13,14]. They are usually composed of heterogeneous streams of data carefully collected to represent the recorded conversations from a multimodal perspective. In particular, gesture data are usually provided along with the speech audio; text transcripts are often included as well, or they can be obtained automatically from the audio itself. Moreover, datasets can contain additional information such as unique participant IDs, emotional labels of the conversation, or ethnicity and other personal details of each participant. Such supplementary information is highly valued by researchers to better describe the context of the conversation.
As highlighted by multiple works in the literature [1,15,16], heterogeneous conversational information, in the form of audio, text transcriptions, and additional details, acts as a set of complementary building blocks to successfully ground nonverbal behavior in verbal communication. Text provides the information needed to root a gesture in the contextual meaning of the conversation, thus driving generation over longer time spans. Conversely, audio-driven gesture generation exhibits very good synchronization with the rhythm of speech, which we, as humans, commonly convey via tempo-related movements such as hand or head shaking. Consequently, speech-driven gesture generation has been shown to rely on audio rhythmically and on text semantically. As a result, multimodal input is often considered the preferred representation of the conversation. In this context, specific procedures are typically used to preprocess the raw signals and extract relevant features [15,16]. Multiple baseline pipelines rely on properties such as prosody, mel-frequency coefficients, and spectral analysis to feed audio features into more complex pipelines. In addition, pre-trained models are used to project the audio into a higher-dimensional latent representation; WavLM [17] and HuBERT [18] are examples of publicly available models suitable for this task. Text is also preprocessed using word-to-vector (Word2Vec) pre-trained models such as Crawl-300D-2M [19] to obtain a semantically meaningful representation of the dialogue. These models are trained to produce embedding spaces in which semantically related words are mapped to nearby points, thereby grounding the model's comprehension in the contextual information of the conversation.
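To make this preprocessing step concrete, the snippet below is a minimal sketch of how audio can be projected into a latent space with a pre-trained WavLM checkpoint and how words can be mapped to Word2Vec-style embeddings with the publicly released Crawl-300D-2M fastText vectors. It assumes the HuggingFace transformers and gensim packages, 16 kHz mono audio, and a locally downloaded vector file; the placeholder inputs and variable names are illustrative only and do not reproduce any specific pipeline from the cited works.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, WavLMModel
from gensim.models import KeyedVectors

# --- Audio: project the raw waveform into WavLM's latent space --------------
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus").eval()

waveform = torch.zeros(16000)  # placeholder: 1 s of silence sampled at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_features = wavlm(**inputs).last_hidden_state  # (1, frames, 768)

# --- Text: map each word to a 300-dimensional Word2Vec-style embedding ------
# crawl-300d-2M.vec is the fastText release trained on Common Crawl.
w2v = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec")
words = "nice to meet you".split()
vecs = np.stack(
    [w2v[w] if w in w2v else np.zeros(300, dtype=np.float32) for w in words]
)
text_features = torch.from_numpy(vecs)  # (num_words, 300)
```

In practice, the two feature streams are usually resampled and aligned to a common frame rate before being passed to the downstream gesture model.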
Regarding the neural network architectures, early works [13,15] employed deep learning models leveraging recurrent neural network (RNN) layers, such as long short-term memory (LSTM) and gated recurrent units (GRUs), to model the complex relationships linking gestures to the conversation and to predict the most appropriate motion. It is worth noting that multiple works [20,21] represented gestures as encoded motion features by leveraging encoder–decoder architectures such as variational auto-encoders (VAEs) and vector-quantized variational auto-encoders (VQVAEs). This approach is particularly interesting because these models learn latent representations that can be further integrated into more complex pipelines [22]. More recently, state-of-the-art generative architectures such as generative adversarial networks (GANs) [23,24] and denoising diffusion probabilistic models (commonly referred to simply as diffusion models) [25,26,27] have been introduced in the domain of gesture generation, following the excellent results obtained in other generative tasks such as text-to-image synthesis. GAN architectures are useful tools for online gesture generation; however, they are known to be hard to train due to the need to balance the generator and discriminator losses. Moreover, they rely heavily on the observed data and fall short when generating previously unseen samples, thus producing repetitive motions.
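As a concrete illustration of the latent gesture representations mentioned above, the following is a minimal, generic PyTorch sketch of the nearest-neighbor quantization step at the heart of a VQVAE. The codebook size, feature dimension, and class name are assumptions chosen for exposition and do not reproduce any specific implementation from the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Generic VQ layer: maps encoder outputs to their nearest codebook entries."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # learnable motion "tokens"
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment loss

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, code_dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])                  # (B*T, D)
        dist = torch.cdist(flat, self.codebook.weight)         # distances to all codes
        idx = dist.argmin(dim=-1)                              # nearest code index
        z_q = self.codebook(idx).view_as(z_e)                  # quantized vectors

        # Standard VQVAE objective terms: codebook loss + commitment loss
        loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())

        # Straight-through estimator: gradients bypass the discrete lookup
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx.view(z_e.shape[:-1]), loss
```

Each codebook entry can be read as a reusable, discrete motion token, which is why such latents are often paired with a second generative model that predicts sequences of code indices rather than raw joint rotations.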
As highlighted in the literature, the current limitations in the field of nonverbal behavior generation span multiple directions, leaving many avenues open for exploration. State-of-the-art approaches [27] yield good results in terms of human-likeness and speaker appropriateness of the generated motions. On the other hand, when assessing the appropriateness of generated gestures while the agent is listening to its interlocutor, the reported results have remained only marginally above chance [5]. Moreover, synthesized motions are scarcely related to the interlocutor's body gestures. When the target agent is listening, characteristic motions such as nodding, backchanneling, and mirroring of the interlocutor's gestures are rarely visible, highlighting the poor naturalness of the generated motions.
In [28], the authors showed that the nonverbal channel can serve as a research tool to extract an interlocutor's unspoken thoughts. Thus, this "hidden information" should be deciphered and better exploited when trying to predict an appropriate gesture for the agent. Nevertheless, only a handful of works have actually explored extending the multimodal input from a single agent to all the people taking part in the interaction [5,23,29,30]. Among these, only [30] proposed training two dedicated models that dynamically interchange depending on whether the main agent is speaking or listening, while the others proposed different approaches to embed the interlocutor's information into gesture generation. Nevertheless, all proposed methods suffered from a lack of appropriateness while the agent is listening.
On the other hand, these methods yield below-average results when addressing prolonged speech delivered by the agent. Conversely, speaker-only approaches show better human-likeness and gesture appropriateness to the speech while the agent is engaged in a protracted monologue. Nevertheless, such approaches struggle to generate accurate and appropriate gestures when the agent is listening to the interlocutor, due to the lack of information from the counterpart [5]. As a result, current methods lack consistency in gesture appropriateness between speaking and listening scenarios. In these terms, interlocutor awareness has the potential to act as a trade-off between gestures that accurately follow the agent's own speech and nonverbal behavior that is appropriate to the interlocutor [30].
1.2. Proposed Contribution
To overcome the limitations of existing works and improve the capabilities of virtual agents in dyadic conversations, we propose an architecture that combines a VQVAE and a diffusion model. The architecture is called TAG2G, since it leverages conversational information in the form of text (T), audio (A), and gestures (G), applied to a dyadic setup (2), to address speech-driven gesture (G) generation.
The advantages brought by the proposed approach are twofold. To the best of our knowledge, we are the first to use a dyadic multimodal input to tackle the co-speech gesture generation task, including text, audio, ID, and past observed gestures to predict the next movement. This improves the appropriateness of the generated gestures. In addition, we introduce a VQVAE to learn a latent representation of gestures in the form of a codebook of atomic actions. This speeds up training and inference compared with other state-of-the-art algorithms.
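For illustration only, the sketch below shows one plausible way to package such dyadic multimodal conditioning (text, audio, speaker ID, and past gestures from both participants) into a single frame-aligned tensor. All shapes, names, and the simple concatenation strategy are assumptions made for exposition and do not describe the actual TAG2G implementation, which is detailed in Section 2.

```python
import torch

def build_dyadic_condition(agent: dict, interlocutor: dict) -> torch.Tensor:
    """Concatenate frame-aligned features from both participants into one conditioning tensor.

    Each dict is assumed to hold tensors of shape (T, D_*):
      'text'     - word embeddings,        e.g. (T, 300)
      'audio'    - audio features,         e.g. (T, 768)
      'id'       - speaker-ID embedding,   e.g. (T, 16)
      'gestures' - past gesture latents,   e.g. (T, 256)
    """
    def stack(person: dict) -> torch.Tensor:
        return torch.cat(
            [person["text"], person["audio"], person["id"], person["gestures"]], dim=-1
        )

    # (T, D_agent + D_interlocutor): the generator conditions on both sides of the dialogue
    return torch.cat([stack(agent), stack(interlocutor)], dim=-1)
```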
The rest of the paper is organized as follows. In Section 2.1, we formally introduce the task of gesture generation extended to a dyadic multimodal input setup. In Section 2.2, we delve into the formalization of the proposed architecture, while the employed methodologies are treated in Section 2.4 and Section 2.5. The materials, as well as the software and hardware platforms used during training, are described in Section 2.6 and Section 2.7. The experimental validation is presented in Section 3, and a discussion of the results and future work is given in Section 4 and Section 5.