Dialogue Management and Language Generation for a Robust Conversational Virtual Coach: Validation and User Study
Abstract
1. Introduction
- On the technical side, we propose a closed-domain, planning-based DM to develop complex coaching dialogues. We also present a hybrid NLG based on templates, augmented with Transformer neural postprocessing and with a part-of-speech tagger based on word vector representations, to implement the coaching language of the VC in three languages under the guidance of health professionals. To put these contributions into effect, we developed an elaborate Dialogue Act (DA) taxonomy that routes the communication between the DM and the NLG, together with an efficient integration procedure for assembling the VC.
- Secondly, we carry out a study to test, analyse and validate the VC with the target community, independently living elderly people, in three countries with different languages and cultures: Spain, France and Norway.
2. Related Work
2.1. Dialogue Management and Natural Language Generation
2.2. Virtual Coaches
3. Virtual Coach Overview
3.1. Main Components of the Virtual Coach
3.2. Coaching Approach
- Goal Phase. The VC implements Goal Set Questions (GSQ) and Motivational Questions (MQ) to identify and clarify the user’s goals and their commitment to them.
- Reality Phase. The VC assesses the current situation via Reality Questions (RQ) and builds an understanding of internal obstacles with Obstacle Questions (OQ).
- Options Phase. The VC implements a variety of Option Generation Questions (OGQ) to identify alternatives to achieve the goal defined by the user. It also uses Obstacle Questions (OQ), in this case to assess the difficulties of these potential action plans.
- Will Phase. The VC implements a series of Plan Action Questions (PAQ) aimed at defining an action plan to reach the previously set goal. A minimal mapping of phases to question types is sketched after this list.
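To make the mapping concrete, the following minimal sketch pairs each GROW phase with the question types described above. It is illustrative only; the identifiers are ours, not the system’s actual internal names.

```python
from enum import Enum

class GrowPhase(Enum):
    """The four GROW phases implemented by the VC."""
    GOAL = "Goal"
    REALITY = "Reality"
    OPTIONS = "Options"
    WILL = "Will"

# Question types (later used as DA intent labels) available in each phase.
# The dictionary layout is illustrative, not the VC's internal structure.
PHASE_QUESTIONS = {
    GrowPhase.GOAL:    ["GSQ", "MQ"],   # Goal Set and Motivational Questions
    GrowPhase.REALITY: ["RQ", "OQ"],    # Reality and Obstacle Questions
    GrowPhase.OPTIONS: ["OGQ", "OQ"],   # Option Generation and Obstacle Questions
    GrowPhase.WILL:    ["PAQ"],         # Plan Action Questions
}
```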
4. Dialogue Management to Perform Coaching Strategies
4.1. Dialogue Manager
- Dialogue task specification. It follows a hierarchical plan defined by a tree of dialogue agents, where each agent is responsible for managing a specific subtask. Two kinds of agents can be found in the tree:
  - Internal agents, or non-terminal nodes, represented as blue nodes in Figure 4, encapsulate subsections of the dialogues and control the execution of their children agents;
  - Terminal nodes, mostly represented in red in the figure, implement precise actions: Inform nodes produce an output, Request nodes ask the user for some information, and Expect nodes continuously listen for some information without requesting it. The green nodes in the figure are Execute nodes, which connect to other modules of the SDS.
- Dialogue management. The DM executes a given dialogue task specification tree, which is traversed in depth-first order. This order can, however, be altered by specific preconditions, triggers or success/failure criteria of the internal agents. The DM uses two structures to traverse the tree (see the sketch after this list):
  - A stack that assists the depth-first search by storing the dialogue flow (Figure 5);
  - A dashboard that stores information provided by the user or by external sources, which is useful for keeping the dialogue consistent.
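The following sketch illustrates this traversal scheme. It is a simplified reading of the mechanism described above, not the actual implementation: the class names, dashboard layout and precondition signature are ours.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Agent:
    """A node of the dialogue task specification tree."""
    name: str
    kind: str = "internal"  # internal | inform | request | expect | execute
    children: List["Agent"] = field(default_factory=list)
    precondition: Optional[Callable[[dict], bool]] = None

class DialogueManager:
    """Depth-first execution of a task tree, assisted by a stack and a dashboard."""

    def __init__(self, root: Agent):
        self.stack = [root]    # stores the dialogue flow (cf. Figure 5)
        self.dashboard = {}    # user-provided and external info kept for consistency

    def run_until_request(self) -> Optional[str]:
        """Execute agents until one needs an answer from the user."""
        while self.stack:
            agent = self.stack.pop()
            # Preconditions can alter the plain depth-first order
            if agent.precondition and not agent.precondition(self.dashboard):
                continue
            if agent.kind == "internal":
                # Push children in reverse so they execute left to right
                self.stack.extend(reversed(agent.children))
            elif agent.kind == "inform":
                print(f"DM output: {agent.name}")
            elif agent.kind == "request":
                return agent.name  # hand the turn over to the user
            # expect and execute agents are omitted for brevity
        return None
```

In this reading, an internal agent simply pushes its children, while a Request agent suspends the traversal; after the user answers, the NLU results would be written to the dashboard and run_until_request() called again.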
4.2. Coaching Strategies for the Introductory Dialogue and the Nutrition Scenario
5. Design of the Dialogue Act
5.1. DA Intent
- Introduction. This block includes a single label, denoted Int, and covers all the sentences that can only be found in an introductory dialogue, such as “What is your name?” or “My name is Natalie. Nice to meet you”. Hence, in Table 1, sentences labelled Int appear only at the start of the conversation.
- Task independent. This block includes a single label, denoted Gen. It encloses all the sentences that can appear in any context, such as greetings, thanking or backchannels. Table 1 shows that domain-independent sentences labelled Gen may appear in any part of the dialogue. A minimal programmatic representation of these DAs is sketched below.
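Judging from the DA strings shown in Table 1 (e.g., RQ&curr_sit(<<topic>>=breakfast)), a DA combines an intent block, a name and a set of attributes; the double angle brackets appear to mark control flags and the single ones entity values, although that reading is our assumption. A minimal sketch:

```python
from dataclasses import dataclass, field

@dataclass
class DialogueAct:
    """A DA exchanged between the DM and the NLG,
    e.g. RQ&curr_sit(<<topic>>=breakfast, <food>=apple_toast_oil)."""
    intent: str                                   # e.g. "Int", "Gen", "RQ", "PAQ"
    name: str                                     # e.g. "curr_sit"
    flags: dict = field(default_factory=dict)     # rendered with <<...>>
    entities: dict = field(default_factory=dict)  # rendered with <...>

    def __str__(self) -> str:
        args = [f"<<{k}>>={v}" for k, v in self.flags.items()]
        args += [f"<{k}>={v}" for k, v in self.entities.items()]
        return f"{self.intent}&{self.name}({', '.join(args)})"

da = DialogueAct("RQ", "curr_sit", {"topic": "breakfast"}, {"food": "apple_toast_oil"})
print(da)  # RQ&curr_sit(<<topic>>=breakfast, <food>=apple_toast_oil)
```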
5.2. Attributes
6. GROWsetta: Natural Language Generation for Coaching
6.1. Step One: Augmented Part-of-Speech Tagger Task over the NLU Entities
6.2. Step Two: Transforming Language-Independent Values
6.3. Step Three: Template Selection
6.4. Step Four: From Templates to Sentences
6.5. Step Five: Selecting the Best Sentence (Transformer Postprocessing)
7. Integration
7.1. Infrastructure Components
- Software Containers Framework. In our system, all the components run within Docker containers [79] (orange in Figure 9), except for the components used as external services (ASR and TTS). Docker is a software platform that enables developers to create, test and deploy applications within containers. A container is a packaging format that encapsulates an application along with all its dependencies.
- Events and Message Broker. Roughly speaking, a message broker is a message transfer agent between different applications. For this system, Apache ActiveMQ [80] was chosen because it is open source and allows communication between applications written in various programming/scripting languages. A minimal publish/subscribe sketch follows this list.
- Web Applications/Sites Server. Apache Tomcat (http://tomcat.apache.org/, accessed on 20 January 2023), represented in light yellow in Figure 9, is the web server of the system. It hosts the web page used to access the system and can also run complex applications offered as web services. In our case, such services mainly handle the exchange of multimedia elements (user audio and video, VC audio) between the user devices and the system server.
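As an illustration of the broker-centred communication, the sketch below publishes and consumes a DA message through ActiveMQ’s STOMP connector using the stomp.py client. The queue name, JSON payload and port are assumptions made for the example; the system’s actual message format is not shown here.

```python
import json
import stomp  # pip install stomp.py

class NLGListener(stomp.ConnectionListener):
    # stomp.py >= 8 passes a single frame; older versions pass (headers, message)
    def on_message(self, frame):
        da = json.loads(frame.body)
        print("DA received by the NLG:", da)

# ActiveMQ's STOMP connector typically listens on port 61613;
# credentials are omitted here and should be added as required.
conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("nlg", NLGListener())
conn.connect(wait=True)
conn.subscribe(destination="/queue/dm_to_nlg", id=1, ack="auto")

# The DM side would publish a dialogue act to the same queue
conn.send(destination="/queue/dm_to_nlg",
          body=json.dumps({"da": "RQ&curr_sit(<<topic>>=breakfast)"}))
```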
7.2. Application Components
7.3. Web Sites and Services
- User Interface. The user interface consists of two web pages (Figure 10). Figure 10a shows the page where users choose the avatar of the VC they want to interact with and provide some basic data (user name, gender). Users then interact with the system through the second page, which consists of an avatar and some control buttons (Figure 10b). The avatar is a 3D character designed with CrazyTalk (https://www.reallusion.com/crazytalk/, accessed on 20 January 2023) and imported into Unity [81] with a set of body and facial animations. The control buttons let users start or stop a session as well as record or write their turn. Finally, the user interface accesses the users’ cameras and microphones and sends the audio and video data to the server using the capabilities provided by browsers.
- Audio Interface. This service receives the audio data sent by the user interface and has two main tasks. The first is to serve as an intermediary between the user interface, the speech recogniser service and the NLU. The second is to redirect the audio data to the audio emotion detection component. In addition, it stores the audio data together with the transcriptions on the server for future use.
- Video Interface. The main objective of this service is to redirect video data to both the video emotion analysis and the biometric analysis components. It is also responsible for storing the videos of the sessions on the server for later use.
- Text to Speech Interface. This service acts as an intermediary between the NLG module, the text-to-speech service and the avatar. Whenever it receives a sentence to be uttered by the agent, it sends the sentence to the text-to-speech service, waits for the corresponding audio file and then forwards it to the avatar. Note that this service also sends the associated text string and an animation name along with the audio file; this information travels from the DM to the NLG, and from there to this component.
- Dialogue Manager Interface. In order to start and finish dialogue sessions, the DM needs to receive some specific messages. This service forwards these messages from the user interface to the DM.
7.4. External Services
8. Results
8.1. Participants’ Profile
8.2. General Dialogue Statistics
8.3. Dialogue Flow
8.4. Task-Completion
8.5. NLG Performance
8.5.1. Offline Performance
8.5.2. Performance in the VC Prototype
8.6. Human Acceptance
- VAAQ Pragmatic qualities: focus on the usefulness, usability, and accomplishment of the tasks of the proposed system, in this case, the GROW session;
- VAAQ Hedonic qualities (identity): related to the system’s personality;
- VAAQ Hedonic qualities (feelings): evaluate how captivating the system is, and how the users felt while conversing with it;
- VAAQ Attractiveness: assesses how tempting and attractive the interaction with the agent is;
- Intelligibility: evaluates the system’s output (generated language and voice).
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Preliminary GROWsetta vs. TGen Comparison
Appendix B. GROWsetta Postprocessing Results
Name | Nb. of Tuples | Nb. of Options | Brief Description | Correct Sentence Example | Incorrect Sentence Example |
---|---|---|---|---|---|
es_verb_time | 1000 | 2 | The verb has to match the adverbial of time. | ¿Y qué ha sucedido ayer? | ¿Y qué sucederá ayer? |
es_verb_num | 45 | 2 | The verb conjugation has to match the number of the subject. | ¿Cómo va a ser de momento el desayuno? | ¿Cómo van a ser de momento el desayuno? |
es_det | 250 | 4 | The determinant, if necessary, has to match the attribute. | ¿Qué vas a hacer con el vino? | ¿Qué vas a hacer con los vino? |
es_det_verb | 240 | 6 | es_verb_num and es_det tasks combined. | Así que me dices que te gusta la natación. | Así que me dices que te gustan los natación. |
es_food | 40 | 8 | es_det task with the additional condition that the selected verb makes sense with the attribute. | ¿Cuántas manzanas te gustaría comer? | ¿Cuánta manzanas te gustaría beber? |
fr_verb_time | 1400 | 2 | The verb has to match the adverbial of time. | Et qui était avec vous autrefois? | Et qui sera avec vous autrefois? |
fr_verb_num | 120 | 2 | The verb conjugation has to match the number of the subject. | Que vous ont apporté les vins? | Que vous a apporté les vins? |
fr_det_pron | 1640 | 8 | The determinant and pronoun, if necessary, have to match the attribute. | Dans quelle mesure ce deuxième plat vous rapproche-t-il pour atteindre votre objectif? | Dans quelle mesure les ce deuxième plat vous rapprochent-elles pour atteindre votre objectif? |
fr_food | 567 | 4 | Distinguish between countable and uncountable food names. | Quelle quantité de sucre? | Combien de sucre? |
no_verb_prep | 104 | 4 | The attribute has to fit with the verb and the preposition. Its placement has to be correct as well. | Ønsker du å spise nå? | Ønsket du å spise i nå? |
Accuracy (%) | N-Grams (N = 3) | GPT-2 1 Epoch | GPT-2 2 Epochs
---|---|---|---|
es_verb_time | 52.86 | 63.93 | 80.75 |
es_verb_num | 55.56 | 77.78 | 86.87 |
es_det | 26.55 | 49.09 | 96.00 |
es_det_verb | 29.26 | 60.37 | 98.15 |
es_food | 30.00 | 10.00 | 60.00 |
fr_verb_time | 59.69 | 58.44 | 68.94 |
fr_verb_num | 50.00 | 64.17 | 82.50 |
fr_det_pron | 12.50 | 36.85 | 44.80 |
fr_food | 36.79 | 39.26 | 53.33 |
no_verb_prep | 26.47 | 76.94 | 73.86 |
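The tables above measure how often each model ranks the correct sentence first among the candidate options. The paper’s exact scoring criterion is not reproduced here; a standard approach, sketched below with the HuggingFace transformers API, ranks candidates by language-model loss and keeps the most fluent one. The "gpt2" checkpoint name is a stand-in for the fine-tuned Spanish, French and Norwegian models.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" is a stand-in; the actual system would load the fine-tuned
# per-language checkpoints instead.
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(sentence: str) -> float:
    """Mean per-token cross-entropy of the sentence under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def pick_best(candidates: list) -> str:
    """Keep the candidate the LM finds most fluent (lowest loss)."""
    return min(candidates, key=lm_loss)

options = ["¿Y qué ha sucedido ayer?", "¿Y qué sucederá ayer?"]
print(pick_best(options))
```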
References
- Zorrilla, A.L.; de Velasco Vázquez, M.; Manso, J.I.; Fernández, J.M.O.; Blanco, R.J.; Bara, M.I.T. EMPATHIC: Empathic, Expressive, Advanced Virtual Coach to Improve Independent Healthy-Life-Years of the Elderly. Proces. Leng. Nat. 2018, 61, 167–170.
- Torres, M.I.; Olaso, J.M.; Montenegro, C.; Santana, R.; Vázquez, A.; Justo, R.; Lozano, J.A.; Schlögl, S.; Chollet, G.; Dugan, N.; et al. The EMPATHIC Project: Mid-Term Achievements. In Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments (PETRA ’19), Rhodes, Greece, 5–7 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 629–638.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011; IEEE Signal Processing Society.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
- Bahdanau, D.; Cho, K.H.; Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Young, S.J. Probabilistic methods in spoken–dialogue systems. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 2000, 358, 1389–1402.
- Levin, E.; Pieraccini, R.; Eckert, W. A stochastic model of human–machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process. 2000, 8, 11–23.
- Torres, M.I. Stochastic Bi-Languages to model Dialogs. In Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, St Andrews, Scotland, 15–17 July 2013; Association for Computational Linguistics; pp. 9–17.
- Serras, M.; Torres, M.I.; del Pozo, A. User-aware dialogue management policies over attributed bi-automata. Pattern Anal. Appl. 2019, 22, 1319–1330.
- Young, S. Using POMDPs for dialog management. In Proceedings of the 2006 IEEE Spoken Language Technology Workshop, Palm Beach, FL, USA, 10–13 December 2006; pp. 8–13.
- Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. POMDP-Based Statistical Spoken Dialog Systems: A Review. Proc. IEEE 2013, 101, 1160–1179.
- Adiwardana, D.; Luong, M.T.; So, D.R.; Hall, J.; Fiedel, N.; Thoppilan, R.; Yang, Z.; Kulshreshtha, A.; Nemade, G.; Lu, Y.; et al. Towards a human-like open-domain chatbot. arXiv 2020, arXiv:2001.09977.
- Roller, S.; Dinan, E.; Goyal, N.; Ju, D.; Williamson, M.; Liu, Y.; Xu, J.; Ott, M.; Smith, E.M.; Boureau, Y.L.; et al. Recipes for Building an Open-Domain Chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Kyiv, Ukraine, 19–23 April 2021; pp. 300–325.
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.L.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. arXiv 2022, arXiv:2203.02155.
- Wolf, T.; Sanh, V.; Chaumond, J.; Delangue, C. TransferTransfo: A transfer learning approach for neural network based conversational agents. arXiv 2019, arXiv:1901.08149.
- López Zorrilla, A.; Torres, M.I.; Cuayáhuitl, H. Audio Embedding-Aware Dialogue Policy Learning. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 31, 525–538.
- Verma, S.; Fu, J.; Yang, M.; Levine, S. CHAI: A CHatbot AI for Task-Oriented Dialogue with Offline Reinforcement Learning. arXiv 2022, arXiv:2204.08426.
- Saha, T.; Chopra, S.; Saha, S.; Bhattacharyya, P.; Kumar, P. A large-scale dataset for motivational dialogue system: An application of natural language generation to mental health. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8.
- López Zorrilla, A.; Torres, M.I. A multilingual neural coaching model with enhanced long-term dialogue structure. ACM Trans. Interact. Intell. Syst. 2022, 12, 1–47.
- Dušek, O.; Novikova, J.; Rieser, V. Findings of the E2E NLG Challenge. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg, The Netherlands, 5–8 November 2018; pp. 322–328.
- Balakrishnan, A.; Rao, J.; Upasani, K.; White, M.; Subba, R. Constrained Decoding for Neural NLG from Compositional Representations in Task-Oriented Dialogue. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 831–844.
- Mann, G.; Kishore, B.; Dhillon, P. A natural language generation technique for automated psychotherapy. Lect. Notes Comput. Sci. 2021, 12640, 33–41.
- Fadhil, A.; Schiavo, G.; Wang, Y. CoachAI: A Conversational Agent Assisted Health Coaching Platform. arXiv 2019, arXiv:1904.11961.
- Mohan, S.; Venkatakrishnan, A.; Hartzler, A.L. Designing an AI Health Coach and Studying Its Utility in Promoting Regular Aerobic Exercise. ACM Trans. Interact. Intell. Syst. 2020, 10, 1–30.
- Beun, R.J.; Fitrianie, S.; Griffioen-Both, F.; Spruit, S.; Horsch, C.; Lancee, J.; Brinkman, W.P. Talk and Tools: The best of both worlds in mobile user interfaces for E-coaching. Pers. Ubiquitous Comput. 2017, 21, 661–674.
- Abdulrahman, A.; Richards, D.; Bilgin, A.A. Changing users’ health behaviour intentions through an embodied conversational agent delivering explanations based on users’ beliefs and goals. Behav. Inf. Technol. 2022, 1–19.
- Laranjo, L.; Dunn, A.G.; Tong, H.L.; Kocaballi, A.B.; Chen, J.; Bashir, R.; Surian, D.; Gallego, B.; Magrabi, F.; Lau, A.Y.; et al. Conversational agents in healthcare: A systematic review. J. Am. Med. Inform. Assoc. 2018, 25, 1248–1258.
- Schachner, T.; Keller, R.; von Wangenheim, F. Artificial Intelligence-Based Conversational Agents for Chronic Conditions: Systematic Literature Review. J. Med. Internet Res. 2020, 22, e20701.
- Ruggiano, N.; Brown, E.L.; Roberts, L.; Suarez, C.V.F.; Luo, Y.; Hao, Z.; Hristidis, V. Chatbots to Support People with Dementia and Their Caregivers: Systematic Review of Functions and Quality. J. Med. Internet Res. 2021, 23, e25006.
- Fitzpatrick, K.K.; Darcy, A.; Vierhile, M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Ment. Health 2017, 4, e7785.
- Fulmer, R.; Joerin, A.; Gentile, B.; Lakerink, L.; Rauws, M. Using psychological artificial intelligence (Tess) to relieve symptoms of depression and anxiety: Randomized controlled trial. JMIR Ment. Health 2018, 5, e64.
- Tanaka, H.; Negoro, H.; Iwasaka, H.; Nakamura, S. Embodied conversational agents for multimodal automated social skills training in people with autism spectrum disorders. PLoS ONE 2017, 12, e0182151.
- Easton, K.; Potter, S.; Bec, R.; Bennion, M.; Christensen, H.; Grindell, C.; Mirheidari, B.; Weich, S.; de Witte, L.; Wolstenholme, D.; et al. A virtual agent to support individuals living with physical and mental comorbidities: Co-design and acceptability testing. J. Med. Internet Res. 2019, 21, e12996.
- Rose-Davis, B.; Van Woensel, W.; Stringer, E.; Abidi, S.; Abidi, S.S.R. Using an artificial intelligence-based argument theory to generate automated patient education dialogues for families of children with juvenile idiopathic arthritis. In MEDINFO 2019: Health and Wellbeing e-Networks for All; IOS Press: Amsterdam, The Netherlands, 2019; pp. 1337–1341.
- Brissos, V.; Santos, C.; Santos, J.M.; Guerreiro, M.P. The VASelfCare T2D project plan: Fostering innovation through the StartUp Research program. Procedia Comput. Sci. 2021, 181, 876–881.
- Finzel, R.; Gaydhani, A.; Dufresne, S.; Gini, M.; Pakhomov, S. Conversational Agent for Daily Living Assessment Coaching Demo. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, Kyiv, Ukraine, 19–23 April 2021; pp. 321–328.
- Kimani, E.; Bickmore, T.; Trinh, H.; Pedrelli, P. You’ll be great: Virtual agent-based cognitive restructuring to reduce public speaking anxiety. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 641–647.
- Gaydhani, A.; Finzel, R.; Dufresne, S.; Gini, M.; Pakhomov, S. Conversational Agent for Daily Living Assessment Coaching. In Proceedings of the CEUR Workshop Proceedings, Chennai, India, 22–24 April 2020; Volume 2760, pp. 8–13.
- Inkster, B.; Sarda, S.; Subramanian, V. An empathy-driven, conversational artificial intelligence agent (Wysa) for digital mental well-being: Real-world data evaluation mixed-methods study. JMIR mHealth uHealth 2018, 6, e12106.
- Sinha, C.; Cheng, A.L.; Kadaba, M. Adherence and Engagement with a Cognitive Behavioral Therapy Based Conversational Agent (Wysa) in Adults with Chronic Pain: Survival Analysis. JMIR Form. Res. 2022, 6, e37302.
- Anastasiadou, M.; Alexiadis, A.; Polychronidou, E.; Votis, K.; Tzovaras, D. A prototype educational virtual assistant for diabetes management. In Proceedings of the 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE), Cincinnati, OH, USA, 26–28 October 2020; pp. 999–1004.
- Rehman, U.U.; Chang, D.J.; Jung, Y.; Akhtar, U.; Razzaq, M.A.; Lee, S. Medical instructed real-time assistant for patient with glaucoma and diabetic conditions. Appl. Sci. 2020, 10, 2216.
- Nguyen, T.T.; Sim, K.; Kuen, A.T.Y.; O’Donnell, R.R.; Lim, S.T.; Wang, W.; Nguyen, H.D. Designing AI-based Conversational Agent for Diabetes Care in a Multilingual Context. arXiv 2021, arXiv:2105.09490.
- van Waterschoot, J.; Hendrickx, I.; Khan, M.A.; Klabbers, E.; de Korte, M.; Strik, H.; Cucchiarini, C.; Theune, M. BLISS: An Agent for Collecting Spoken Dialogue Data about Health and Well-being. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 13–15 May 2020; pp. 449–458.
- van Waterschoot, J.; Bruijnes, M.; Flokstra, J.; Reidsma, D.; Davison, D.; Theune, M.; Heylen, D. Flipper 2.0: A Pragmatic Dialogue Engine for Embodied Conversational Agents. In Proceedings of the 18th International Conference on Intelligent Virtual Agents (IVA ’18), Sydney, Australia, 5–8 November 2018; Association for Computing Machinery; pp. 43–50.
- Ireland, D.; Atay, C.; Liddle, J.J.; Bradford, D.; Lee, H.; Rushin, O.; Mullins, T.; Angus, D.; Wiles, J.; McBride, S.; et al. Hello Harlie: Enabling speech monitoring through chat-bot conversations. Stud. Health Technol. Inform. 2016, 227, 55–60.
- Montenegro, C.; López Zorrilla, A.; Mikel Olaso, J.; Santana, R.; Justo, R.; Lozano, J.A.; Torres, M.I. A Dialogue-Act Taxonomy for a Virtual Coach Designed to Improve the Life of Elderly. Multimodal Technol. Interact. 2019, 3, 52.
- Justo, R.; Ben Letaifa, L.; Palmero, C.; Gonzalez-Fraile, E.; Johansen, A.T.; Vázquez, A.; Cordasco, G.; Schlögl, S.; Fernández-Ruanova, B.; Silva, M.; et al. Analysis of the Interaction between Elderly People and a Simulated Virtual Coach. J. Ambient Intell. Humaniz. Comput. 2020, 11, 6125–6140.
- Montenegro, C.; Santana, R.; Lozano, J.A. Analysis of the sensitivity of the End-Of-Turn Detection task to errors generated by the Automatic Speech Recognition process. Eng. Appl. Artif. Intell. 2021, 100, 104189.
- López Zorrilla, A.; Dugan, N.; Torres, M.I.; Glackin, C.; Chollet, G.; Cannings, N. Some ASR experiments using deep neural networks on Spanish databases. In Advances in Speech and Language Technologies for Iberian Languages; IberSPEECH: Lisbon, Portugal, 2016.
- Huang, J.; Tao, J.; Liu, B.; Lian, Z.; Niu, M. Multimodal Transformer Fusion for Continuous Emotion Recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3507–3511.
- Letaifa, L.B.; Torres, M.I. Perceptual Borderline for Balancing Multi-Class Spontaneous Emotional Data. IEEE Access 2021, 9, 55939–55954.
- Greco, C.; Buono, C.; Buch-Cardona, P.; Cordasco, G.; Escalera, S.; Esposito, A.; Fernandez, A.; Kyslitska, D.; Kornes, M.S.; Palmero, C.; et al. Emotional Features of Interactions with Empathic Agents. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–25 June 2021; pp. 2168–2176.
- de Velasco Vázquez, M.; Justo, R.; Torres, M.I. Automatic Identification of Emotional Information in Spanish TV Debates and Human-Machine Interactions. Appl. Sci. 2022, 12, 1902.
- Nasri, M.; Hmani, M.; Mtibaa, A.; Petrovska-Delacrétaz, D.; Slima, M.; Hamida, A. Face Emotion Recognition From Static Image Based on Convolution Neural Networks. In Proceedings of the 2020 5th International Conference on Advanced Technologies for Signal and Image Processing, Sfax, Tunisia, 2–5 September 2020; pp. 1–6.
- Palmero, C.; Selva, J.; Bagheri, M.; Escalera, S. Recurrent CNN for 3D gaze estimation using appearance and shape cues. arXiv 2018, arXiv:1805.03064.
- ASESCO. Spanish Association of Coaching. Available online: http://www.asescoaching.org/ (accessed on 2 November 2021).
- Alexander, G. Behavioural coaching: The GROW model. In Excellence in Coaching: The Industry Guide; Passmore, J., Ed.; Kogan Page: London, UK, 2010; pp. 83–93.
- Whitmore, S.J. Coaching for Performance: Growing Human Potential and Purpose. The Principles and Practice of Coaching and Leadership, 4th ed.; Nicholas Brealey Publishing: London, UK, 2009.
- Sayas, S. Dialogues on Nutrition. Technical Report DP1; Sayasalud and Empathic Project, 2018; in press.
- Sayas, S. Dialogues on Physical Exercise. Technical Report DP2; Sayasalud and Empathic Project, 2018; in press.
- Sayas, S. Dialogues on Leisure and Free Time. Technical Report DP3; Sayasalud and Empathic Project, 2018; in press.
- Bohus, D.; Rudnicky, A.I. The RavenClaw dialog management framework: Architecture and systems. Comput. Speech Lang. 2009, 23, 332–361.
- Raux, A.; Langner, B.; Bohus, D.; Black, A.W.; Eskenazi, M. Let’s go public! Taking a spoken dialog system to the real world. In Proceedings of Interspeech 2005, Lisbon, Portugal, 4–8 September 2005.
- Ghigi, F.; Eskenazi, M.; Torres, M.I.; Lee, S. Incremental dialog processing in a task-oriented dialog. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association (Interspeech), Singapore, 14–18 September 2014.
- Olaso, J.M.; Torres, M.I. User Experience Evaluation of a Conversational Bus Information System in Spanish. In Proceedings of the 8th IEEE International Conference on Cognitive InfoCommunications, Debrecen, Hungary, 11–14 September 2017.
- Olaso, J.M.; Milhorat, P.; Himmelsbach, J.; Boudy, J.; Chollet, G.; Schlögl, S.; Torres, M.I. A Multi-lingual Evaluation of the vAssist Spoken Dialog System: Comparing Disco and RavenClaw. In Dialogues with Social Robots: Enablements, Analyses, and Evaluation; Springer: Singapore, 2016; pp. 221–232.
- Olaso, J.M.; Vázquez, A.; Ben Letaifa, L.; de Velasco, M.; Mtibaa, A.; Hmani, M.A.; Petrovska-Delacrétaz, D.; Chollet, G.; Montenegro, C.; López-Zorrilla, A.; et al. The EMPATHIC Virtual Coach: A Demo. In Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI ’21), Montreal, QC, Canada, 18–22 October 2021; pp. 848–851.
- Justo, R.; Letaifa, L.B.; Olaso, J.M.; López-Zorrilla, A.; Develasco, M.; Vázquez, A.; Torres, M.I. A Spanish Corpus for Talking to the Elderly. In Conversational Dialogue Systems for the Next Decade; D’Haro, L.F., Callejas, Z., Nakamura, S., Eds.; Lecture Notes in Electrical Engineering, Volume 704; Springer: Singapore, 2021; pp. 183–192.
- Stolcke, A.; Ries, K.; Coccaro, N.; Shriberg, E.; Bates, R.; Jurafsky, D.; Taylor, P.; Martin, R.; Ess-Dykema, C.V.; Meteer, M. Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput. Linguist. 2000, 26, 339–373.
- Bunt, H. The DIT++ taxonomy for functional dialogue markup. In Proceedings of the AAMAS 2009 Workshop, Towards a Standard Markup Language for Embodied Dialogue Acts, Budapest, Hungary, 10–15 May 2009; pp. 13–24.
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with Pathways. arXiv 2022, arXiv:2204.02311.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9.
- Lison, P.; Tiedemann, J. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the 10th Language Resources and Evaluation Conference, Portorož, Slovenia, 23–28 May 2016; pp. 923–929.
- Ortiz Suárez, P.J.; Romary, L.; Sagot, B. A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 1703–1714.
- Merkel, D. Docker: Lightweight Linux containers for consistent development and deployment. Linux J. 2014, 2014, 2.
- Snyder, B.; Bosanac, D.; Davies, R. ActiveMQ in Action; Manning Publications: Shelter Island, NY, USA, 2010.
- Haas, J.K. A History of the Unity Game Engine; Worcester Polytechnic Institute: Worcester, MA, USA, 2014.
- World Health Organization. WHOQOL-BREF: Introduction, Administration, Scoring and Generic Version of the Assessment: Field Trial Version, December 1996; Technical Report; World Health Organization: Geneva, Switzerland, 1996.
- Sheikh, J.I.; Yesavage, J.A. Geriatric Depression Scale (GDS): Recent evidence and development of a shorter version. Clin. Gerontol. J. Aging Ment. Health 1986, 5, 165–173.
- Ghandeharioun, A.; Shen, J.H.; Jaques, N.; Ferguson, C.; Jones, N.; Lapedriza, A.; Picard, R. Approximating interactive human evaluation with self-play for open-domain dialog systems. Adv. Neural Inf. Process. Syst. 2019, 32, 13665–13676.
- Esposito, A.; Amorese, T.; Cuciniello, M.; Esposito, A.M.; Troncone, A.; Torres, M.I.; Schlögl, S.; Cordasco, G. Seniors’ acceptance of virtual humanoid agents. In Proceedings of the Italian Forum of Ambient Assisted Living, Lecce, Italy, 2–4 July 2018; pp. 429–443.
- Tainta, M.; Olaso, J.M.; Torres, M.I.; Ecay-Torres, M.; Balluerka, N.; Ros, N.; Izquierdo, M.; Saéz de Asteasu, M.; Etxebarria, U.; Gayoso, L.; et al. The CITA GO-ON trial: A person-centered, digital, intergenerational, and cost-effective dementia prevention multi-modal intervention model to guide strategic policies facing the demographic challenges of progressive aging. In Proceedings of IberSPEECH, Granada, Spain, 14–16 November 2022.
- Gonzalez-Fraile, E.; Gonzalez-Pinto, A.; Tenorio-Laranga, J.; Fernández-Ruanova, B.; Olaso, J.M.; Montenegro, C.; Santana, R.; Vázquez, A.; Justo, R.; Lozano, J.A.; et al. Empathic, expressive, advanced virtual coach to improve independent healthy-life-years of the elderly (the EMPATHIC project: Mid-term achievements). Eur. Psychiatry 2020, 63, S384.
- Dušek, O.; Jurcicek, F. Sequence-to-Sequence Generation for Spoken Dialogue via Deep Syntax Trees and Strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 45–51.
- Dušek, O.; Jurcicek, F. Neural Generation for Czech: Data and Baselines. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October–1 November 2019; pp. 563–574.
Turn | Dialogue Act | Sentence
---|---|---
START OF THE CONVERSATION | |
Coach | Gen&Hello() | Hello!
 | Gen&Inform(<<biometry>>=no_user) | I need you to locate in front of the camera to create your profile. If not, I can’t go on.
 | Gen&Inform(<<biometry>>=accepted_first) | I have already created your profile.
 | Int&self_introduction(<agent_name>=Natalie) | My name is Natalie. Nice to meet you.
 | Int&what_name() | What is your name?
User | | Juanjo (NLU detected entities: {propernouns:[Juanjo]})
Coach | Int&what_name(<<confirm_name>>=True, <user_name>=Juanjo) | So, your name is Juanjo. Is this correct?
User | | yes (NLU detected entities: {})
MID-CONVERSATION (GROW IN PROGRESS) | |
Coach | Gen&Pos_opinion() | Great!
 | Gen&Open_topic(<<eating_routine_in_progress>>=False, <<eat_well>>=True) | It seems that you eat healthy but I would like to know the details of your nutrition routine to see if it can be improved.
 | Gen&Open_topic(<topic>=breakfast) | Let’s focus on your breakfast.
 | RQ&curr_sit(<<topic>>=breakfast) | Tell me what do you usually have for breakfast.
User | | my usual breakfast consists of an apple and ... then a toast with oil (NLU detected entities: {food:[apple, toast, oil]})
Coach | RQ&curr_sit(<<echo>>=True, <<positive>>=True, <action>=eat, <food>=apple_toast_oil) | So, you eat apples and toast and you drink oil.
 | RQ&curr_sit(<<first>>=True, <<topic>>=breakfast, <food>=apple_toast_oil) | Do you think that having apples, toast and oil for breakfast gives you energy for all day?
User | | I think so (NLU detected entities: {})
END OF THE CONVERSATION | |
Coach | RQ&curr_sit(<<echo>>=True, <days_per_week>=2, <user_name>=Juanjo) | Juanjo, two per week?
 | Gen&Pos_feedback(<<confirm>>=True) | Is this correct?
User | | yes (NLU detected entities: {})
Coach | Gen&Pos_opinion() | Perfect!
 | PAQ&what_plan(<<regular_time>>=True) | It’s important to define regular timetables to improve your regularity. Will you do it?
User | | yes (NLU detected entities: {})
Coach | Gen&Pos_opinion() | I like that.
 | Gen&Close_topic(<topic>=nutrition) | Nutrition session has finished.
 | GSQ&what_obj(<<summary>>=True, <action>=eat_regular) | The summary of the session is as follows. You have decided to improve your regularity.
 | PAQ&what_plan(<<hours>>=True, <<action>>=eat_regular, <freq>=2, <food>=fish) | In addition, your first step to improve it is to eat fish twice a week defining a regular timetables for all the meals.
 | Gen&Goodbye(<user_name>=Juanjo) | See you, Juanjo!
Language | Agent Female, User Male | Agent Female, User Female | Agent Male, User Male | Agent Male, User Female
---|---|---|---|---
Spanish | Estoy cansada y veo que tú también estás cansado. | Estoy cansada y veo que tú también estás cansada. | Estoy cansado y veo que tú también estás cansado. | Estoy cansado y veo que tú también estás cansada.
Norwegian | Jeg er sliten og jeg ser at du er sliten. (identical for all gender combinations) | | |
French | Je suis fatiguée et je vois que tu es fatigué. | Je suis fatiguée et je vois que tu es fatiguée. | Je suis fatigué et je vois que tu es fatigué. | Je suis fatigué et je vois que tu es fatiguée.
Realisations of <action>=enjoy_leisure:

Version | Spanish | French | Norwegian
---|---|---|---
Infinitive | disfrutar del tiempo libre | profiter du temps libre | nyte fritiden
Present simple | aprovechas tu tiempo libre | vous vous amusez pendant votre temps libre | nyter fritiden
Antonym infinitive | aburrirte en tu tiempo libre | ne pas profiter du temps libre | ikke nyte fritiden
Antonym present simple | te aburres en el tiempo libre | vous ne profitez pas du temps libre | nyter ikke fritiden
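A plausible implementation of this transformation is a per-language lexicon keyed by the language-independent value, as sketched below with the forms from the table above; the data layout and function name are illustrative, not the system’s actual resource.

```python
# Illustrative lexicon entry for the language-independent value
# <action>=enjoy_leisure; the real resource covers many values and forms.
ACTION_FORMS = {
    "enjoy_leisure": {
        "es": {"infinitive": "disfrutar del tiempo libre",
               "present": "aprovechas tu tiempo libre"},
        "fr": {"infinitive": "profiter du temps libre",
               "present": "vous vous amusez pendant votre temps libre"},
        "no": {"infinitive": "nyte fritiden",
               "present": "nyter fritiden"},
    }
}

def realise(value: str, lang: str, form: str = "infinitive") -> str:
    """Map a language-independent attribute value to a surface form."""
    return ACTION_FORMS[value][lang][form]

print(realise("enjoy_leisure", "es"))  # disfrutar del tiempo libre
```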
Value Attribute | Templates | Generated Sentences | Correct
---|---|---|---
<dates>=yesterday | What are you going to do <dates>? | What are you going to do yesterday? | No
 | What did you do <dates>? | What did you do yesterday? | Yes
<dates>=Mondays | What are you going to do <dates>? | What are you going to do Mondays? | No
 | What are you going to do on <dates>? | What are you going to do on Mondays? | Yes
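The table above shows why template selection must respect attribute constraints: only templates whose tense matches the value of <dates> yield correct sentences. Below is a minimal selection-and-filling sketch with a hypothetical constraint annotation; the template store and tense test are ours.

```python
# Hypothetical template store: each template declares the slot it fills
# and a tense constraint on the attribute value.
TEMPLATES = [
    {"text": "What did you do <dates>?", "tense": "past"},
    {"text": "What are you going to do on <dates>?", "tense": "future"},
]

PAST_DATES = {"yesterday", "last week"}

def select_and_fill(dates_value: str) -> str:
    """Pick the template whose constraint matches the value, then fill it."""
    tense = "past" if dates_value in PAST_DATES else "future"
    template = next(t for t in TEMPLATES if t["tense"] == tense)
    return template["text"].replace("<dates>", dates_value)

print(select_and_fill("yesterday"))  # What did you do yesterday?
print(select_and_fill("Mondays"))    # What are you going to do on Mondays?
```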
 | Spanish | French | Norwegian
---|---|---|---
Amount of raw text | 10 GB | 7 GB | 5 GB (1 GB) |
Number of sentences | 230 M | 121 M | 30 M (14 M) |
Running words | 1.7 B | 1.3 B | 750 M (150 M) |
 | Spain | France | Norway
---|---|---|---
Nb. of participants | 31 | 22 | 26 |
Avg. age | 71.6 | 68.4 | 73.4 |
Female participants | 44.8% | 53.3% | 30.8% |
Avg. quality of life | 68.6 | 65.1 | 75.2 |
Avg. GDS depression level | 4.2 | 6.8 | 3.7 |
Avg. ease of use of PCs | 82.7 | 79.4 | 93.3 |
Language | WoZ Experiments | VC Prototype | Data-Driven Bot
---|---|---|---
Spanish | 12.9 * | 9.5 | 3.6 * |
French | 18.2 * | 7.6 | 3.4 * |
Norwegian | 17.9 * | 5.4 | 5.3 |
Spanish | French | Norwegian |
---|---|---|
4.23 ± 0.80 | 3.77 ± 0.98 | 4.82 ± 1.21 |
Name | Nb. of Tuples | Nb. of Options | Brief Description | Correct Sentence Example | Incorrect Sentence Example |
---|---|---|---|---|---|
es_verb_time | 1000 | 2 | The verb has to match the adverbial of time. | ¿Y qué ha sucedido ayer? | ¿Y qué sucederá ayer? |
fr_verb_num | 120 | 2 | The verb conjugation has to match the number of the subject. | Que vous ont apporté les vins? | Que vous a apporté les vins? |
no_verb_prep | 104 | 4 | The attribute has to fit with the verb and the preposition. Its placement has to be correct as well. | Ønsker du å spise nå? | Ønsket du å spise i nå? |
Accuracy (%) | N-Grams (N = 3) | GPT-2 1 Epoch | GPT-2 2 Epochs
---|---|---|---|
es | 38.85 | 52.23 | 84.35 |
fr | 39.75 | 49.68 | 62.39 |
no | 26.47 | 76.94 | 73.86 |
Spanish | French | Norwegian |
---|---|---|
3.4% | 3.5% | 5.0% |
WoZ/VC Proto. | Pragmatic Qualities | Hedonic Qualities (Identity) | Hedonic Qualities (Feelings) | Attractiveness | Intelligibility |
---|---|---|---|---|---|
Spanish | 63.0/58.5 | 71.7 */65.4 | 62.5 */52.5 | 64.7/61.4 | 71.0/63.6 |
French | 60.8/51.8 | 77.0/71.4 | 64.4 */45.7 | 66.9/61.8 | 62.5/67.8 |
Norwegian | 57.2/47.5 | 70.9/67.0 | 56.9/48.9 | 57.3/50.8 | 64.9/63.6 |
WoZ/VC Proto. | Spanish | French | Norwegian |
---|---|---|---|
I think that communicating with the agent is simple and easy. | 72.3/66.7 | 53.1/57.6 | 66.0/53.8 |
I think that communicating with the agent is useless. † | 70.0/66.7 | 67.9/62.0 | 63.3/54.8 |
I think the agent is very human. | 48.7 */34.2 | 57.7 */40.2 | 48.9/46.2 |
I think that communicating with the agent is enjoyable. | 54.7/63.3 | 60.3/63.0 | 45.1/38.5 |
I think that communicating with the agent is engaging. | 69.6 */56.9 | 72.3 */48.9 | 60.1/52.9 |
I think that communicating with the agent is stressful. † | 76.0/78.3 | 85.3/87.0 | 67.1/55.8 |
The agent can be easily understood. | 88.0/82.5 | 75.0/82.6 | 82.3/82.7 |