Cooperative Multi-Agent Reinforcement Learning with Conversation Knowledge for Dialogue Management
Abstract
1. Introduction
- We propose an MADM that optimizes the cooperative policies of an end-to-end dialogue manager and a user simulator concurrently, training both from scratch.
- We apply a reward-shaping technique based on adjacency pairs to the user simulator to speed up learning and to help the MADM produce conversations that resemble normal human-human dialogue.
- We further generalize the one-to-one learning strategy to a one-to-many learning strategy to improve the performance of the trained dialogue manager (an illustrative sketch of this pairing strategy follows this list).
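The one-to-many idea can be illustrated with a toy, self-contained sketch: a single manager is rolled out against a pool of differently behaving simulators, one sampled per dialogue. The policies here (scripted_manager, make_simulator) are hard-coded stand-ins for the paper's trainable LSTM agents, no policy updates are shown, and all names are hypothetical illustrations rather than the authors' code.

```python
import random

def scripted_manager(history):
    # Toy manager policy: ask for the date a few times, then end the dialogue.
    return "bye" if len(history) >= 4 else "ask_date"

def make_simulator(style):
    # Toy simulator policies with different behaviours ("cooperative" vs. "noisy").
    def simulator(sys_act, history):
        if style == "cooperative" and sys_act == "ask_date":
            return "inform_date"
        return random.choice(["inform_date", "deny", "error"])
    return simulator

def run_dialogue(manager, simulator, max_turns=10):
    # Roll out one dialogue between a manager policy and a simulator policy.
    history = []
    for _ in range(max_turns):
        sys_act = manager(history)
        usr_act = simulator(sys_act, history)
        history.append((sys_act, usr_act))
        if sys_act == "bye":
            break
    return history

def one_to_many_epoch(manager, simulator_pool, dialogues=100):
    # One-to-many learning: each dialogue pairs the single manager with a
    # simulator sampled from the pool, exposing it to varied partner behaviour.
    logs = []
    for _ in range(dialogues):
        simulator = random.choice(simulator_pool)
        logs.append(run_dialogue(manager, simulator))
    return logs  # in the full model, both agents' policies would be updated from these rollouts

pool = [make_simulator("cooperative"), make_simulator("noisy")]
print(len(one_to_many_epoch(scripted_manager, pool)))  # -> 100
```

Sampling a different partner per episode is the essence of the one-to-many strategy; the intended effect is a manager that does not overfit to a single partner's behaviour.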
2. Related Work
3. Model
3.1. Notation
3.2. Multi-Agent Dialogue Model (MADM)
3.2.1. Dialogue Manager
3.2.2. User Simulator
3.3. Cooperative Training
- The manager reward and the simulator reward are both set to the success reward if the resulting state is a successfully completed state.
- The manager reward and the simulator reward are both set to the failure reward if the dialogue does not reach a successfully completed state within the maximum length T.
- The manager reward and the simulator reward are both set to the per-turn reward otherwise.
- The simulator reward is the per-turn reward if the resulting state is non-terminal and the action pair does not belong to the set of adjacency pairs.
- The simulator reward is the per-turn reward plus the shaping reward if the resulting state is non-terminal and the action pair belongs to the set of adjacency pairs, where the shaping reward is greater than zero.
- The manager reward and the simulator reward are both set to the success reward if the resulting state is a successfully completed state.
- The manager reward and the simulator reward are both set to the failure reward if the dialogue does not reach a successfully completed state within the maximum length T.
- The manager reward is the per-turn reward if the resulting state is non-terminal (a toy sketch of this reward scheme, with placeholder magnitudes, follows this list).
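The reward scheme above can be made concrete with a small, self-contained sketch. The magnitudes below (R_SUCCESS, R_FAIL, R_TURN, R_SHAPE) are placeholders rather than the paper's actual values, ADJACENCY_PAIRS lists only a few illustrative (manager act, user act) pairs drawn from the dialogue-act inventory, and step_rewards is a hypothetical helper, not the authors' implementation.

```python
# Illustrative adjacency pairs: a manager act followed by a user act that
# conversation-analysis knowledge marks as its expected second part.
ADJACENCY_PAIRS = {
    ("ask_date", "inform_date"),
    ("ask_location", "inform_location"),
    ("ask_attendance", "inform_attendance"),
    ("confirm_date", "affirm"),
}

# Placeholder reward magnitudes (not the paper's values).
R_SUCCESS, R_FAIL, R_TURN, R_SHAPE = 20.0, -20.0, -1.0, 0.5

def step_rewards(sys_act, usr_act, turn, max_turns, task_completed):
    """Return (manager_reward, simulator_reward, done) for one exchange."""
    if task_completed:                        # successfully completed state
        return R_SUCCESS, R_SUCCESS, True
    if turn + 1 >= max_turns:                 # not completed within T turns
        return R_FAIL, R_FAIL, True
    r_m = R_TURN                              # per-turn reward for the manager
    r_u = R_TURN                              # base per-turn reward for the simulator
    if (sys_act, usr_act) in ADJACENCY_PAIRS:
        r_u += R_SHAPE                        # shaping bonus for a valid adjacency pair
    return r_m, r_u, False

# Example: the simulator answers a date question mid-dialogue.
print(step_rewards("ask_date", "inform_date", turn=3, max_turns=30,
                   task_completed=False))    # -> (-1.0, -0.5, False)
```

Because the shaping term is added only to the simulator's reward, the manager's return stays tied purely to task success, while the simulator is nudged toward replies that form valid adjacency pairs.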
4. Experiment
4.1. Dataset
4.2. Cross-Model Evaluation with Human Users Involved
4.2.1. Users for Cross-Model Evaluation
4.2.2. Dialogue Managers for Cross-Model Evaluation
4.2.3. Results
4.2.4. Good Case Study
4.3. Ablation
4.3.1. Adjacency Pair Performance
4.3.2. Comparison of Various Simulator Settings in One-to-Many Learning
4.3.3. One-to-One Learning vs. One-to-Many Learning
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| NLU | Natural Language Understanding |
| DM | Dialogue Management |
| NLG | Natural Language Generation |
| DRL | Deep Reinforcement Learning |
| MADM | Multi-Agent Dialogue Model |
| HRNN | Hierarchical Recurrent Neural Network |
| LSTM | Long Short-Term Memory |
| DNN | Deep Neural Network |
| MLP | Multi-Layer Perceptron |
| SR | Success Rate |
| AT | Average Turns |
References
- Williams, J.D.; Young, S. Scaling POMDPs for spoken dialog management. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2116–2129.
- Young, S.; Gasic, M.; Thomson, B.; Williams, J.D. POMDP-Based Statistical Spoken Dialog Systems: A Review. Proc. IEEE 2013, 101, 1160–1179.
- Gašić, M.; Breslin, C.; Henderson, M.; Kim, D.; Szummer, M.; Thomson, B.; Tsiakoulis, P.; Young, S. On-line policy optimisation of bayesian spoken dialogue systems via human interaction. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8367–8371.
- Fatemi, M.; Asri, L.E.; Schulz, H.; He, J.; Suleman, K. Policy networks with two-stage training for dialogue systems. arXiv 2016, arXiv:1606.03152.
- Su, P.H.; Budzianowski, P.; Ultes, S.; Gasic, M.; Young, S. Sample-efficient Actor-Critic Reinforcement Learning with Supervised Data for Dialogue Management. arXiv 2017, arXiv:1707.00130.
- Casanueva, I.; Budzianowski, P.; Su, P.H.; Mrkšić, N.; Wen, T.H.; Ultes, S.; Rojas-Barahona, L.; Young, S.; Gašić, M. A benchmarking environment for reinforcement learning based task oriented dialogue management. arXiv 2017, arXiv:1711.11023.
- Weisz, G.; Budzianowski, P.; Su, P.H.; Gasic, M. Sample Efficient Deep Reinforcement Learning for Dialogue Systems With Large Action Spaces. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 2083–2097.
- Peng, B.; Li, X.; Gao, J.; Liu, J.; Chen, Y.N.; Wong, K.F. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6149–6153.
- Peng, B.; Li, X.; Gao, J.; Liu, J.; Wong, K.F. Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning. arXiv 2018, arXiv:1801.06176.
- Liu, B.; Lane, I. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; pp. 482–489.
- Ng, A.Y.; Harada, D.; Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), Bled, Slovenia, 27–30 June 1999; pp. 278–287.
- Liddicoat, A.J. Adjacency pairs. In An Introduction to Conversation Analysis; Bloomsbury Publishing: London, UK, 2011; pp. 143–145.
- Zhao, T.; Eskenazi, M. Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning. arXiv 2016, arXiv:1606.02560.
- Williams, J.D.; Atui, K.A.; Zweig, G. Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. arXiv 2017, arXiv:1702.03274.
- Dhingra, B.; Li, L.; Li, X.; Gao, J.; Chen, Y.N.; Ahmad, F.; Deng, L. Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access. arXiv 2017, arXiv:1609.00777.
- Yang, X.; Chen, Y.N.; Hakkani-Tür, D.; Crook, P.; Li, X.; Gao, J.; Deng, L. End-to-end joint learning of natural language understanding and dialogue manager. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5690–5694.
- Pietquin, O.; Geist, M.; Chandramohan, S. Sample efficient on-line learning of optimal dialogue policies with kalman temporal differences. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 1878–1883.
- Scheffler, K.; Young, S. Automatic learning of dialogue strategy using dialogue simulation and reinforcement learning. In Proceedings of the Second International Conference on Human Language Technology Research, San Diego, CA, USA, 24–27 March 2002; pp. 12–19.
- Cuayáhuitl, H.; Renals, S.; Lemon, O.; Shimodaira, H. Human-computer dialogue simulation using hidden markov models. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, San Juan, Puerto Rico, 27 November–1 December 2005; pp. 290–295.
- Pietquin, O.; Dutoit, T. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 589–599.
- Keizer, S.; Gasic, M.; Mairesse, F.; Thomson, B.; Yu, K.; Young, S. Modelling user behaviour in the HIS-POMDP dialogue manager. In Proceedings of the 2008 IEEE Spoken Language Technology Workshop, Goa, India, 15–19 December 2008; pp. 121–124.
- Chandramohan, S.; Geist, M.; Lefèvre, F.; Pietquin, O. User Simulation in Dialogue Systems Using Inverse Reinforcement Learning. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011; pp. 1025–1028.
- El Asri, L.; He, J.; Suleman, K. A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems. Interspeech 2016, 1151–1155.
- Kreyssig, F.; Casanueva, I.; Budzianowski, P.; Gasic, M. Neural User Simulation for Corpus-based Policy Optimisation of Spoken Dialogue Systems. arXiv 2018, arXiv:1805.06966.
- Schatzmann, J.; Thomson, B.; Weilhammer, K.; Ye, H.; Young, S. Agenda-based user simulation for bootstrapping a POMDP dialogue system. NAACL-HLT 2007, 149–152.
- English, M.S.; Heeman, P.A. Learning mixed initiative dialog strategies by using reinforcement learning on both conversants. EMNLP 2005, 1011–1018.
- Chandramohan, S.; Geist, M.; Lefèvre, F.; Pietquin, O. Co-adaptation in spoken dialogue systems. In Natural Interaction with Robots, Knowbots and Smartphones; Springer: New York, NY, USA, 2014; pp. 343–353.
- Das, A.; Kottur, S.; Moura, J.M.; Lee, S.; Batra, D. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2951–2960.
- Kottur, S.; Moura, J.; Lee, S.; Batra, D. Natural language does not emerge ‘naturally’ in multi-agent dialog. arXiv 2017, arXiv:1706.08502.
- Georgila, K.; Nelson, C.; Traum, D. Single-agent vs. multi-agent techniques for concurrent reinforcement learning of negotiation dialogue policies. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 22–27 June 2014; pp. 500–510.
- Lewis, M.; Yarats, D.; Dauphin, Y.; Parikh, D.; Batra, D. Deal or No Deal? End-to-End Learning of Negotiation Dialogues. arXiv 2017, arXiv:1706.05125.
- Bansal, T.; Pachocki, J.; Sidor, S.; Sutskever, I.; Mordatch, I. Emergent complexity via multi-agent competition. arXiv 2018, arXiv:1710.03748.
- Bernstein, D.S.; Givan, R.; Immerman, N.; Zilberstein, S. The complexity of decentralized control of Markov decision processes. Math. Oper. Res. 2002, 27, 819–840.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Sutton, R.S.; Barto, A.G. Policy gradient methods. In Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 1998; pp. 329–331.
- Schatzmann, J.; Stuttle, M.N.; Weilhammer, K.; Young, S. Effects of the user model on simulation-based learning of dialogue strategies. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, San Juan, Puerto Rico, 27 November–1 December 2005; pp. 220–225.
- Li, X.; Lipton, Z.C.; Dhingra, B.; Li, L.; Gao, J.; Chen, Y.N. A user simulator for task-completion dialogues. arXiv 2016, arXiv:1612.05688.
- Kim, Y. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2019, arXiv:1810.04805.
- Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.R.; Le, Q.V. XLNet: Generalized autoregressive pretraining for language understanding. In Proceedings of the 2019 Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 5754–5764.
| Agent | Dialogue Acts |
|---|---|
| Dialogue manager | ask_date, ask_location, ask_attendance, confirm_date, confirm_location, confirm_attendance, bye |
| User simulator | inform_date, inform_location, inform_attendance, update_date, update_location, update_attendance, affirm, deny, error, hello |
| Manager | Group-S SR | Group-S AT | Rule-S SR | Rule-S AT | Human Users SR | Human Users AT |
|---|---|---|---|---|---|---|
| M-MADM-OM | 0.902 | 18.86 | 0.925 | 17.28 | 0.84 | 18.04 |
| Rule-M | 0.582 | 24.94 | 0.945 | 17.04 | 0.76 | 19.56 |
| Yang2017 | 0.577 | 25.03 | 0.860 | 21.56 | 0.68 | 21.08 |
| Zhao2016 | 0.433 | 27.77 | 0.890 | 20.99 | 0.68 | 20.02 |
| Peng2018 | 0.428 | 27.86 | 0.945 | 18.44 | 0.72 | 20.32 |
| Turn | M-MADM-OM | Turn | Rule-M |
|---|---|---|---|
| 1 | SYS: Hello, where is the meeting? | 1 | SYS: Hello, where is the meeting? |
| 2 | USR: The location is in bupt technology building. | 2 | USR: The location is in bupt technology building. |
| 3 | SYS: Please confirm, the meeting place is in bupt technology building? | 3 | SYS: Please confirm, the meeting place is in bupt technology building? |
| 4 | USR: Yes. | 4 | USR: Yes. |
| 5 | SYS: How many people are there? | 5 | SYS: How many people are there? |
| 6 | USR: The meeting will be held at 10 a.m. on 5 April 2019. | 6 | USR: The meeting will be held at 10 a.m. on 5 April 2019. |
| 7 | SYS: Will the meeting be held at 10 a.m. on 5 April 2019? | 7 | SYS: How many people are there? |
| 8 | USR: Sorry, please change to 2 p.m. on 5 April 2019. | 8 | USR: The meeting will be held at 10 a.m. on 5 April 2019. |
| 9 | SYS: Please confirm, the meeting will be held at 2 p.m. on 5 April 2019? | 9 | SYS: How many people are there? |
| 10 | USR: Perfect. | … | |
| 11 | SYS: How many people are there? | … | |
| 12 | USR: Eight. | … | |
| 13 | SYS: Is there eight people to participate? | 32 | SYS: How many people are there? |
| 14 | USR: For sure. | | |
| 15 | SYS: Booked at 14:00 on 5 April 2019, sixth meeting room at bupt technology building for you. If there are any questions, please call 010-52810888, good bye. | | |
| | Success | | Failure: the system does not book the right meeting room within the maximum number of turns |
| | Human Users SR | Human Users AT |
|---|---|---|
| All | 0.80 | 19.80 |
| Ask | 0.62 | 23.22 |
| Confirm | 0.30 | 27.30 |
| Bye | 0.22 | 20.00 |
| Naive | 0.32 | 21.30 |
| Rule-M | 0.76 | 19.56 |
| | Group-S SR | Group-S AT | Rule-S SR | Rule-S AT | Corresponding-S SR | Corresponding-S AT |
|---|---|---|---|---|---|---|
| M-MADM-OM | 0.902 | 18.86 | 0.925 | 17.28 | 0.902 | 18.86 |
| all&ask | 0.875 | 19.42 | 0.895 | 19.01 | 0.905 | 18.93 |
| all&confirm | 0.860 | 19.64 | 0.895 | 18.95 | 0.910 | 18.92 |
| all&bye | 0.825 | 20.34 | 0.905 | 18.65 | 0.905 | 18.85 |
| all&naive | 0.835 | 20.07 | 0.880 | 19.27 | 0.900 | 18.94 |
| ask&confirm | 0.825 | 24.94 | 0.645 | 23.73 | 0.895 | 18.91 |
| ask&bye | 0.760 | 21.43 | 0.645 | 25.52 | 0.900 | 18.89 |
| ask&naive | 0.815 | 20.02 | 0.550 | 24.73 | 0.895 | 18.93 |
| confirm&bye | 0.730 | 22.08 | 0.590 | 18.95 | 0.895 | 18.91 |
| confirm&naive | 0.725 | 22.23 | 0.505 | 17.04 | 0.905 | 18.89 |
| Manager | MADM-S SR | MADM-S AT | Group-S SR | Group-S AT | Rule-S SR | Rule-S AT | Human Users SR | Human Users AT |
|---|---|---|---|---|---|---|---|---|
| M-MADM-OM | 0.980 | 17.38 | 0.902 | 18.86 | 0.925 | 17.28 | 0.84 | 18.04 |
| M-MADM-OO | 0.975 | 17.47 | 0.775 | 21.27 | 0.935 | 18.23 | 0.78 | 19.80 |
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).