Author Contributions
Conceptualization, Z.W. and J.D.C.; methodology, Z.W. and J.D.C.; software, Z.W.; formal analysis, Z.W. and J.D.C.; investigation, Z.W. and N.K.; resources, Z.W., N.K. and T.C.; data curation, Z.W.; writing—original draft preparation, Z.W. and J.D.C.; supervision, J.D.C.; project administration, J.D.C.; funding acquisition, J.D.C. and T.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by InitialView.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Our annotated corpus and annotation guidelines are publicly available under the Apache 2.0 license at https://github.com/emorynlp/MLCG (accessed on 7 August 2023).
Acknowledgments
We gratefully acknowledge the support of the InitialView grant.
Conflicts of Interest
The authors declare no conflicts of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Figure 1. The interviewees’ age demographics.
Figure 2. The interviewees’ country demographics.
Figure 3. The interviewees’ gender demographics.
Figure 4. Distributions of the diarization error types; Section 5.2 provides examples of each error type.
Figure 5. Overview of our diarization model.
Figure 6. Overview of our dialogue generation model.
Table 1. Distributions of our data. D: number of dialogues; U: average number of utterances per dialogue; S1/S2: average number of tokens per utterance by the interviewer/interviewee. TRN/DEV/TST: training/development/evaluation (annotated) sets; RAW: unannotated set (auto-transcribed).

| Split | D | U | S1 | S2 |
|---|---|---|---|---|
| TRN | 140 | 43.8 | 39.3 | 64.0 |
| DEV | 150 | 45.0 | 36.2 | 60.3 |
| TST | 150 | 44.3 | 37.8 | 61.3 |
| RAW | 6921 | 40.4 | 41.5 | 67.6 |
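For readers who wish to reproduce such statistics on their own transcripts, the following is a minimal Python sketch. It assumes a hypothetical corpus layout in which each dialogue is a list of (speaker, utterance) pairs and tokens are approximated by whitespace splitting; it is an illustration, not the authors' code.

```python
from statistics import mean

def split_stats(dialogues):
    """Compute (D, U, S1, S2) for one data split.

    dialogues: list of dialogues; each dialogue is a list of
    (speaker, utterance) pairs with speaker in {"S1", "S2"}.
    """
    d = len(dialogues)                              # D: number of dialogues
    u = mean(len(dlg) for dlg in dialogues)         # U: avg utterances per dialogue
    s1 = mean(len(utt.split()) for dlg in dialogues
              for spk, utt in dlg if spk == "S1")   # S1: avg tokens per interviewer utterance
    s2 = mean(len(utt.split()) for dlg in dialogues
              for spk, utt in dlg if spk == "S2")   # S2: avg tokens per interviewee utterance
    return d, u, s1, s2

# Toy usage with a single two-turn dialogue:
trn = [[("S1", "Tell me about your leadership experiences."),
        ("S2", "Sure, when I was a junior, I was elected to the student union.")]]
print("D=%d U=%.1f S1=%.1f S2=%.1f" % split_stats(trn))
```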
Table 2. Distributions of filler words with regard to diarization errors. Dist: percentage of dialogues containing # utterances with errors caused by filler words; each filler-word column gives the percentage of that word among the filler words appearing in the corresponding dialogue group.

| # | Dist | okay | yeah | right | um | so | uh | well | like | oh |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 40.4 | 46.7 | 16.0 | 8.0 | 8.5 | 8.0 | 4.4 | 4.7 | 0.2 | 3.1 |
| 3 | 35.9 | 33.3 | 29.8 | 3.9 | 8.5 | 11.1 | 6.2 | 2.3 | 0.6 | 4.1 |
| 4 | 8.6 | 33.7 | 24.5 | 5.6 | 9.7 | 11.2 | 5.1 | 3.6 | 1.0 | 5.1 |
| 5 | 7.3 | 28.9 | 30.7 | 6.6 | 4.2 | 15.1 | 6.6 | 4.8 | 0.6 | 2.4 |
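As a rough illustration of how the per-group filler-word percentages above could be tallied, here is a short Python sketch. The filler list comes from the table header, while the flat utterance format and the notion of "utterances flagged with filler-word errors" are simplifying assumptions for the example, not the authors' annotation scheme.

```python
from collections import Counter

FILLERS = {"okay", "yeah", "right", "um", "so", "uh", "well", "like", "oh"}

def filler_percentages(error_utterances):
    """error_utterances: utterances whose diarization errors were attributed
    to filler words (hypothetical input, for illustration only)."""
    counts = Counter()
    for utt in error_utterances:
        for token in utt.lower().split():
            word = token.strip(".,!?")          # drop trailing punctuation
            if word in FILLERS:
                counts[word] += 1
    total = sum(counts.values())
    # Percentage of each filler word among all filler occurrences in the group.
    return {w: 100.0 * c / total for w, c in counts.items()} if total else {}

# Toy usage:
print(filler_percentages(["Okay. yeah, um, so I went there.", "Right, okay."]))
```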
Table 3. Distributions of the pseudo-created datasets (Switchboard, BST) and our interview data (before and after diarization). D: number of dialogues; U: average number of utterances; S1/S2: average number of tokens per utterance by S1/S2. TRN/DEV/TST: training/development/evaluation (annotated) sets; RAW: unannotated set. We follow the splits suggested by the original Switchboard and BST papers for comparability.

| Dataset | Split | D | U | S1 | S2 |
|---|---|---|---|---|---|
| Switchboard | TRN | 1115 | 42.2 | 31.6 | 31.3 |
| Switchboard | DEV | 21 | 16.6 | 49.4 | 43.7 |
| Switchboard | TST | 19 | 32.7 | 32.9 | 32.9 |
| BlendedSkillTalk (BST) | TRN | 4819 | 6.2 | 25.0 | 25.2 |
| BlendedSkillTalk (BST) | DEV | 1009 | 6.1 | 25.8 | 25.3 |
| BlendedSkillTalk (BST) | TST | 980 | 6.2 | 26.2 | 26.2 |
| Interview data (before) | TRN | 140 | 42.2 | 37.4 | 73.1 |
| Interview data (before) | DEV | 150 | 44.1 | 34.9 | 67.3 |
| Interview data (before) | TST | 150 | 44.2 | 34.2 | 69.0 |
| Interview data (before) | RAW | 6921 | 39.6 | 38.2 | 75.1 |
| Interview data (after) | TRN | 140 | 43.8 | 39.3 | 64.0 |
| Interview data (after) | DEV | 150 | 45.0 | 36.2 | 60.3 |
| Interview data (after) | TST | 150 | 44.3 | 37.8 | 61.3 |
| Interview data (after) | RAW | 6921 | 40.4 | 41.5 | 67.6 |
Table 4. An interview dialogue conducted by our best model (CT in Section 4). S1/S2: interviewer/interviewee (chatbot/human); B/E: beginning/ending utterance (chatbot); Q: topical question (chatbot).

| Flag | Utterance |
|---|---|
| B | Hello, and thank you so much for coming today. uh, Let us start by having you just to state your full legal name and also the school that you are currently attending? |
| S2 | Sure, my name is David, and I go to Shanghai high school in China. |
| Q | Great. um, yeah, let’s talk about leadership experiences. Can you give me some specific examples? |
| S2 | Sure, when I was a junior, I was elected to be the Vice President in the Student Union and I organized many activities to help students to enrich their life on campus. |
| S1 | Okay, great. Thank you so much for sharing that. so, um, tell me a little bit about your leadership style. What do you think you bring to the student union? |
| S2 | Sure, I am very open-minded, and with some of my colleagues, we together established a lot of new programs on top of our existing ones. Many of those programs became very popular. |
| S1 | Ok, great. and so, um, what, uh, what would you say has been the most challenging aspect of being a leader? |
| ⋯ | |
| Q | Great. uh, um, hmm, let’s talk about your courses. What courses did you take in high school and which courses do you like most? |
| S2 | Sure, I took Math, English, PE, Chemistry, Art, and so on. I like Math most. |
| Q | Okay. so, um, tell me a little bit about your, your interest in mathematics. |
| ⋯ | |
| E | Okay, great. well, thank you very much for sharing that with me. um, and thank you for taking the time out of your day to interview with us. |
Table 5. Diarization model performance. Ours: trained on the TRN set of our interview data (after diarization) in Table 3. Transferred: trained first on the TRN mixture of Switchboard and BST, then finetuned on the TRN set of our data.

| | Baseline | Context | Joint |
|---|---|---|---|
| Ours only | 92.9 ± 0.4 | 92.9 ± 0.3 | 92.9 ± 0.2 |
| Transferred | 93.2 ± 0.3 | 93.4 ± 0.3 | 93.6 ± 0.3 |
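The cells above report mean ± standard deviation, presumably aggregated over multiple training runs. The snippet below shows one way such cells could be produced; the per-seed scores are placeholders, not the paper's actual run-level results.

```python
from statistics import mean, stdev

def cell(scores):
    """Format per-run accuracies as 'mean ± std' with one decimal place."""
    return f"{mean(scores):.1f} ± {stdev(scores):.1f}"

# Placeholder per-seed accuracies (illustrative, not reported values):
joint_transferred = [93.3, 93.9, 93.6]
print(cell(joint_transferred))  # -> 93.6 ± 0.3
```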
Table 6. Error analysis of all generation models. R: average % of repeated topics; EE: average % of the interview conducted before the model attempts to end (higher is better); OT: average % of off-topic utterances.

| Model | Repetitions (R) | Early Ending (EE) | Off Topic (OT) |
|---|---|---|---|
| BB | 30.0 | 13.3 | 20.0 |
| SW | 16.7 | 23.3 | 26.7 |
| CT | 6.7 | 46.7 | 10.0 |
Table 7. The rating distribution of the InterviewBot conversations for real-time evaluation. 5: very satisfied; 4: satisfied; 3: neutral; 2: unsatisfied; 1: very unsatisfied.

| Rater | 5 | 4 | 3 | 2 | 1 | Average Score |
|---|---|---|---|---|---|---|
| Interviewer (count) | 1 | 1 | 2 | 1 | 0 | 3.4 |
| Student (count) | 2 | 3 | 4 | 0 | 1 | 3.5 |
| Total (count) | 3 | 4 | 6 | 1 | 1 | 3.5 |
Table 8. Speaker diarization error analysis of the text transcriptions from the interview recordings, covering 100 conversations with an average of 39 turns each. Misc: miscellaneous; ASR: automatic speech recognition; WR: word repetition; FW: filler words; AC: adjacent concatenation.

| Error Type | Count | Breakdown |
|---|---|---|
| Misc | 212 | ASR: 147; WR: 65 |
| FW | 289 | |
| AC | 378 | |
Table 9. Error type examples.

| Error Type | Example |
|---|---|
| ASR | Um, okay. My name is <inaudible>. I’m a senior year student come from Hunger-Free language school. |
| WR | I heard it said, so it’s kind of like a DIYcommunity community. |
| FW | Oh, no, I’m going to make majoring mathematics. Okay. Okay. Now why, why do you think receiving an education is important? |
| AC | What do you think the benefits might be of this kind of technology? If we develop it, I think this technology will eventually replace, um, human delivery. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).