1. Introduction
Researchers and engineers have been building service robots that can interact with people and accomplish assigned tasks. To deploy practical service robots, two major concerns must be addressed: the system architecture for launching the services and the creation of the service functions. At present, most services are labor-oriented, in which robots take physical actions in the environment to assist people. However, robots are now expected to play more important roles in providing domain-specific knowledge services and task-oriented services. To deliver such services, robots communicate with users naturally through spoken language, because conversation is a key instrument for developing and maintaining mutual relationships. Following our previous studies that adopted a service-oriented architecture to develop action-oriented robot services, in this work we present a trainable framework for modeling emotion-aware human–robot dialogues to provide the aforementioned services.
Regarding the many choices of supportive software architecture, some researchers have proposed adopting a cloud-based service-oriented architecture (SOA). SOA is an architectural style based on interacting software components, providing services as the fundamental units for designing, building and composing service-oriented software systems [1]. A service is a function made available by a service provider in order to deliver results to a consumer. Moreover, services are autonomous, platform-independent entities that can be described, published, discovered and loosely coupled. To deploy different kinds of services effectively and efficiently, researchers have proposed linking SOA to a cloud computing environment. In this way, the robots are no longer limited by onboard computation, memory and programming, leading to a more intelligent robotic network. Our former work implemented a cloud-based system to support a variety of user-created services [2,3]. To ensure its expandability and shareability, we constructed a service configuration mechanism and deployed the system on ROS (robot operating system [4]) computing nodes in practice.
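To make the service configuration concrete, the following minimal sketch shows how a single user-created service could be exposed as a ROS node. It is an illustrative example only; the node name, service name and the use of the standard Trigger service type are assumptions rather than the actual implementation of our system.

```python
#!/usr/bin/env python
# Illustrative sketch only: exposing one user-created service on a ROS node.
import rospy
from std_srvs.srv import Trigger, TriggerResponse

def handle_request(req):
    # Placeholder callback; a real service would forward the request to the
    # cloud-side components (dialogue model, knowledge base, ...).
    return TriggerResponse(success=True, message="service result placeholder")

if __name__ == "__main__":
    rospy.init_node("knowledge_service_node")        # hypothetical node name
    rospy.Service("provide_knowledge", Trigger, handle_request)
    rospy.spin()  # keep the node alive so other ROS nodes can call the service
```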
The most common way to achieve natural language-based human–robot interaction is to build a dialogue system as a vocal interactive interface. Essentially, the dialogue system includes a knowledge base (i.e., a dataset) of organized domain questions and their corresponding answers, and the dialogue service is to design an accurate mapping mechanism that can correctly retrieve answers in response to the users’ questions. The system operates in a question-answering manner, and most traditional approaches are based on hand-crafted rules or templates. Recently, deep learning-based methods have been successfully employed to infer neural models for question and answer sentences. These neural systems mainly use a sequence-to-sequence (seq2seq) model as a backbone to perform mappings from entire sequences of words or characters to other sequences, for example [5,6]. In addition to the dialogue content, emotion plays a significant role in determining the relevance of the answer to a specific question. By integrating emotion information into its applications, a service system can enable its services to automatically adapt to changes in the operational environment, leading to an enhanced user experience.
To enhance the service performance and equip the robot with social competences, in this work we developed an emotion-aware human–robot dialogue framework extended from our previous research presented in [7], with a series of additional experiments and newly developed dialogue services. To this end, the extended framework included two types of dialogue services. The first enabled the robot to work as a consultant providing domain-specific knowledge services. The main focus was on constructing a deep learning model for mapping question and answer sentences, tracking the human emotion during the human–robot dialogue and using this additional information to determine the relevance of the sentences obtained by the model. The second provided task-oriented dialogue services, which have attracted considerable interest due to their broad applicability for assisting users in achieving specific goals (e.g., booking flight tickets or scheduling meetings). To verify the presented approach, we conducted a series of experiments, described below, to evaluate the major system components. The results showed the effectiveness and efficiency of the presented approach.
The remainder of this paper is organized as follows.
Section 2 provides the research background and reviews the dialogue-related research work.
Section 3 describes the framework, including the functional modules of emotion classification and dialogue response selection, and the deep learning techniques used for modeling.
Section 4 presents the experimental outcomes and the performance comparisons of the different methods. Finally,
Section 5 concludes the paper.
2. Related Works
As mentioned previously, most service robot frameworks are now connected to cloud-computing environments to exploit their abundant resources. Among others, the most representative work is RoboEarth [8], driven by an open-source cloud robotics platform [9]. With this platform, robots can offload computationally heavy tasks to the cloud and access the RoboEarth knowledge repository to download required resources. Other platforms have also been developed for cloud robotic systems. For example, Pereira et al. proposed the ROSRemote framework [10], which enables users to work with ROS remotely to create several applications. More extensive surveys can be found in [11,12]. More recently, due to the rapid advances of the Internet of Things (IoT), researchers proposed the concept of the Internet of Robotic Things (IoRT) to describe a new approach to robotics [13,14]. In this way, smart devices can monitor events, fuse sensor data from a variety of sources and use local and distributed intelligence to determine the best course of action. This expands the capabilities of service robots, improves a robot’s understanding during human–machine interaction and leads to a more intelligent robotic network. Moreover, to deal with the scalability problem, researchers have started to extend the cloud computing concept for service robots to edge or fog computing to utilize resources more efficiently [15,16].
Instead of investigating issues related to resource allocation and utilization, this work aimed to develop emotion-aware dialogues for a service robot, in which the most important issues were to recognize the emotions from the user utterances and to generate appropriate machine responses. Many methods have been proposed to solve these problems from different perspectives. Because this work adopted deep learning models to address the above two issues, in the following we discuss the most relevant studies with similar computational methods.
In general, when a deep learning-based approach is used to develop dialogues, responses are generated by sequence-to-sequence (seq2seq) neural network models trained with a maximum-likelihood objective [17]. This approach treats dialogue modeling as learning a mapping between human utterances and machine responses, and the focus is on how to generate a suitable response to a human utterance from a corpus. For training dialogue models, generative and retrieval-based methods are often used. Although generative methods have the potential to produce sentences of rich content, current generative models often lack coherence and produce unnatural responses. In contrast, though retrieval-based methods are more restricted, they have the advantage of producing informative and fluent responses and are thus more practical. As can be observed, retrieval-based methods rely on the exploitation of a large and varied corpus (of human–human or human–machine interactions) [18], and deep learning models have been employed to derive the mappings (that is, a selection mechanism) between questions and answers (e.g., [5,19]).
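As an illustration of the retrieval-based setting, the sketch below ranks candidate answers by the cosine similarity between encoded sentences. The simple averaged word-embedding encoder is only a stand-in for the learned models discussed above, and the variable names are assumptions.

```python
# Sketch of retrieval-based answer selection with a placeholder sentence encoder.
import numpy as np

def encode(sentence, word_vectors, dim=100):
    # Stand-in encoder: average the pre-trained embeddings of known words.
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_candidates(question, candidates, word_vectors):
    q = encode(question, word_vectors)
    scores = []
    for a in candidates:
        v = encode(a, word_vectors)
        denom = np.linalg.norm(q) * np.linalg.norm(v) + 1e-8
        scores.append(float(np.dot(q, v) / denom))  # cosine similarity
    # Return candidates ordered from the most to the least similar.
    return sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
```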
The basic seq2seq model consists of two recurrent neural networks (RNNs): one works as an encoder to process the input, and the other as a decoder to generate the output. Because they can make predictions from running texts of varying lengths, long short-term memory (LSTM) networks are often adopted to train the answer selection mechanism. This model has been widely applied to conversation generation, and most existing works have focused on developing more advanced techniques (such as decoding strategies or network models) to improve the content quality of the responses. Many neural dialogue systems have been constructed based on this design principle. For example, Serban et al. used a hierarchical LSTM network for a conversation application [20], and Wen et al. proposed a task-oriented model to generate the correct answers in response to the needs of the given dialogue [21]. To overcome the problem of overly general (i.e., safe) responses, Wu et al. proposed a hybrid-level encoder–decoder model that utilized both word-level and character-level features [22]. Although these models, in theory, are better at maintaining the dialogue state using memory components, they require longer training times and extensive hyper-parameter search.
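For reference, the following PyTorch skeleton outlines the encoder–decoder structure described above. The layer sizes and the teacher-forced training interface are illustrative assumptions and do not reproduce any of the cited models.

```python
# Minimal seq2seq skeleton (illustrative sizes, not the cited architectures).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode the input utterance; the final (h, c) state summarizes it.
        _, state = self.encoder(self.embed(src_ids))
        # Teacher-forced decoding: feed the gold response shifted by one step.
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)
        return self.out(dec_out)  # logits over the vocabulary at each step
```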
In contrast to the above domain-specific dialogue systems that aim to generate fluent and engaging responses, the other type of neural dialogue system that has attracted much attention is task-oriented [23,24]. Task-oriented dialogue systems need to complete a specific task (i.e., achieve a goal), for example a restaurant reservation, by interacting with users through a response generation process. Existing task-oriented systems can be divided into two categories: modularized pipeline systems and end-to-end single-module systems. The former decomposes the task-oriented dialogue task into modularized pipelines to be solved separately, while the latter uses an end-to-end model to produce a sequence of output tokens directly to solve the overall task. End-to-end systems are often superior to pipeline systems, owing to characteristics such as global optimization and easier adaptation to new domains. In task-oriented dialogue systems, the most critical component is the goal tracker [25]. The system must update the state of the dialogue according to each user query and its intent. Given the current dialogue state, the system can then decide how best to respond to the user to accomplish the desired task.
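The sketch below illustrates this goal-tracking loop in its simplest form: a slot–value dialogue state is updated after every user turn, and the next system action depends on which slots are still missing. The slot names and the placeholder predictor are assumptions for illustration only.

```python
# Sketch of goal (belief) tracking for a task-oriented dialogue.
# `predict_slots` stands in for a trained tracker (e.g., a neural classifier).

REQUIRED_SLOTS = ["cuisine", "location", "price_range"]  # illustrative slots

def predict_slots(utterance):
    # Placeholder: a real tracker infers slot values from the utterance.
    return {}

def update_state(state, utterance):
    state.update({k: v for k, v in predict_slots(utterance).items() if v})
    missing = [s for s in REQUIRED_SLOTS if s not in state]
    # The policy either asks for a missing slot or acts on the completed goal.
    prompt = (f"Could you tell me your preferred {missing[0]}?"
              if missing else "Searching for matching restaurants...")
    return state, prompt
```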
In addition to employing more sophisticated models and advanced tuning mechanisms for proper response generation, some recent works attempted to augment neural dialogue models with emotional information to generate more meaningful and humanized machine responses. For example, Zhou et al. presented a model that assumed the emotion category of the human utterance was known and took it as an additional input to train a response model [26]. Sun et al. adopted an LSTM neural network for conversation modeling [27] in which an emotion category label was added to the encoder, thereby treating emotional information as an additional source for the conversational model. Moreover, Asghar et al. discussed the feasibility of employing emotion information to help generate diverse responses [28]. They proposed a model of affective response generation that produces sentences conditioned on emotional word embeddings, affective objective functions and diverse beam search. However, these methods focused only on emotional factors while ignoring content relevance, possibly resulting in a decline in the quality and diversity of the responses. The integration of emotion and content remains challenging for several reasons. First, high-quality emotion-labeled data are difficult to obtain at a large scale because emotions are subjective and hard to annotate. Moreover, it is difficult to handle emotions coherently because grammaticality and the expression of emotions must be balanced [29].
4. Experiments and Results
To evaluate the presented emotion-aware dialogue service for human–robot interaction, several sets of experimental trials were conducted. As mentioned previously, due to the lack of a single dataset containing full information on human faces, utterance emotions and dialogue content, in the experiments we used four datasets to evaluate these modules separately. The evaluations are described in the following subsections.
4.1. Performance Metrics
In the experiments, we employed the criteria often used in data classification for performance evaluation and a five-fold cross-validation strategy was also used. We first measured the numbers of true positives (TP), false positives (FP), true negatives (TN), false negatives (FN) and then used them to calculate the metrics of accuracy (proportion of correctly predicted instances relative to all predicted instances), precision (proportion of retrieved instances that were relevant), recall (proportion of relevant instances that were retrieved) and F-measure (the combined effect of precision and recall that often conflict in nature) [
43]. The metrics are defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-measure = (2 × Precision × Recall) / (Precision + Recall).

In addition to accuracy, to evaluate the performance of the answer selection in the dialogue modeling, we adopted the statistical measure MRR (mean reciprocal rank), i.e., the average of the reciprocal ranks of the results over a sample of n queries. It is defined as

MRR = (1/n) Σ_{i=1}^{n} (1/rank_i),

where rank_i refers to the rank position of the first relevant document for the i-th query.
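The metrics above translate directly into a few lines of code; the following sketch computes them from the confusion counts and the per-query ranks.

```python
# Direct implementation of the metrics defined above.
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

def mean_reciprocal_rank(ranks):
    # `ranks` holds, for each query, the rank of the first relevant answer.
    return sum(1.0 / r for r in ranks) / len(ranks)
```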
4.2. User Identification
In this work, a cloud-based system was built for a service robot and we configured a ROS framework on top of a Linux OS to connect the sensing camera nodes. Typically, a system built with ROS consists of a number of processes on a set of hosts, which are connected at runtime in a peer-to-peer topology. Here, the ROS master was a PC running the roscore and serving as the resource center for all the other ROS nodes connected to the network. The cloud parallel computing virtual machine had eight CPUs and 8 GB memory, and the GPU acceleration virtual machine had eight CPUs, 32 GB memory and an NVIDIA Tesla K80 GPU.
For user identification, experiments were conducted to evaluate the performance of face recognition. The goal was to train the robot to recognize human faces in a static manner, and we adopted OpenCV (https://opencv.org, an open source computer vision library) to train the classifiers. An online face dataset [44] was used. It included 90 image sets of different persons, in which each set contained face images taken from different viewpoints, from 90 to −90 degrees (in steps of 5 degrees). The results showed that the trained classifiers performed best in the recognition of the frontal face images. The faces in the images could be detected correctly with reasonable accuracy when the variation of the rotation angle was less than 30 degrees, and the faces could be recognized with good accuracy if the view angle was within the range of 10 to −10 degrees.
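A minimal sketch of this kind of static face recognition with OpenCV is given below. It assumes the opencv-contrib-python package (for the cv2.face module) and a Haar cascade detector; the training and prediction calls shown in the comments are illustrative rather than our exact pipeline.

```python
# Sketch of static face detection and recognition with OpenCV
# (requires opencv-contrib-python for the cv2.face module).
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recognizer = cv2.face.LBPHFaceRecognizer_create()

def detect_face(gray_image):
    faces = detector.detectMultiScale(gray_image, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    return gray_image[y:y + h, x:x + w]

# Training: `samples` is a list of cropped grayscale faces, `labels` their IDs.
# recognizer.train(samples, np.array(labels))
# Prediction on a new image: label, confidence = recognizer.predict(face_crop)
```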
4.3. Performance of Emotion Recognition
4.3.1. Performance Evaluation
To assess the performance of the emotion recognition module, we adopted the dataset used in [
45], which was derived from the Movie Dialog Corpus. The sentences in this dataset were categorized into six classes of emotions: fear, disgust, joy, sadness, anticipation and none (neutral). The deep learning approach described in
Section 3.2.2 was employed to train a model for multi-class emotion recognition. In addition, two popular learning methods, the random forest (RF) and the support vector machine (SVM) methods, were used for performance comparison.
For RF and SVM, in addition to the word features extracted by the text-processing procedure, we used the n-gram method to extract further text features from the original data for building the classifiers. N-grams can express the sequential relationships between words, and the unigram, bigram and trigram models (n = 1, 2 and 3, respectively) are often used. After a preliminary test, we used these three models to extract additional text features, and the combined feature vectors were used as the input of the two machine learning methods (RF and SVM) to enhance their performance.
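The n-gram baselines can be assembled with scikit-learn as sketched below; the hyper-parameters shown are common defaults, not the values tuned in our experiments, and the training variables are assumptions.

```python
# Sketch of the n-gram baselines: unigram-to-trigram counts feeding RF and SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

rf_model = make_pipeline(CountVectorizer(ngram_range=(1, 3)),   # uni/bi/trigrams
                         RandomForestClassifier(n_estimators=100))
svm_model = make_pipeline(CountVectorizer(ngram_range=(1, 3)),
                          SVC(kernel="linear"))

# rf_model.fit(train_sentences, train_labels)      # assumed training data
# predictions = rf_model.predict(test_sentences)
```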
Figure 6a illustrates the accuracy, precision, recall and F-score for each of the three methods. As can be seen, RF performed the best on all the metrics. The main reason could be that RF is an ensemble machine learning algorithm, and the way it samples data for its multiple grouped classifiers made it perform better than the others on the imbalanced dataset used here.
After comparing the three aforementioned methods, we applied two data processing techniques, semantic rules and data balancing, with the above learning methods to investigate their effects on performance. For the semantic rules, the five rules mentioned in Section 3.2.1 were used to perform more precise sentence segmentation; for data balancing, we adopted the scikit-learn tool to produce a set of specific class weights for the different types of emotions. The results for accuracy, precision, recall and F-score are illustrated in Figure 6b. As can be seen, in general our CNN-LSTM method obtained the best results on all performance metrics. In addition to the data balancing effect, the reason for the performance improvement could be that the semantic rules removed the irrelevant words and filtered out their effects on the sentence emotions. Thus, the learning methods were able to focus on the emotions delivered by the most relevant parts of the sentences to be predicted.
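For the data-balancing step, per-class weights can be derived with scikit-learn as in the following sketch; the variable names (e.g., train_labels) are assumptions for illustration.

```python
# Sketch of the class-balancing step: derive per-class weights from the
# (imbalanced) emotion labels and pass them to model training.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(train_labels)            # `train_labels` assumed available
weights = compute_class_weight(class_weight="balanced",
                               classes=classes, y=train_labels)
class_weight = dict(zip(classes, weights))
# e.g., with Keras: model.fit(x, y, class_weight=class_weight, ...)
```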
4.3.2. Comparisons with IBM Tone Analyzer
In addition to comparing the different machine learning methods, we evaluated a well-known emotion detection system, the IBM Watson Tone Analyzer (https://natural-language-understanding-demo.ng.bluemix.net/), for further comparison. Interestingly, the emotions the Tone Analyzer considers are slightly different from those defined in this work, and it gives degrees (values) of multiple emotions for an input sentence (also different from our work). To conduct the performance comparison, we projected the two sets of emotions (one for our work and one for the Tone Analyzer) into the well-known emotional valence and arousal space (i.e., V-A space [46]). In this space, valence indicates the hedonic value (positive or negative), ranging from unpleasant to pleasant, and arousal indicates the emotional intensity, ranging from inactive to active. The valence and arousal dimensions can be projected onto Euclidean space, where emotions are represented as point-vectors. In this way, the user’s emotion can be located in this space as a tuple of valence and arousal values.
In the experiments, we first projected the five classes (annotated in the dataset) into the V-A space to retrieve the corresponding valence and arousal values (based on the emotion positions defined in [46]). For each data instance (sentence), we took the positions of the actual (correct) class and the predicted classes in the space and obtained the set of V-A values. Then, the emotion values produced by the trained model were taken as class weights and the weighted sum was derived for that instance. On this basis, our method and the Tone Analyzer were compared. For each instance, we chose the two closest classes (those with the largest weights) and calculated their weighted distance to represent the distance between the predicted and the actual classes. To compare the performances, we divided the distance into eight intervals and counted the number of instances within each interval.
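The following sketch shows one way to compute the weighted V-A distance described above; the class coordinates are placeholders (the actual positions follow [46]) and the weighting scheme is a simplified reading of our procedure.

```python
# Hedged sketch of the valence-arousal comparison: weighted distance between
# the predicted emotion mixture and the annotated class in V-A space.
import numpy as np

VA_POSITIONS = {            # (valence, arousal); illustrative coordinates only
    "joy": (0.8, 0.5), "sadness": (-0.7, -0.4), "fear": (-0.6, 0.6),
    "disgust": (-0.7, 0.3), "anticipation": (0.4, 0.4), "none": (0.0, 0.0),
}

def weighted_va_distance(predicted_weights, true_class):
    # Keep the two classes with the largest predicted weights.
    top2 = sorted(predicted_weights.items(), key=lambda kv: kv[1], reverse=True)[:2]
    total = sum(w for _, w in top2)
    target = np.array(VA_POSITIONS[true_class])
    return sum((w / total) * np.linalg.norm(np.array(VA_POSITIONS[c]) - target)
               for c, w in top2)
```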
Table 1 presents the results, in which
x is the weighted distance. As can be seen, in general the results obtained by the presented method were better than those obtained by the Tone Analyzer for the dataset used.
4.4. Performance of Training a Dialogue Model
The next set of experiments examined the system performance of model training in retrieving (selecting) answers. In this series of experiments, a large dataset was adopted [38]. It was collected from the Insurance Library website and, after a data preprocessing procedure, included 12,889 questions and 21,325 answers. This procedure removed unsuitable data that could not form proper input question–answer pairs, cleaned irrelevant terms (such as HTML tags) and transformed the text content into internal identifiers (to form the vectors). In the experiments, the dataset was divided into two parts, of which a part of 2000 questions and 3308 answers was used for testing. The complete dialogue experiments were described in our previous work [35], and here we focus on reporting the results most related to model training for human–robot interaction.
As described in Section 3.3, in the model training phase, a positive and a negative answer were needed for each question sentence to constitute a training instance. However, in a real-world application, the correct answer A+ for a question Q can be determined easily (by the confirmation of the person asking the question), while the wrong answers are often not explicitly specified. Therefore, in the experiments here, all other answers in the dataset were considered candidate wrong answers to Q. To find the most suitable wrong answer A− for each question in the dataset, we used the above model training procedure to perform the preprocessing step of wrong answer selection. Due to the large number of answers, we randomly chose ten (instead of all) candidate answers for each question for training to reduce the computational time.
In the learning process, a random shuffling strategy was used to combine the correct and wrong answers for each question to form the training data. The model and method presented in Section 3.3 were used for training.
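The following sketch illustrates the construction of training instances and a margin-based ranking objective over cosine similarities; the margin value and the encoders producing the vectors are assumptions here, with the actual model defined in Section 3.3.

```python
# Sketch of negative sampling and a cosine margin ranking loss for (Q, A+, A-).
import random
import torch
import torch.nn.functional as F

def sample_negatives(all_answer_ids, correct_id, k=10):
    pool = [a for a in all_answer_ids if a != correct_id]
    return random.sample(pool, k)   # ten random wrong answers per question

def ranking_loss(q_vec, pos_vec, neg_vec, margin=0.2):
    # Encourage sim(Q, A+) to exceed sim(Q, A-) by at least `margin`.
    pos_sim = F.cosine_similarity(q_vec, pos_vec, dim=-1)
    neg_sim = F.cosine_similarity(q_vec, neg_vec, dim=-1)
    return torch.clamp(margin - pos_sim + neg_sim, min=0.0).mean()
```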
Figure 7a illustrates the results for the two performance metrics often used in retrieval-based dialogue modeling, accuracy and MRR. Here, the accuracy is in fact the top-one precision mentioned in other relevant studies, meaning that the model’s predicted result (i.e., the top-scoring answer) must be exactly the expected one recorded in the dataset. As shown, the LSTM-CNN model achieved the best performance, with a correct prediction rate of 0.61 and an MRR of 0.70. The results were similar to those presented in the related study [38], while the presented method involved a smaller set of parameters and was more efficient in learning. In addition to the LSTM-CNN model, a traditional embedding model (using only the word embedding technique) was implemented for performance comparison. The results are shown in Figure 7b. As presented, the accuracy of the embedding model was 0.12 and its MRR was 0.21. These results indicate that the LSTM-CNN model was more efficient; it obtained a better result in fewer iterations.
In addition to the performance evaluation of the dialogue modeling, we performed another set of trials to examine the performance when using shared knowledge, that is, a dataset translated from a different language. In these experiments, the dataset used above was translated according to the steps described in Section 3.3, and the same deep learning model and method were used for training. As mentioned previously, in contrast to English sentences, a Chinese sentence can be segmented into various combinations of words by different segmentation methods, which often leads to different modeling results. Therefore, before evaluating the performance of model training, we conducted a set of trials to investigate the effect of two popular segmentation methods, Jieba and HanLP; the results showed that Jieba performed better than HanLP. We thus chose Jieba segmentation for the subsequent model training experiments.
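The segmentation step itself is a one-line call; the sketch below uses Jieba on an illustrative insurance-domain question (the example sentence and the token split shown are for illustration only).

```python
# Illustrative Chinese word segmentation with Jieba (HanLP offers a comparable API).
import jieba

sentence = "我想了解保险理赔的流程"     # hypothetical insurance-domain question
tokens = jieba.lcut(sentence)           # returns a list of word tokens
print(tokens)  # e.g., ['我', '想', '了解', '保险', '理赔', '的', '流程']
# The token sequence is then mapped to internal ids and fed to the LSTM-CNN model.
```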
The results (i.e., accuracy and MRR) are presented in Figure 8a. As shown in the figure, the LSTM-CNN model achieved a best accuracy of 0.54 and an MRR of 0.64. Moreover, the traditional embedding model was implemented for comparison and the results are shown in Figure 8b. Similar to the experiments conducted on the original (untranslated) dataset, the results here indicate that the LSTM-CNN model was more efficient than the traditional embedding method.
Compared to the results obtained from the original dataset, the accuracy declined from 0.61 to 0.54. This indicated that the model built from the translated knowledge (i.e., dataset) could not keep the modeling performance at the same level; nevertheless, the results showed that the translated knowledge was learnable with an acceptable performance and was thus useful in building models for the resource-restricted language. The performance could be further improved when more advanced text translation techniques are applied.
4.5. Evaluation of Training Task-Oriented Dialogues
4.5.1. Performance Evaluation of Neural Belief Tracker
As mentioned previously, the belief tracker plays an important role in a goal-oriented dialogue system; it is used to track each participant’s intention from the continuous dialogue utterances between the participant and the robot. In this work, we implemented a deep CNN model to work as a neural belief tracker. The application task was to perform restaurant recommendation through the user–robot dialogue. A set of entities was pre-defined and the robot iteratively interacted with the user to derive all the missing entity values. The DSTC6 dataset was used for training the tracker. In this task (dataset), five entities were tracked, namely cuisine, location, price range, atmosphere and party size, and the system had to infer their corresponding slot values to make an appropriate recommendation.
Table 2 lists the values defined for the entities.
To achieve the task, a state tracker was trained for each entity. In the experiments, the performance metrics were the accuracy (the number of correct responses divided by the number of turns) and the loss (here, the root mean square error). The training performance for all the entities is presented in Figure 9, in which (a) shows how the accuracy improved during the training process and (b) illustrates the decreasing loss. As shown in Figure 9, an accuracy of 0.9 was obtained after 200 epochs; the loss converged to a small value after 75 epochs and approached zero by the end of the training (200 epochs).
4.5.2. Performance Evaluation of Autoencoder
The DSTC6 dataset used in the above section for training the neural belief tracker contained only dialogue information, which cannot by itself be used to make a recommendation. Therefore, in this section we adopted another public dataset (i.e., Yelp [47]) to evaluate the recommendation performance of the presented model. The original dataset contained a large number of users and their ratings of a set of shops, and it was highly sparse. To achieve our task of restaurant recommendation and to connect the recommendation module to the dialogue system, we chose the relevant data (69,634 users, 41,019 restaurants and 1,817,955 ratings) to evaluate our approach.
As mentioned above, we revised the autoencoder model through a set of experimental investigations to enhance its performance. The first phase investigated the effect of the code size (the number of nodes in the code layer). A set of code sizes (32, 64, 128 and 256) was evaluated, and the results showed that the model obtained its best performance with a size of 32. After the preliminary test for code size, in the second phase we evaluated the performance of different activation functions, including ELU, SeLU, ReLU, sigmoid and tanh, which are often used in deep learning models. The loss (root mean squared error) was employed to measure the prediction performance and the results are shown in Figure 10, in which (a) is the training process and (b) is the corresponding test process. Figure 10 indicates that overfitting occurred in all cases and that ELU obtained the best result. We then performed an additional set of trials on dropout (randomly dropping units in the network) and chose a dropout value of 0.8 to alleviate the overfitting. The third phase investigated the effect of the number of hidden layers in the deep network. In this set of experiments, we evaluated five different numbers of layers (2, 4, 6, 8 and 10) and the results are shown in Figure 11. As shown in Figure 11, although the model obtained better training performance with more hidden layers, this caused overfitting. We thus chose to use six hidden layers in the final experiments for performance comparison.
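A Keras sketch of the resulting autoencoder configuration is shown below. The hidden-layer widths are illustrative assumptions, while the code size (32), ELU activation, dropout value (0.8) and learning rate (0.005) follow the settings reported above; in practice the reconstruction loss would typically be masked to the observed ratings only, which this sketch omits for brevity.

```python
# Keras sketch of the rating autoencoder (layer widths are illustrative).
from tensorflow import keras
from tensorflow.keras import layers

N_ITEMS = 41019  # one input dimension per restaurant (a user's rating vector)

def build_autoencoder(code_size=32):
    inp = keras.Input(shape=(N_ITEMS,))
    x = layers.Dense(512, activation="elu")(inp)
    x = layers.Dense(128, activation="elu")(x)
    code = layers.Dense(code_size, activation="elu")(x)   # bottleneck (code layer)
    x = layers.Dropout(0.8)(code)                         # dropout value as reported
    x = layers.Dense(128, activation="elu")(x)
    x = layers.Dense(512, activation="elu")(x)
    out = layers.Dense(N_ITEMS, activation="linear")(x)   # reconstructed ratings
    model = keras.Model(inp, out)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.005),
                  loss="mse")  # RMSE is reported as the square root of this loss
    return model
```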
After conducting the above evaluation steps to determine the network parameters, we compared our enhanced approach to other popular collaborative filtering methods, including the well-known autoencoder AutoRec and the latent factor model NNMF (non-negative matrix factorization [48]), which is one of the best models in the relevant studies. In the experiments, for all three methods the number of epochs was 100, the code size (for our model and AutoRec) and the number of latent factors (for NNMF) was 32, and the learning rate (for our model and AutoRec) was 0.005. As a result, the losses (errors) for the proposed model, the AutoRec model and the NNMF method were 1.0868, 1.4758 and 1.1293, respectively. These results showed that the proposed method outperformed the other methods and can provide better recommendation performance.
4.6. Discussion
The above experiments evaluated our approach for a service robot to provide knowledge services. As presented, in our current design the emotion recognition was constructed separately from the dialogue modeling. The model was trained by a data-driven process with a static dataset, and the emotion classifier was then used to re-rank the sentences selected by the model. The separation of emotion recognition and dialogue modeling has several advantages. First, the emotion recognition and response generation modules can each be constructed by any effective method available, so the system operates more flexibly. Meanwhile, the reasons why the system generated particular responses remain interpretable to users for further analysis. The two subsystems could also be integrated into one model to optimize the overall structure and performance, for example, by adopting a monolithic model with an attention mechanism that captures emotion as a special context. However, such an integrated system may become relatively difficult to understand and computationally expensive.
Considering the dialogue modeling, this work trained models by a data-driven process with a static dataset. Therefore, in addition to the learning method, the quality and quantity of the dataset also influenced the overall performance. It was thus important to strengthen the role of knowledge (i.e., the dataset) to infer an enriched domain-specific model. Different strategies can be developed to exploit more knowledge resources, ranging from directly linking the dataset to up-to-date external knowledge bases, to reorganizing the dataset for optimized data use, to a more complicated procedure of transferring knowledge between different domains. For a resource-restricted language, a straightforward way is to take translated datasets as shared knowledge for modeling. We showed the effect of using translated knowledge: our application case revealed that the translated knowledge was learnable and the modeling performance could be kept at a similar level to that obtained with the original data. More advanced language translation techniques can be developed to further improve the performance.
In contrast to non-task-oriented dialogues, a task-oriented dialogue has a specific goal to achieve, and its dataset is more focused and thus relatively smaller. In such a system, the most critical component is the goal tracker, which tracks the user’s intent during the dialogue to infer the dialogue state. The system can then decide the best response accordingly to achieve the goal (e.g., the recommendations in our experiments). Through the application presented in this work, we have demonstrated that task-oriented dialogues can be practically launched for a manageable task with a clear goal and a constrained dataset. For such dialogues, a fine-tuning procedure for the model parameters needs to be carefully performed to obtain the best results. When the task becomes complicated or has a high-level (or abstract) goal to achieve, more advanced state tracking and inference mechanisms are needed to better understand the users’ intentions.