Article

Complex Human–Object Interactions Analyzer Using a DCNN and SVM Hybrid Approach

1 Interdisciplinary Graduate School of Agriculture and Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
2 Graduate School of Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
3 International Relation Center, University of Miyazaki, Miyazaki 889-2192, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(9), 1869; https://doi.org/10.3390/app9091869
Submission received: 31 March 2019 / Revised: 27 April 2019 / Accepted: 5 May 2019 / Published: 7 May 2019
(This article belongs to the Special Issue Advanced Intelligent Imaging Technology)

Featured Application

This research can be applied to abnormal-behavior detection systems for the elderly through the analysis of daily activities.

Abstract

Nowadays, with the emergence of sophisticated electronic devices, human daily activities are becoming more and more complex. On the other hand, research has begun on the use of reliable, cost-effective sensors, patient monitoring systems, and other systems that make daily life more comfortable for the elderly. Moreover, in the field of computer vision, human action recognition (HAR) has drawn much attention as a subject of research because of its potential for numerous cost-effective applications. Although much research has investigated HAR, most of it has dealt with simple basic actions in simplified environments; little work has been done in more complex, real-world environments. Therefore, a need exists for a system that can recognize complex daily activities in a variety of realistic environments. In this paper, we propose a system for recognizing such activities, in which humans interact with various objects, by combining object-oriented activity information, a deep convolutional neural network (DCNN), and a multi-class support vector machine (multi-class SVM). The experiments were performed on the publicly available Cornell Activity Dataset (CAD-120), a dataset of human–object interactions featuring ten high-level daily activities. The results show that the proposed system achieves an accuracy of 93.33%, which is higher than that of other state-of-the-art methods, and has great potential for applications that recognize complex daily activities.

1. Introduction

The recognition of complex daily activities and human–object interactions plays an important role in many applications, such as monitoring systems for the elderly and for patients, human–robot interaction, and other video surveillance systems. For monitoring the elderly living independently, such systems must automatically analyze daily activities and detect abnormal behavior in order to provide health-care assistance. Although some techniques have been developed for monitoring the elderly using wearable sensors, these devices can be a source of mental and physical discomfort. Therefore, research has concentrated on computer vision-based human action recognition (HAR). In this area, depth sensors have gained much attention because of their reasonable cost and adaptability to variable illumination. Depth sensors, such as Microsoft Kinect [1] and ASUS Xtion Pro [2], can capture various kinds of data, such as depth images, RGB (red, green, blue) images, infrared images, and skeletal joint information for the human body.
Moreover, humans interact daily with various kinds of objects in different ways, depending on their intentions. Human–object interaction is complex, and recognizing the actions involved is a challenging task. Research into object recognition shows that a deep-learning approach achieves superior performance over other state-of-the-art techniques. Deep learning is also achieving more and more success in HAR research. This paper discusses the application of deep-learning technology to recognizing human–object interaction. In addition, to improve the results, we have built a multi-class support vector machine (multi-class SVM) using the object usage probability (OUP), which expresses how frequently each object is used during an interaction. Our technique fuses the results of deep learning and the multi-class SVM in the final stage of interaction recognition (decision fusion). In this study, experiments were performed on the CAD-120 dataset [3] of human–object interaction, which used a depth sensor as an input device for collecting the data. This paper comprises six sections. The first gives a brief overview of the problem of understanding human–object interaction. The second section reviews related work. A new approach for the recognition of human–object interactions is described in the third section, and the experimental results of the proposed system are presented in the fourth section. Conclusions are drawn in the fifth section, and discussion and future work are given in the sixth.

2. Related Works

In the field of computer vision research, various approaches have been proposed to solve the issue of recognizing complex human daily activities. This section describes some related research into the recognition of human–object interaction.
The authors of paper [4] proposed the anticipation of human intentional actions using the affordance of objects and the context of scenes for visualizing possible future actions. The experiments were performed using a Senz3D sensor. This system can predict future action when the frame observation range is between 30% and 60% of the whole action. Moreover, the authors of paper [5] proposed a system of robotic assistants that can anticipate what humans will do next using observations of pose and the surrounding environment, with the purpose of helping people through reactive responses. Future actions are represented using the anticipatory temporal conditional random field (ATCRF), a model that can maintain a rich context of spatio-temporal relations via object affordances. Alternatively, the authors of paper [6] presented a method for predicting future actions from partially observed RGB-D (red, green, blue, depth) videos. Because of the rich context between humans and the environment while performing actions, the authors used a stochastic grammar model in order to capture the compositional structure of events and to integrate human actions with the corresponding objects and their affordances. In addition, the human–object–object (HOO) interaction affordance learning approach for improving the reliability of object recognition has also been proposed [7]. The relationship between a pair of objects is represented by a Bayesian network, which is then trained for the purpose of improving the reliability of object recognition. Moreover, a system using deep learning based on the affordance model has been proposed in [8] for recognizing human intentions and recommending objects for use. The action–object affordances were modeled using a deep structure and gaze information obtained from a Tobii 1750 eye-tracker. This system is used to recognize human intentions and suggest the objects considered useful for the recognized intention.
Moreover, Koppula et al. proposed research on learning human activities and object affordances from RGB-D videos using a structural support vector machine (SSVM) approach [9]. However, the proposed method only achieved an accuracy of 75.0% for high-level activity recognition. In subsequent research by Koppula and Saxena, the spatio-temporal relationship between human poses and objects was modeled using a conditional random field for anticipating activities [10]. In the latter study, the accuracy of detecting high-level activities portrayed in the CAD-120 dataset increased to 83.1%, which is still lower than that of our proposed system. In addition, the authors in [11] proposed a two-layer SVM hidden conditional random field (HCRF) recognition model for recognizing daily activities, specifically those involving human–object interaction. However, this method relies on learning sub-activities based on the temporal sub-structure of the interaction in order to recognize high-level activities with a hierarchical SVM-HCRF model. Last but not least, a long short-term memory network was developed for recognizing the behavior of baseball players [12]. This network features the fusion of data from multiple sensors.
Some researchers have applied deep learning to recognizing single-person actions, such as walking, sitting, and standing. Baccouche et al. [13] introduced a method for recognizing human actions by utilizing deep learning of spatio-temporal features. In this work, the authors extended a 2D convolutional neural network architecture to a 3D convolutional neural network (3D-ConvNet) architecture by adding the temporal dimension. Moreover, Liu et al. [14] proposed an approach that can be applied directly to raw depth video sequences for extracting spatio-temporal features, using a support vector machine (SVM) for the classification of actions. A HAR system based on deep-learning technology using skeleton images of human actions as input data was proposed in [15]. In addition, the authors of [16] developed a system incorporating enhanced skeleton motion history images, as well as a HAR system based on images of the relative positions of joints, which can work independently of the problem domain.
Most HAR systems use RGB-D video, eye-tracking and acceleration sensor data, and skeletal tracking data for performing their experiments. To the best of our knowledge, even though much research has been done on HAR, it remains a challenge to implement HAR that can accurately recognize activities using a simple and robust approach, especially for activities involving human–object interaction. Therefore, in this paper, we propose a hybrid approach for recognizing the activities of human–object interaction in daily life.

3. Proposed System

In this paper, we propose a system for recognizing complex human–object interactions based on usage information for the objects involved. For this purpose, we apply a deep convolutional neural network (DCNN) over the spatio-temporal features extracted from information on human joints and the objects. We also apply a multi-class support vector machine (multi-class SVM) over the object usage probability (OUP) features in order to exploit probability information about humans interacting with each object. The architecture of the proposed system is shown in Figure 1. It consists of: (i) input data acquisition, (ii) temporal segmentation, (iii) creating the input data for the DCNN, (iv) training and recognizing interactions using the DCNN, (v) extracting the object usage probability (OUP), (vi) training and recognizing interactions based on OUP by using a multi-class SVM, and finally (vii) a decision based on fusing the results of the DCNN and the multi-class SVM to produce the final result for recognizing human–object interactions. The main contributions and significant differences between this proposed system and our previous works [15,16] are as follows:
  • Recognition of human–object interactions in which humans interact with different objects in order to complete desired tasks, such as making cereal or microwaving food.
  • Creation of input data for a DCNN, which can accurately represent interactions between humans and objects.
  • Calculation of object usage probability (OUP) and training OUP using a multi-class SVM to improve the performance of recognizing the human–object interactions.
  • Fusing the result of DCNN and multi-class SVM (decision fusion) for generating better and more accurate results.

3.1. Input Data Acquisition

In the proposed system for recognizing complex human–object interaction, we use RGB images, as well as depth and human skeleton tracking data generated by depth cameras. In order to confirm validity of the proposed system, the experiments were performed on the publicly available human–object interaction dataset of CAD-120 [3]. The structures of skeletal joints and their descriptions used for the experiments are illustrated in Figure 2.

3.2. Temporal Segmentation

We performed temporal segmentation over the input data to obtain groups of data representing human motion and object interaction patterns for each activity. This process groups the nearest frames into one segment, allowing extraction of changes in both the spatial and temporal dimensions. Temporal segmentation is important because poor temporal segmentation often leads to poor interaction recognition results. For example, if the time duration threshold used in temporal segmentation is too short, features will not be well represented, and if the threshold is too long, distinguishing features within a set of interactions will be difficult to obtain. Here, we use a time duration threshold (a temporal sliding window size) of 15 frames over input data with a frame rate of 15 fps. Therefore, each segment provides a good representation of changing interaction patterns within one second. The input data are uniformly segmented using a fixed temporal sliding window of size W. For action data with a total number of frames N that cannot be divided by 15, we replicate the first frame R times, where R is calculated using Equation (1).
R = 15 − (N mod 15)    (1)
where mod denotes the modulo operation; R is the number of frames that must be added so that the sequence can be segmented uniformly. Figure 3 shows an example of temporal segmentation over the skeletal movement data of the right shoulder, right elbow, and right hand while "taking action using the right hand".
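For concreteness, the following is a minimal Python sketch of this uniform segmentation step; the function name and the way frames are represented are illustrative assumptions, not part of the original implementation.

```python
# Minimal sketch of uniform temporal segmentation with first-frame replication
# (Equation (1)). Frames can be any per-frame record, e.g., joint coordinates.
def segment_frames(frames, window=15):
    n = len(frames)
    r = 0 if n % window == 0 else window - (n % window)   # R = 15 - (N mod 15)
    padded = [frames[0]] * r + list(frames)                # replicate the first frame R times
    return [padded[i:i + window] for i in range(0, len(padded), window)]

# Example: 40 frames at 15 fps are padded to 45 frames, giving three 1-second segments.
segments = segment_frames(list(range(40)))
assert len(segments) == 3 and all(len(s) == 15 for s in segments)
```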

3.3. Creation of Input Data for a Deep Convolutional Neural Network

3.3.1. Extraction of Skeletal Joint Movement Features

In order to extract spatio-temporal features, we first detected the moving joints within 15 frames. Next, we calculated the Euclidean distance between two consecutive frames for all joints using Equation (2). Then we found the mean and the standard deviation for joint distance as described in Equations (3) and (4).
Dist_{i,j} = \sqrt{(x_{i+1,j} − x_{i,j})^2 + (y_{i+1,j} − y_{i,j})^2}    for i = 1, 2, …, T and j = 1, 2, …, S    (2)
m_j = (1/T) \sum_{i=1}^{T} Dist_{i,j}    for j = 1, 2, …, S    (3)
sd_j = \sqrt{(1/T) \sum_{i=1}^{T} (Dist_{i,j} − m_j)^2}    for j = 1, 2, …, S    (4)
where the value of S is 15, the total number of skeletal joints, and the value of T is 14, which represents W − 1 (the frame interval within the temporal sliding window W). Dist_{i,j} refers to the Euclidean distance between the locations of joint j in two consecutive frames i and i + 1. The symbols m_j and sd_j represent the mean and the standard deviation for joint j. If the value of Dist_{i,j} is greater than or equal to the value of (m_j + sd_j × 0.5), then joint j at time i is regarded as a moving joint and its movement frequency (freq_j) is incremented. RGB marker values are predefined as shown in Table 1, and the values of i, j, and freq_j for moving joints are used as indices for selecting the RGB color values. Sample data for Dist_{i,j}, m_j, sd_j, and freq_j within the sliding window for the action "taking using the right hand" are given in Table 2. After obtaining the locations of the moving joints, we placed a circular marker at the location of moving joint j in each frame with time i to create the input images for the DCNN. We can see that the freq_j values for the right shoulder, right elbow, and right hand are higher than those for the other joints, indicating the validity of this process for detecting moving joints (highlighted with yellow color).
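A hedged NumPy sketch of Equations (2)–(4) and the moving-joint test is given below; the array layout (a window of 15 frames by 15 joints with (x, y) coordinates) and the function name are assumptions made for illustration.

```python
import numpy as np

def moving_joint_frequency(window_joints, alpha=0.5):
    """window_joints: array of shape (W, S, 2) = (15 frames, 15 joints, x/y)."""
    # Dist_{i,j}: Euclidean distance of every joint between consecutive frames, shape (T, S), T = W - 1.
    dist = np.linalg.norm(np.diff(window_joints, axis=0), axis=2)
    m = dist.mean(axis=0)                      # m_j, Equation (3)
    sd = dist.std(axis=0)                      # sd_j, Equation (4)
    moving = dist >= (m + alpha * sd)          # joint j counted as moving at time i
    freq = moving.sum(axis=0)                  # freq_j over the window
    return dist, m, sd, moving, freq

# Toy usage with synthetic joint positions.
window = np.random.rand(15, 15, 2) * 100
dist, m, sd, moving, freq = moving_joint_frequency(window)
```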

3.3.2. Creation of an Object Representation Image

In human–object interaction, humans interact with various kinds of objects in different ways. In real-world applications, the variety of objects with which humans interact is quite large, even within the same object category. For example, the items used for cooking or taking medicine can vary. Therefore, to insert object information into the input images robustly, object representation images (ORI) that provide a good representation of each object category must be created and combined with the input data for the DCNN. In this system, we use color values for object representation. We applied ten color values for the ten object categories, because the dataset on which we performed experiments included ten objects with different designs, namely a medicine box, microwave, remote control, milk, plate, cloth, bowl, book, cup, and box. The color representation of each object category is described in Table 3. After defining the color representation for each object category, we added the corresponding color value to each DCNN input image by filling the object regions (within the bounding boxes) provided in the CAD-120 dataset [3] with that color. The result of inserting the ORI is shown in Figure 4. This process also provides information on the region connection state (RCS), which expresses the graphically connected region for each object. This kind of information is very useful for recognizing human–object interactions, such as those involved in microwaving food or pouring milk into a bowl of cereal, which include overlapping regions between objects. Figure 5 provides illustrations of three kinds of RCS involving a microwave, bowl, milk bottle, cup, and medicine box.
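The sketch below illustrates how such a DCNN input image could be assembled from the Table 3 colors and the dataset's object bounding boxes; OpenCV is assumed only for drawing, and the image size, function names, and data layout are illustrative rather than the authors' implementation.

```python
import numpy as np
import cv2

OBJECT_COLORS = {                              # (R, G, B) values from Table 3
    "medicine box": (0, 255, 0),   "microwave": (0, 255, 255),
    "remote control": (0, 128, 255), "milk": (0, 0, 255), "plate": (128, 0, 255),
    "cloth": (255, 0, 255), "bowl": (255, 0, 128), "book": (255, 0, 0),
    "cup": (128, 0, 0), "box": (0, 0, 128),
}

def render_input_image(moving_joint_markers, objects, size=(128, 128)):
    """moving_joint_markers: list of (x, y, (r, g, b)) circular markers;
    objects: list of (category, (x1, y1, x2, y2)) bounding boxes from the annotations."""
    img = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    for name, (x1, y1, x2, y2) in objects:                    # object representation image (ORI)
        cv2.rectangle(img, (x1, y1), (x2, y2), OBJECT_COLORS[name], thickness=-1)
    for x, y, color in moving_joint_markers:                  # moving-joint markers colored as in Table 1
        cv2.circle(img, (int(x), int(y)), radius=3, color=color, thickness=-1)
    return img   # channels stored as (R, G, B) here; swap to BGR if displaying with OpenCV
```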

3.4. Interaction Recognition Using a Deep Convolutional Neural Network (DCNN)

Deep convolutional neural networks (DCNNs), a widely used form of deep learning, are neural networks that apply grid-type operations of convolution and pooling over grid-type input data such as images. The main function of the convolution operations is to extract key information from the raw input data through multiplication with kernel matrices. Pooling operations are used for reducing the dimensions of each DCNN output layer; pooling can be performed by taking the average or the maximum of the data within a defined matrix. In every DCNN layer, a decision must be made on whether the results of convolution or pooling should be produced as output and sent to the next layer. This decision-making process is performed by the activation function. Various activation functions have been discussed in the DCNN literature, including the sigmoid, hyperbolic tangent, and rectified linear unit (ReLU) functions, each with its own properties. For example, the output of the sigmoid activation function lies within the range of 0 to 1, and the output of the hyperbolic tangent function lies between −1 and 1. In the proposed system, the rectified linear unit (ReLU) activation function was applied in all DCNN layers. ReLU passes input values through unchanged if they are greater than 0 and produces 0 if the input values are less than 0.
The most recent DCNN research indicates that DCNN achieves superior performance in visual object and pattern recognition. Therefore, we applied DCNN in the proposed system for recognizing patterns of human motion and object interaction. As shown in Figure 6, the DCNN architecture comprises three convolution (Conv) layers and three pooling (Pool) layers, followed by one fully connected (F) layer and an output (O) layer. The total number of output neurons for each DCNN layer was 32, 64, 64, and 10. In addition, each layer was followed by a drop-out (D) layer for minimizing the data overfitting problem, using drop-out ratios of 1%, 2%, 3%, and 4%. Convolution operations were performed in three hidden layers using kernel sizes of 7 × 7, 5 × 5, and 3 × 3. For pooling operations, 2 × 2 kernels were used. Then, one fully connected layer was applied before the output layer in order to combine DCNN features in a one-dimensional vector with a length of 64. The next layer was an output layer with ten neurons for recognizing ten human–object interactions. For transforming the results for DCNN output layers into corresponding probability values (P_DCNN), a Soft-Max operation was applied. Weight initialization for all layers was done using the MSRA method, which was developed by Microsoft Research Asia (MSRA) [17]. The stochastic gradient descent algorithm [18] was used for weight updating, and the whole network was constructed using the Matcaffe deep-learning framework [19,20].
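As a rough illustration of the architecture described above, here is a minimal PyTorch sketch (the authors used the Matcaffe framework); the input image size, padding choices, and training details are assumptions made only to keep the example self-contained and runnable.

```python
import torch
import torch.nn as nn

class InteractionDCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Three convolution/pooling blocks (32, 64, 64 feature maps; 7x7, 5x5, 3x3 kernels;
        # 2x2 max pooling), each followed by a small drop-out layer as in the paper.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.01),
            nn.Conv2d(32, 64, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.02),
            nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.03),
        )
        # Fully connected layer producing a 64-dimensional vector, then the 10-way output layer.
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(64), nn.ReLU(), nn.Dropout(0.04),
            nn.Linear(64, num_classes),
        )
        for m in self.features:                                # MSRA (He) initialization [17]
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)                    # P_DCNN: per-class probabilities

model = InteractionDCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)       # stochastic gradient descent [18]
p_dcnn = model(torch.rand(1, 3, 128, 128))                     # assumed 128x128 input image
```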

3.5. Extracting the Object Usage Probability

Complex human daily activities consist of many continuous sub-actions. Therefore, the application of fixed-size segmentation can cause a mixing of such continuous sub-actions. However, the proposed system can overcome the problem of mis-recognizing actions due to the mixing of features across continuous sub-actions by applying the object usage probability (OUP). In daily life, humans use various kinds of objects for accomplishing tasks. Therefore, activeness information for each object involved in a specific set of human–object interactions is very useful in establishing a recognition system for complex human–object interaction. In our proposed system, we consider an object to be in a using state when a human hand reaches for the object and then moves it. Let us denote the total number of frames that include the usage of object o during interaction k as C_{o,k}, and the total number of frames in interaction k as TF_k; then the OUP of object o for interaction k can be calculated using Equation (5). The probability that no interaction occurs with any object within interaction k is denoted as null_k, which can be calculated using Equation (6). Sample object usage counts and OUP data for the ten actions in the CAD-120 dataset are shown in Table 4 and Table 5.
OUP_{o,k} = C_{o,k} / TF_k    for k = 1, 2, …, 10 and o = 1, 2, …, 10    (5)
null_k = 1 − \sum_{o=1}^{10} OUP_{o,k}    for k = 1, 2, …, 10    (6)
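A minimal sketch of this OUP computation is shown below; the data layout (per-object frame counts) is an assumption, and the sample numbers are taken from the "making cereal" row of Table 4.

```python
OBJECTS = ["medicine box", "microwave", "remote control", "milk", "plate",
           "cloth", "bowl", "book", "cup", "box"]

def object_usage_probability(usage_frames, total_frames):
    """usage_frames: dict mapping object category -> C_{o,k}; total_frames: TF_k.
    Returns the 11-dimensional feature vector [OUP_1, ..., OUP_10, null_k]."""
    oup = [usage_frames.get(o, 0) / total_frames for o in OBJECTS]   # Equation (5)
    null_prob = 1.0 - sum(oup)                                       # Equation (6)
    return oup + [null_prob]

# "making cereal" sample from Table 4: milk used in 137, bowl in 87, box in 254 of 520 frames.
feature = object_usage_probability({"milk": 137, "bowl": 87, "box": 254}, 520)
```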

3.6. Training and Recognizing Interactions Based on OUP Using a Multi-Class Support Vector Machine

Support vector machines (SVMs) are supervised learning algorithms that are very popular for visual pattern recognition. SVMs were originally designed for binary classification and in that form are also called linear SVMs. The main process in a linear SVM is finding the optimal separating hyperplane, which is the decision boundary separating the two classes of data. A linear SVM is defined by its support vectors, the training samples that lie closest to the decision boundary and determine the margin. During the training step, the SVM finds the maximum-margin hyperplane, which gives the largest separation between the classes. For multi-class SVM classification, the one-against-one method is used, training one binary classifier for each pair of actions; therefore, a one-against-one design coded into the multi-class SVM yields M(M − 1)/2 binary learners for M classes. The output of each SVM (separating hyperplane) can be written as
y_k(X) = \sum_{v=1}^{N} w_v^k · x_v + b    (7)
where y_k is the output of the SVM for interaction k, and X = {x_1, x_2, …, x_N} is the feature vector containing N features. w_v^k is the weight of the SVM for interaction k and feature v, and b is a scalar bias. In the proposed system, we trained the multi-class SVM using the OUP data in order to implement the human–object interaction recognition system, as shown in Figure 7. In the training phase, the OUP features were calculated from the object information of the training video sequences and then used as feature vectors for training the multi-class SVM. In the testing phase, the multi-class SVM produced the classified interactions together with confidence scores. Later, we fused the confidence scores from the multi-class SVM, which are the probability values for the M possible actions (P_SVM), with the results of the DCNN to establish a robust system for recognizing human–object interaction.
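The following scikit-learn sketch shows one way to train such an OUP-based multi-class SVM; SVC applies one-against-one decomposition internally and, with probability=True, returns the per-class confidence scores used later as P_SVM. The data shapes and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def train_oup_svm(oup_features, labels):
    """oup_features: (num_videos, 11) array of OUP vectors; labels: interaction ids 0..9."""
    clf = SVC(kernel="linear", probability=True)   # one-against-one pairwise classifiers internally
    clf.fit(oup_features, labels)
    return clf

# Toy usage with random OUP-like vectors for 10 interaction classes.
X = np.random.rand(200, 11)
y = np.random.randint(0, 10, size=200)
svm = train_oup_svm(X, y)
p_svm = svm.predict_proba(X[:1])                   # confidence scores for the 10 interactions
```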

3.7. Decision Fusion and Human–Object Interaction Recognition

After obtaining probability values for all ten interactions from DCNN and OUP based multi-class SVM, we performed the decision fusion operation for accurately obtaining decision results in the recognition of human–object interaction. The decision fusion process was simply performed by averaging the probability values for all interactions and using the class with the highest probability. The mathematical expression for the decision fusion process is shown in Equations (8) and (9). In this way, mis-recognized actions obtained using DCNN can be correctly recognized by the OUP-based multi-class SVM, and vice versa.
Recognized Interaction = Max(P_k)    for k = 1, 2, …, 10    (8)
P_k = (P_DCNN_k + P_SVM_k) / 2    for k = 1, 2, …, 10    (9)
where P_DCNN_k, P_SVM_k, and P_k refer to the probability of interaction k obtained using the DCNN, the multi-class SVM, and the decision fusion process, respectively.
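A minimal sketch of this fusion step follows; the class-name list matches the ten CAD-120 activities, while the toy probability vectors are placeholders.

```python
import numpy as np

INTERACTIONS = ["making cereal", "taking medicine", "stacking objects", "unstacking objects",
                "microwaving food", "picking objects", "cleaning objects", "taking food",
                "arranging objects", "having a meal"]

def fuse_decisions(p_dcnn, p_svm):
    p = (np.asarray(p_dcnn) + np.asarray(p_svm)) / 2.0   # Equation (9)
    k = int(np.argmax(p))                                # Equation (8): class with the highest P_k
    return INTERACTIONS[k], p

# Toy usage: a flat DCNN output fused with an SVM output that favors class 0.
label, fused = fuse_decisions(np.full(10, 0.1), np.eye(10)[0])
```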

4. Experimental Results

The experiments were performed on the CAD-120 human daily activities dataset, which consists of ten high-level activities performed by four different subjects: making cereal, taking medicine, stacking objects, unstacking objects, microwaving food, picking objects, cleaning objects, taking food, arranging objects, and having a meal. The data consist of RGB images, depth images, skeleton joint coordinates, object locations, and object types for each high-level activity, and can be downloaded from the following URL: http://pr.cs.cornell.edu/humanactivities/data.php. Sample RGB images from the CAD-120 dataset are shown in Figure 8. The performance of the proposed system was evaluated by the leave-one-subject-out cross-validation method, which uses three subjects' data as training data and the remaining subject's data as test data. Therefore, we alternately trained the DCNN and multi-class SVM using data from three different subjects and used data from the remaining subject as test data, performing the experiments four times by alternately training and testing on the four subjects. For accurately recognizing the interactions involved in "stacking objects" and "unstacking objects", we added some rules based on the spatial features of the objects, because the obvious difference between those two interactions is the total width spanned by the objects at the start and end of the interaction; a simple sketch of this rule is given after this paragraph. In the case of "stacking objects", the width of all objects at the beginning is larger than that at the end, whereas the opposite holds for "unstacking objects", as shown in Figure 9. The graphical form of the result of recognizing the "making cereal" interaction is shown in Figure 10. A detailed confusion matrix for recognizing the actions in the CAD-120 dataset is shown in Table 6.
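As a hedged illustration, the width rule mentioned above could be written as follows; the bounding-box representation and the threshold-free comparison are assumptions made for the sketch.

```python
def stacking_or_unstacking(first_frame_boxes, last_frame_boxes):
    """Each argument is a list of (x1, y1, x2, y2) object bounding boxes for one frame."""
    def total_width(boxes):
        return max(b[2] for b in boxes) - min(b[0] for b in boxes)
    start_w, end_w = total_width(first_frame_boxes), total_width(last_frame_boxes)
    # Stacking: objects start spread out (wide) and end piled up (narrow); unstacking is the opposite.
    return "stacking objects" if start_w > end_w else "unstacking objects"

# Toy usage: three boxes spread out at the start and piled up at the end -> "stacking objects".
start = [(10, 50, 40, 80), (60, 50, 90, 80), (110, 50, 140, 80)]
end = [(60, 20, 90, 50), (60, 50, 90, 80), (62, 0, 88, 20)]
print(stacking_or_unstacking(start, end))
```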
We calculated the overall accuracy (Recognition Rate) using the following Equation (10).
Recognition Rate = (No. of correctly recognized actions / No. of total actions in the dataset) × 100    (10)
As shown in Table 6, the proposed system correctly recognized most actions with high recognition accuracy. The recognition accuracy for taking food from a microwave was low because this action closely resembles microwaving food. The interaction involved in picking objects was mistaken for interactions such as taking food, arranging objects, and having a meal, because comparatively little object-interaction information is available for these interactions. As described in Table 7, we also compared our results with those of other state-of-the-art recognition methods evaluated on the same dataset. The proposed system outperformed the method proposed by Koppula et al. [9] by a significant margin of 12.73% and improved accuracy by a margin of 10.23% compared with the method proposed by Koppula and Saxena [10]. Moreover, our system achieved an accuracy 3.03% higher than that of the method proposed in [11]. Because deep learning is relatively insensitive to variations in the properties of the input data, we believe that our proposed system is more robust and efficient in recognizing complex daily activities in real-world applications.

5. Conclusions

In this paper, we propose a recognition system for complex human–object interactions based on a hybrid approach that combines a DCNN and a multi-class SVM. We also propose a new feature, the object usage probability (OUP), which is highly effective in recognizing human–object interaction. By applying this hybrid approach, the performance of the proposed system is improved. Moreover, for the purpose of recognizing interactions performed by different people in real-world applications, we use the leave-one-subject-out cross-validation method for performance evaluation. Using this validation method, after applying a rule based on the spatial features of the objects, the proposed system achieves an overall accuracy of 93.33%, which is higher than that of other state-of-the-art methods.

6. Discussion and Future Work

In the proposed system, we used the location information of ground-truth objects to create the DCNN input images and to calculate the OUP. In the future, we will automate the process of detecting and classifying the objects using deep-learning technology, making the whole system more automatic. For the automatic detection and classification of objects, we will perform experiments with complex backgrounds whose colors are similar to those of the associated objects. We will also work on improving the recognition accuracy for taking food from a microwave by considering additional information that can differentiate this action from similar interactions. The proposed system has only been tested using offline data. Therefore, we will perform an analysis of motion trajectories for each interaction using geometrical and statistical models for recognizing and predicting online action data in combination with the DCNN. Finally, we will record a variety of more complex human–object interaction videos under various environmental conditions and with more complex backgrounds, and perform more experiments.

Author Contributions

Methodology, C.N.P.; Supervision, T.T.Z. and P.T. The major portion of work presented in this paper was carried out by the first author C.N.P., under the supervision of the second author T.T.Z. The third author P.T. provided valuable advice on mathematical concepts. Both T.T.Z. and P.T. gave suggestions for the preparation and revision of this paper; C.N.P. devised the methodology and performed the experiments; all three authors analyzed the experimental results.

Funding

This work is partially supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI (Grant No. 17K08066).

Acknowledgments

The authors would like to express their sincere thanks to Hiromitsu Hama, Emeritus Professor at Osaka City University, Japan, for his kind and valuable suggestions and advice during this research. The authors would also like to thank the reviewers and editors of this Special Issue for their suggestions and comments towards improving this research paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Microsoft Kinect. 2013. Available online: https://developer.microsoft.com/en-us/windows/kinect (accessed on 1 June 2017).
  2. ASUS Xtion PRO LIVE. 2013. Available online: https://www.asus.com/3D-Sensor/Xtion_PRO/ (accessed on 28 October 2017).
  3. Cornell Activity Dataset. Available online: http://pr.cs.cornell.edu/humanactivities/data.php (accessed on 1 March 2018).
  4. Dutta, V.; Zielinska, T. Predicting Human Actions Taking into Account Object Affordances. J. Intell. Robot. Syst. 2018, 93, 745–761. [Google Scholar] [CrossRef]
  5. Koppula, H.S.; Saxena, A. Anticipating human activities using object affordances for reactive robotic response. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 14–29. [Google Scholar] [CrossRef] [PubMed]
  6. Qi, S.; Huang, S.; Wei, P.; Zhu, S.C. Predicting human activities using stochastic grammar. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1173–1181. [Google Scholar]
  7. Ren, S.; Sun, Y. Human-object-object-interaction affordance. In Proceedings of the 2013 IEEE Workshop on Robot Vision (WORV), Clearwater Beach, FL, USA, 15–17 January 2013; pp. 1–6. [Google Scholar]
  8. Kim, S.; Kavuri, S.; Lee, M. Intention recognition and object recommendation system using deep auto-encoder based affordance model. In Proceedings of the 1st International Conference on Human-Agent Interaction, II-1-2, Sapporo, Japan, 7–9 August 2013; pp. 1–6. [Google Scholar]
  9. Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from rgb-d videos. Int. J. Robot. Res. 2013, 32, 951–970. [Google Scholar] [CrossRef]
  10. Koppula, H.; Saxena, A. Learning spatio-temporal structure from rgb-d videos for human activity detection and anticipation. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 792–800. [Google Scholar]
  11. Selmi, M.; El-Yacoubi, M.A. Multimodal sequential modeling and recognition of human activities. In Proceedings of the International Conference on Computers Helping People with Special Needs, Linz, Austria, 13–15 July 2016; pp. 541–548. [Google Scholar]
  12. Sun, S.W.; Mou, T.C.; Fang, C.C.; Chang, P.C.; Hua, K.L.; Shih, H.C. Baseball Player Behavior Classification System Using Long Short-Term Memory with Multimodal Features. Sensors 2019, 19, 1425. [Google Scholar] [CrossRef] [PubMed]
  13. Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; Baskurt, A. Sequential deep learning for human action recognition. In Proceedings of the International Workshop on Human Behavior Understanding, Amsterdam, The Netherlands, 16 November 2011; pp. 29–39. [Google Scholar]
  14. Liu, Z.; Zhang, C.; Tian, Y. 3D-based deep convolutional neural network for action recognition with depth sequences. Image Vis. Comput. 2016, 55, 93–100. [Google Scholar] [CrossRef]
  15. Phyo, C.N.; Zin, T.T.; Tin, P. Skeleton motion history based human action recognition using deep learning. In Proceedings of the 2017 IEEE 6th Global Conference on Consumer Electronic (GCCE 2017), Nagoya, Japan, 24–27 October 2017; pp. 784–785. [Google Scholar]
  16. Phyo, C.N.; Zin, T.T.; Tin, P. Deep Learning for Recognizing Human Activities using Motions of Skeletal Joints. IEEE Trans. Consum. Electron. 2019, 65, 243–252. [Google Scholar] [CrossRef]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
  18. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  19. Jia, Y.; Shelhamer, E.; Donahue, J.; Karayev, S.; Long, J.; Girshick, R.; Guadarrama, S.; Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 3–7 November 2014; pp. 675–678. [Google Scholar]
  20. Caffe. Available online: http://caffe.berkeleyvision.org (accessed on 16 December 2017).
Figure 1. Architecture of a complex human–object interaction recognition system. DCNN: deep convolutional neural network; SVM: support vector machine.
Figure 2. Skeletal joints and their descriptions in the CAD-120 dataset.
Figure 3. Illustration of temporal segmentation over the skeletal joint movement data.
Figure 4. Sample DCNN input images.
Figure 5. Three kinds of region connection states between microwave (cyan), bowl (pink), milk bottle (blue), cup (brown), and medicine box (green).
Figure 6. DCNN architecture.
Figure 7. Illustration of training and testing phases of multi-class SVM using object usage probability (OUP) data.
Figure 8. Sample RGB images from the CAD-120 dataset.
Figure 9. Illustration of spatial features between stacking and unstacking objects (First row: RGB image; Second row: DCNN input image; yellow line indicates total object width).
Figure 10. Graphical form of the result of recognizing the making cereal interaction (each row describes the interactions performed by a different subject).
Table 1. RGB values of markers and their indices (i, j, and freqj).
Time (i):    1    2    3    4    5    6    7    8    9   10   11   12   13   14    -
R value:    18   36   55   73   91  109  128  146  164  182  200  219  237  255    -
Joint (j):   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
G value:    17   34   51   68   85  102  119  136  153  170  187  204  221  238  255
freq_j:      1    2    3    4    5    6    7    8    9   10   11   12   13   14    -
B value:    18   36   55   73   91  109  128  146  164  182  200  219  237  255    -
Table 2. Calculation of joint movement frequency from skeletal joint distance data.
Joint (j)        1     2     3     4     5     6     7     8     9    10    11    12    13    14    15
Dist1,j       0.55  0.69  0.73  0.7   0.89  0.7   2.18  0.74  0.74  0.77  1.1   0.09  4.81  0.72  1.22
Dist2,j       0.59  0.64  0.62  0.62  0.57  0.65  2.02  0.63  0.63  0.66  1.03  1.89  5.41  0.64  1.2
Dist3,j       0.97  0.77  0.69  0.83  0.67  0.72  1.97  0.56  32.6  0.67  1.04  1.25  6.21  52.8  1.22
Dist4,j       1.1   1     0.91  0.86  5.93  1.17  2.2   0.78  10.6  1.11  1.28  7.11  4.81  20.5  1.44
Dist5,j       0.86  0.85  0.79  0.89  4.65  0.85  1.75  0.76  5.74  0.7   0.98  4.73  4.33  6.48  1.21
Dist6,j       0.78  1.11  1.31  1.15  2.24  1.08  2.05  1.05  9.81  2.06  2.33  2.14  2.63  12.7  2.58
Dist7,j       0.81  0.89  1.17  0.95  1.46  0.87  1.69  1.43  2.1   1.47  1.74  1.52  2.9   2.16  1.97
Dist8,j       0.95  0.9   0.98  0.84  1.55  0.93  1.33  1.27  8.22  0.99  1.13  0.61  2.87  8.22  1.31
Dist9,j       1.45  1.37  1.38  1.32  1.34  1.37  1.78  1.48  3.79  1.32  1.39  0.96  2.46  3.85  1.5
Dist10,j      0.94  0.81  0.7   0.81  2     0.9   1.62  0.63  3.33  0.65  0.76  1.96  1.81  3.33  0.93
Dist11,j      1.09  0.88  0.73  0.62  2.02  1.24  1.23  0.41  2.64  1.01  1     2.14  2.19  2.7   1.05
Dist12,j      0.93  0.74  0.58  0.58  2.69  1.16  1.25  0.31  18.1  1.08  1.06  3.19  1.96  18    1.07
Dist13,j      1.26  1     0.67  0.91  2.22  1.26  1.63  0.14  7.84  0.69  0.74  2.66  1.21  7.85  0.81
Dist14,j      1.47  1.23  0.82  1.21  1.06  1.41  1.07  0.32  5.25  0.83  0.84  1.7   0.79  5.22  0.95
m_j           0.98  0.92  0.86  0.88  2.09  1.02  1.7   0.75  7.96  1     1.17  2.28  3.17  10.4  1.32
sd_j          0.28  0.21  0.26  0.22  1.52  0.25  0.37  0.42  8.5   0.4   0.42  1.8   1.66  13.7  0.46
m_j+(sd_j*0.5)1.12  1.02  0.99  0.99  2.85  1.15  1.88  0.96  12.2  1.2   1.38  3.18  4     17.2  1.55
freq_j           3     3     3     3     2     6     5     4     2     3     3     3     5     3     2
Joint movement frequencies (freq_j) greater than or equal to 5 are described in red color.
Table 3. Color representation of each object category in the CAD-120 dataset.
Object            R    G    B      Object    R    G    B
medicine box      0  255    0      cloth   255    0  255
microwave         0  255  255      bowl    255    0  128
remote control    0  128  255      book    255    0    0
milk              0    0  255      cup     128    0    0
plate           128    0  255      box       0    0  128
Table 4. Sample data for object usage counts (C) for ten actions in the CAD-120 dataset.
Medicine Box  Microwave  Remote Control  Milk  Plate  Cloth  Bowl  Book  Cup  Box  Null  Total Frames (TF)   k  Interaction
     0            0            0          137    0      0     87     0    0   254   42         520           1  making cereal
   177            0            0            0    0      0      0     0  131     0  163         471           2  taking medicine
     0            0            0            0    0      0      0     0    0   436  113         549           3  stacking objects
     0            0            0            0  376      0      0     0    0     0  113         489           4  unstacking objects
     0          315            0            0    0      0      0     0    0   248   85         648           5  microwaving food
     0            0            0            0    0      0    113     0    0     0   46         159           6  bending
     0          235            0            0    0    247      0     0    0     0  111         593           7  cleaning objects
     0          179            0            0    0      0      0     0  110     0  115         404           8  taking food
     0            0            0            0    0      0      0     0    0   252  110         362           9  arranging objects
     0            0            0            0    0      0      0     0  193     0  306         499          10  having breakfast
Table 5. Object usage probability (OUP) sample data for ten actions in the CAD-120 dataset.
Medicine Box  Microwave  Remote Control  Milk  Plate  Cloth  Bowl  Book  Cup   Box   Null  Interaction
    0            0             0         0.26    0      0    0.17    0     0   0.49  0.08  making cereal
  0.38           0             0           0     0      0      0     0  0.28     0   0.35  taking medicine
    0            0             0           0     0      0      0     0     0   0.79  0.21  stacking objects
    0            0             0           0   0.77     0      0     0     0     0   0.23  unstacking objects
    0          0.49            0           0     0      0      0     0     0   0.38  0.13  microwaving food
    0            0             0           0     0      0    0.71    0     0     0   0.29  bending
    0          0.4             0           0     0    0.42     0     0     0     0   0.19  cleaning objects
    0          0.44            0           0     0      0      0     0  0.27     0   0.28  taking food
    0            0             0           0     0      0      0     0     0   0.7   0.3   arranging objects
    0            0             0           0     0      0      0     0  0.39     0   0.61  having breakfast
Table 6. Confusion matrix for recognizing interactions in the CAD-120 dataset.
                                                        Recognized Actions
Actual Actions        Making  Taking    Stacking  Unstacking  Microwaving  Picking  Cleaning  Taking  Arranging  Having
                      Cereal  Medicine  Objects   Objects     Food         Objects  Objects   Food    Objects    a Meal
making cereal           100      0         0          0           0           0        0        0        0         0
taking medicine           0    100         0          0           0           0        0        0        0         0
stacking objects          0      0     91.67          0           0           0        0        0     8.33         0
unstacking objects        0      0         0      91.67           0           0        0        0     8.33         0
microwaving food          0      0         0          0       91.67           0        0     8.33        0         0
picking objects           0      0         0          0           0       83.33        0        0     8.33      8.33
cleaning objects          0      0         0          0           0           0      100        0        0         0
taking food               0      0         0          0        8.33        8.33        0    83.33        0         0
arranging objects         0      0         0          0           0           0        0        0    91.67         0
having a meal             0      0         0          0           0           0        0        0        0       100
The percentages of correctly recognized actions in the CAD-120 dataset are highlighted in yellow color.
Table 7. Comparison of performance on the CAD-120 dataset.
Method                      Recognition Rate
Koppula et al. [9]               80.6%
Koppula and Saxena [10]          83.1%
Selmi et al. [11]                90.30%
Proposed System                  93.33%
