1. Introduction
Posture recognition is a technology that classifies and identifies the posture of a person and has received considerable attention in the field of computer vision. Posture is the arrangement of the body skeleton that arises naturally or intentionally through the motion of a person. Posture recognition helps detect crimes, such as kidnapping or assault, using camera input [1], and it provides a service robot with important information for judging a situation and performing advanced functions in an automatic system. Posture recognition also aids posture-correction rehabilitation in the medical field, is used as game content in the entertainment field, and provides suggestions to athletes for maximizing their ability in sports [2,3]. Additionally, it helps elderly people who live alone and have difficulty performing certain activities, by detecting sudden danger from their posture in home environments. Thus, posture recognition has been widely studied because it is an important technique in our society.
Several studies have been performed on posture analysis, involving recognition, estimation, etc. Chan [4] proposed a scalable feedforward neural network action mixture model to estimate three-dimensional (3D) human poses using viewpoint and shape feature histogram features extracted from a 3D point-cloud input. This model is based on a mapping that converts a Bayesian network to a feedforward neural network. Veges [5] performed 3D human pose estimation with Siamese equivariant embedding: two-dimensional (2D) positions were detected and then lifted into 3D coordinates, and a rotation-equivariant hidden representation was learned by the Siamese architecture to compensate for the lack of data. Stommel [6] proposed the spatiotemporal segmentation of keypoints given by a skeletonization of depth contours. Here, the Kinect generates both a color image and a depth map; after the depth map is filtered, a 2D skeletonization of the contour points is utilized as a keypoint detector. The extracted keypoints simplify human detection to a 2D clustering problem. For all poses, the distances to the other poses were calculated and ranked by similarity. Shum [7] proposed a method for reconstructing a valid movement from the deficient, noisy posture provided by Kinect. Kinect localizes the positions of the body parts of a person; however, when some body parts are occluded, the accuracy decreases, because Kinect uses a single depth camera. Thus, the measurements are objectively evaluated to obtain a reliability score for each body part, and by fusing the reliability scores into a query of a motion database, kinematically valid similar postures are obtained. Posture recognition is also commonly studied using the inertial sensors in smartphones or other wearable devices. Lee [8] performed automatic classification of squat posture using inertial sensors via deep learning: one correct and five incorrect squat postures were defined and classified using inertial data from five sensors attached to the lower body, a random forest, and a convolutional neural network with long short-term memory. Chowdhury [9] studied detailed activity recognition with a smartphone using decision trees and support vector machines; the data from the smartphone's accelerometer were used to recognize detailed activities, such as sitting on a chair rather than simply sitting. Wu [10] studied yoga posture recognition with wearable inertial sensors based on a two-stage classifier, in which a backpropagation artificial neural network and fuzzy C-means were used to distinguish yoga postures. Idris [11] studied human posture recognition using an Android smartphone and an artificial neural network; the gyroscope data from two smartphones attached to the arm were used to classify four gestures. To acquire data from inertial sensors or smartphones, the sensors usually need to be attached to the body, which is inconvenient for elderly people at home.
The development of deep learning has greatly exceeded the performance of conventional machine learning; thus, deep learning is actively studied, and various models and learning methods have been developed. The depth of deep learning networks has expanded from tens to hundreds of layers [12], in contrast to conventional neural networks with a depth of two to three. Deep learning networks abstract data to a high level through a combination of nonlinear transforms. There are many deep learning methods, such as deep belief networks [13], deep Q networks [14], deep neural networks [15], recurrent neural networks [16], and convolutional neural networks (CNNs) [17]. CNNs integrate a feature extractor and a classifier into a single network that is trained automatically from data, and they exhibit optimal performance for image processing [18]. There are many pre-trained deep models based on the CNN, such as VGGNet [19], ResNet [20], DenseNet [21], InceptionResNet [22], and Xception [23]. These pre-trained models can be used for transfer learning. Transfer learning reuses the parameters of models that were trained on a large-scale database using high-performance processing units over a long time. It can be used effectively when a neural network needs to be trained with a small database and there is insufficient time to train the network from scratch on new data [24].
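The idea behind transfer learning can be sketched in a few lines. The following is a minimal illustration only, not the paper's implementation: a randomly initialized matrix stands in for a pre-trained feature extractor whose weights are frozen, and only a new classifier head is trained on a small synthetic dataset (all names, sizes, and data here are assumptions for the sketch).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained feature extractor: its weights
# (W_feat) are frozen, as a CNN backbone would be in transfer learning.
W_feat = rng.normal(size=(8, 4))      # maps 8-dim inputs to 4-dim features

def extract_features(x):
    return np.tanh(x @ W_feat)        # frozen layer: never updated below

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Tiny synthetic dataset (illustrative only).
X = rng.normal(size=(20, 8))
F = extract_features(X)               # features computed once, weights fixed
y = (F[:, 0] > 0).astype(int)         # labels made separable in feature space
Y = np.eye(2)[y]                      # one-hot targets

# Only the new classifier head is trained on the small target dataset.
W_head = np.zeros((4, 2))
for _ in range(200):                  # gradient descent on the head only
    P = softmax(F @ W_head)
    W_head -= 0.5 * F.T @ (P - Y) / len(X)

acc = (softmax(F @ W_head).argmax(axis=1) == y).mean()
```

In a real pipeline the frozen part would be a pre-trained backbone such as VGG19 or ResNet50, and the trainable head would be one or more new dense layers sized to the target classes.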
There are many studies on posture estimation based on deep learning. Thompson [25] used a CNN to extract multi-scale features of body parts. These features feed a high-order spatial model derived from a Markov random field, which expresses the structural constraints between joints. Carreira [26] proposed a feedback algorithm employing a top-down strategy called iterative error feedback, which carries the learned hierarchical representation of the CNN from the input to the output with a self-correcting model. Pishchulin [27] concentrated on local information by reducing the receptive field size of the Fast R-CNN (region-based CNN) [28]; thus, part detection was converted into multi-label classification and combined with DeepCut to perform bottom-up inference. Insafutdinov [29] applied residual learning [20] to include more context by increasing the size of the receptive field and the depth of the network. Georgakopoulos [30] proposed a methodology for classifying poses from binary human silhouettes using a CNN, improved with image features based on modified Zernike moments for fisheye images; the training set is composed of synthetic images created from a 3D human model using the calibration model of the fisheye camera, whereas the test set consists of actual images acquired by the fisheye camera [31]. To increase performance, many studies have used ensembles of deep models in various applications. Lee [32] designed an ensemble stacked auto-encoder based on sum and product rules for classifying horse gaits using wavelet packets from motion data of the rider. Maguolo [33] studied an ensemble of convolutional neural networks trained with different activation functions and combined with the sum rule to improve performance on small- and medium-sized biomedical datasets. Kim [34] studied deep learning based on 1D ensemble networks using electrocardiograms for user recognition; the ensemble network is composed of three CNN models with different parameters whose outputs are combined into a single result.
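Score-level combination of the kind used in these ensembles can be sketched as follows. The three score matrices below are hypothetical softmax outputs standing in for real trained CNNs (the function name and the data are assumptions for illustration):

```python
import numpy as np

def ensemble_average(score_list):
    """Combine per-model class scores with the average-of-scores rule."""
    return np.mean(score_list, axis=0)

# Hypothetical softmax outputs of three models: 4 samples x 3 classes each.
scores_a = np.array([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4],
                     [0.5, 0.4, 0.1]])
scores_b = np.array([[0.6, 0.3, 0.1],
                     [0.2, 0.6, 0.2],
                     [0.2, 0.5, 0.3],
                     [0.4, 0.5, 0.1]])
scores_c = np.array([[0.8, 0.1, 0.1],
                     [0.3, 0.5, 0.2],
                     [0.1, 0.2, 0.7],
                     [0.3, 0.6, 0.1]])

avg = ensemble_average([scores_a, scores_b, scores_c])
pred = avg.argmax(axis=1)   # final ensemble decision: array([0, 1, 2, 1])
```

Note the third sample: the individual models disagree (classes 2, 1, and 2), but averaging the scores resolves the decision to class 2; this robustness to individual-model errors is why score-level fusion is a common ensemble rule.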
Traditionally, to recognize posture, it was necessary to obtain the coordinates of body points or inertial data. This was achieved using a depth camera such as Kinect, image processing through a body model, or motion-capture devices attached to the body; regarding the latter, wearing such sensors carefully in everyday life is a nuisance. Posture recognition using images, which does not require a sensor attached to the body, does not have this problem. Because it is performed on images, it can be applied to inexpensive cameras; the device used for acquiring the experimental data is also inexpensive, even though it includes a depth camera. In conventional machine learning, there is a limitation to recognizing posture directly from an image alone [12,35,36,37]. However, owing to advances in deep learning, good performance in posture recognition can be achieved using only one image. In the present study, several pre-trained CNNs were employed for recognizing the posture of a person using variously preprocessed 2D images. Training a deep neural network for posture recognition requires a large number of posture images; therefore, a daily domestic posture database was constructed directly for posture recognition under the assumption of a domestic service robot environment. The database includes ten postures: "standing normal", "standing bent", "sitting sofa", "sitting chair", "sitting floor", "sitting squat", "lying face", "lying back", "lying side", and "lying crouched". The training, validation, and test sets were real captured images, not synthetic images. Moreover, ensemble CNN models were studied to improve performance, and the performances of posture recognition using ensemble CNNs with various types of preprocessing, not studied thus far, were compared. Among single models, type 2 exhibited 13.63%, 3.12%, 2.79%, and 0.76% higher accuracy than types 0, 1, 3, and 4, respectively, under transfer learning; and VGG19 exhibited 15.78%, 0.78%, 3.53%, 4.11%, 10.61%, and 16.02% higher accuracy than the simple CNN, VGG16 [19], ResNet50 [20], DenseNet201 [21], InceptionResNetV2 [22], and Xception [23], respectively, under transfer learning. Among ensemble systems, the system combining InceptionResNetV2 models by the average of scores was 2.02% more accurate than the best single model in our experiments under nontransfer learning. We performed posture recognition that can be applied to general security cameras to detect and respond to human risks. To this end, we acquired a large amount of data for training the networks; because the existing CNN models have a fixed input form, various methods were applied to fit the images to that form. We compared the performance of posture recognition across various existing CNN models and proposed ensemble models organized by input form and by CNN model.
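The preprocessing types themselves are defined later in the paper; purely as an illustration of fitting an arbitrary camera frame to a CNN's fixed input form, one common approach is to zero-pad the frame to a square and then resize it (the helper names, the nearest-neighbour resize, and the 224-pixel size are assumptions of this sketch, not the authors' exact procedure):

```python
import numpy as np

def pad_to_square(img):
    """Zero-pad an H x W (x C) image to a square, keeping the content centred."""
    h, w = img.shape[:2]
    side = max(h, w)
    out = np.zeros((side, side) + img.shape[2:], dtype=img.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    out[top:top + h, left:left + w] = img
    return out

def resize_nearest(img, size):
    """Nearest-neighbour resize to size x size (a fixed CNN input form)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row index for each output row
    cols = np.arange(size) * w // size   # source column index for each output column
    return img[rows][:, cols]

frame = np.ones((480, 640, 3), dtype=np.uint8)   # e.g. one camera frame
square = pad_to_square(frame)                    # shape (640, 640, 3)
net_input = resize_nearest(square, 224)          # shape (224, 224, 3)
```

Padding before resizing preserves the aspect ratio of the person's silhouette, whereas resizing directly would distort it; choices like this are exactly what the different preprocessing types compared in this study vary.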
We performed posture recognition using ensemble preconfigured deep models in home environments. Section 2 describes the deep models of the CNN; Section 3 discusses the database and experimental methods; Section 4 presents the experimental results; and Section 5 concludes the paper.