Deep Learning Applications for Pose Estimation and Human Action Recognition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (20 October 2024) | Viewed by 34556

Special Issue Editors


Dr. Paolo Russo
Guest Editor
Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
Interests: deep learning; machine learning; computer vision; depth estimation; attitude and pose estimation

Dr. Fabiana Di Ciaccio
Guest Editor
Department of Civil and Environmental Engineering, University of Florence, Via S. Marta 3, 50139 Florence, Italy
Interests: navigation and positioning; attitude and pose estimation; 3D modeling; geomatics; sensors; deep learning; computer vision; climate change; cultural heritage preservation; remote sensing

Dr. Irene Amerini
Guest Editor
Department of Computer, Control, and Management Engineering A. Ruberti, Sapienza University of Rome, 00185 Rome, Italy
Interests: multimedia forensics and security; machine learning; deep learning; computer vision

Special Issue Information

Dear Colleagues,

In the last decade, deep learning has drawn significant attention thanks to its robustness and its generalization and learning capabilities. It has been tested and successfully deployed in applications covering the majority of real-world tasks, with the aim of improving their performance. Among others, pose estimation and human action recognition have benefited from the exceptional results achieved in the deep learning field, although they still show wide margins for improvement.

This Special Issue aims to gather a significant collection of original contributions on these topics. Accurate estimation of vehicle and human pose is crucial for several applications, e.g., animal behavior research, gaming and virtual reality, medicine and biotechnology, pedestrian, aerial, and maritime navigation, robotics, and human motion tracking. Furthermore, effective human pose and action recognition offers an important contribution in many fields, such as physical therapists’ diagnoses and patient rehabilitation, as well as security and surveillance or the development of employee-free stores.

The relevant topics of this issue include but are not limited to the following:

  • Single- and multi-human pose estimation, action recognition, and tracking;
  • Terrestrial, maritime, and aerial robot pose estimation and tracking;
  • Literature reviews and surveys;
  • Datasets and sensors;
  • Novel applications and ideas focusing on surveillance, autonomous navigation, human–robot interaction, healthcare and sports, etc.

Dr. Paolo Russo
Dr. Fabiana Di Ciaccio
Dr. Irene Amerini
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • action recognition
  • pose estimation
  • human activities
  • robotics and intelligent systems
  • navigation
  • positioning
  • control
  • datasets
  • sensors
  • embedded systems and devices

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (18 papers)


Research

20 pages, 676 KiB  
Article
Efficient Limb Range of Motion Analysis from a Monocular Camera for Edge Devices
by Xuke Yan, Linxi Zhang, Bo Liu and Guangzhi Qu
Sensors 2025, 25(3), 627; https://doi.org/10.3390/s25030627 - 22 Jan 2025
Viewed by 378
Abstract
Traditional limb kinematic analysis relies on manual goniometer measurements. With computer vision advancements, integrating RGB cameras can minimize manual labor. Although deep learning-based cameras aim to offer the same ease as manual goniometers, previous approaches have prioritized accuracy over efficiency and cost on PC-based devices. Nevertheless, healthcare providers require a high-performance, low-cost, camera-based tool for assessing upper and lower limb range of motion (ROM). To address this, we propose a lightweight, fast, deep learning model to estimate a human pose and utilize predicted joints for limb ROM measurement. Furthermore, the proposed model is optimized for deployment on resource-constrained edge devices, balancing accuracy and the benefits of edge computing like cost-effectiveness and localized data processing. Our model uses a compact neural network architecture with 8-bit quantized parameters for enhanced memory efficiency and reduced latency. Evaluated on various upper and lower limb tasks, it runs 4.1 times faster and is 15.5 times smaller than a state-of-the-art model, achieving satisfactory ROM measurement accuracy and agreement with a goniometer. We also conduct an experiment on a Raspberry Pi, illustrating that the method can maintain accuracy while reducing equipment and energy costs. This result indicates the potential for deployment on other edge devices and provides the flexibility to adapt to various hardware environments, depending on diverse needs and resources. Full article
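
As a rough illustration of the measurement step described above (not the authors' implementation), the angle at a joint can be computed from three predicted keypoints, and a range of motion follows from the minimum and maximum angle over a recording. The function below is a minimal sketch assuming normalized (x, y) keypoints; the coordinates in the usage lines are hypothetical.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at keypoint b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Hypothetical shoulder/elbow/wrist predictions for one frame:
angle = joint_angle((0.42, 0.30), (0.45, 0.50), (0.60, 0.55))
# ROM over a recording = max(angles) - min(angles) across frames.
```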

14 pages, 3595 KiB  
Article
HandFI: Multilevel Interacting Hand Reconstruction Based on Multilevel Feature Fusion in RGB Images
by Huimin Pan, Yuting Cai, Jiayi Yang, Shaojia Niu, Quanli Gao and Xihan Wang
Sensors 2025, 25(1), 88; https://doi.org/10.3390/s25010088 - 27 Dec 2024
Viewed by 547
Abstract
Interacting hand reconstruction presents significant opportunities in various applications. However, it currently faces challenges such as the difficulty in distinguishing the features of both hands, misalignment of hand meshes with input images, and modeling the complex spatial relationships between interacting hands. In this paper, we propose a multilevel feature fusion interactive network for hand reconstruction (HandFI). Within this network, the hand feature separation module utilizes attentional mechanisms and positional coding to distinguish between left-hand and right-hand features while maintaining the spatial relationship of the features. The hand fusion and attention module promotes the alignment of hand vertices with the image by integrating multi-scale hand features while introducing cross-attention to help determine the complex spatial relationships between interacting hands, thereby enhancing the accuracy of two-hand reconstruction. We evaluated our method with existing approaches using the InterHand 2.6M, RGB2Hands, and EgoHands datasets. Extensive experimental results demonstrated that our method outperformed other representative methods, with performance metrics of 9.38 mm for the MPJPE and 9.61 mm for the MPVPE. Additionally, the results obtained in real-world scenes further validated the generalization capability of our method. Full article
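
For reference, the two metrics quoted above have standard definitions: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and MPVPE is the same quantity over mesh vertices. A minimal sketch, assuming arrays in millimetres:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error for arrays of shape (J, 3), in mm."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

# MPVPE: identical computation with per-vertex arrays of shape (V, 3).
```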

24 pages, 4616 KiB  
Article
Estimating a 3D Human Skeleton from a Single RGB Image by Fusing Predicted Depths from Multiple Virtual Viewpoints
by Wen-Nung Lie and Veasna Vann
Sensors 2024, 24(24), 8017; https://doi.org/10.3390/s24248017 - 15 Dec 2024
Viewed by 957
Abstract
In computer vision, accurately estimating a 3D human skeleton from a single RGB image remains a challenging task. Inspired by the advantages of multi-view approaches, we propose a method of predicting enhanced 2D skeletons (specifically, predicting the joints’ relative depths) from multiple virtual viewpoints based on a single real-view image. By fusing these virtual-viewpoint skeletons, we can then estimate the final 3D human skeleton more accurately. Our network consists of two stages. The first stage is composed of a two-stream network: the Real-Net stream predicts 2D image coordinates and the relative depth for each joint from the real viewpoint, while the Virtual-Net stream estimates the relative depths in virtual viewpoints for the same joints. Our network’s second stage consists of a depth-denoising module, a cropped-to-original coordinate transform (COCT) module, and a fusion module. The goal of the fusion module is to fuse skeleton information from the real and virtual viewpoints so that it can undergo feature embedding, 2D-to-3D lifting, and regression to an accurate 3D skeleton. The experimental results demonstrate that our single-view method can achieve a performance of 45.7 mm on average per-joint position error, which is superior to that achieved in several other prior studies of the same kind and is comparable to that of other sequence-based methods that accept tens of consecutive frames as the input. Full article
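
The paper's fusion module is learned end to end; as a loose stand-in, the sketch below shows the underlying idea of combining per-view relative-depth predictions and lifting 2D joints to 3D. The simple weighted average and the root_depth argument are assumptions for illustration, not the authors' design.

```python
import numpy as np

def fuse_relative_depths(depths_per_view, weights=None):
    """Combine relative-depth predictions of shape (V views, J joints)."""
    d = np.asarray(depths_per_view, dtype=float)
    w = np.ones(len(d)) if weights is None else np.asarray(weights, dtype=float)
    return (w[:, None] * d).sum(axis=0) / w.sum()

def lift_to_3d(xy, rel_depth, root_depth):
    """Assemble a 3D skeleton from 2D joints (J, 2) plus per-joint depths."""
    z = rel_depth + root_depth                      # absolute depth per joint
    return np.concatenate([np.asarray(xy, float), z[:, None]], axis=1)
```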

18 pages, 2211 KiB  
Article
Accuracy Evaluation of 3D Pose Reconstruction Algorithms Through Stereo Camera Information Fusion for Physical Exercises with MediaPipe Pose
by Sebastian Dill, Arjang Ahmadi, Martin Grimmer, Dennis Haufe, Maurice Rohr, Yanhua Zhao, Maziar Sharbafi and Christoph Hoog Antink
Sensors 2024, 24(23), 7772; https://doi.org/10.3390/s24237772 - 4 Dec 2024
Viewed by 1010
Abstract
In recent years, significant research has been conducted on video-based human pose estimation (HPE). While monocular two-dimensional (2D) HPE has been shown to achieve high performance, monocular three-dimensional (3D) HPE poses a more challenging problem. However, since human motion happens in a 3D space, 3D HPE offers a more accurate representation of the human, granting increased usability for complex tasks like analysis of physical exercise. We propose a method based on MediaPipe Pose, 2D HPE on stereo cameras and a fusion algorithm without prior stereo calibration to reconstruct 3D poses, combining the advantages of high accuracy in 2D HPE with the increased usability of 3D coordinates. We evaluate this method on a self-recorded database focused on physical exercise to research what accuracy can be achieved and whether this accuracy is sufficient to recognize errors in exercise performance. We find that our method achieves significantly improved performance compared to monocular 3D HPE (median RMSE of 30.1 compared to 56.3, p-value below 10⁻⁶) and can show that the performance is sufficient for error recognition. Full article
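
MediaPipe Pose itself is an off-the-shelf library, so the 2D step of such a pipeline can be sketched directly; the stereo fusion without prior calibration is the paper's contribution and is not reproduced here. File names below are hypothetical.

```python
import cv2
import mediapipe as mp

def landmarks_2d(image_bgr):
    """Run MediaPipe Pose on one frame; return normalized (x, y) landmarks."""
    with mp.solutions.pose.Pose(static_image_mode=True) as pose:
        result = pose.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        return None
    return [(lm.x, lm.y) for lm in result.pose_landmarks.landmark]

# One set of 2D landmarks per stereo view, ready for a fusion step:
left = landmarks_2d(cv2.imread("cam_left.png"))     # hypothetical paths
right = landmarks_2d(cv2.imread("cam_right.png"))
```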

16 pages, 8471 KiB  
Article
Replay-Based Incremental Learning Framework for Gesture Recognition Overcoming the Time-Varying Characteristics of sEMG Signals
by Xingguo Zhang, Tengfei Li, Maoxun Sun, Lei Zhang, Cheng Zhang and Yue Zhang
Sensors 2024, 24(22), 7198; https://doi.org/10.3390/s24227198 - 10 Nov 2024
Viewed by 933
Abstract
Gesture recognition techniques based on surface electromyography (sEMG) signals face instability problems caused by electrode displacement and the time-varying characteristics of the signals in cross-time applications. This study proposes an incremental learning framework based on densely connected convolutional networks (DenseNet) to capture non-synchronous data features and overcome catastrophic forgetting by constructing replay datasets that store data with different time spans and jointly participate in model training. The results show that, after multiple increments, the framework achieves an average recognition rate of 96.5% from eight subjects, which is significantly better than that of cross-day analysis. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is utilized to select representative samples to update the replayed dataset, achieving a 93.7% recognition rate with fewer samples, which is better than the other three conventional sample selection methods. In addition, a comparison of full dataset training with incremental learning training demonstrates that the framework improves the recognition rate by nearly 1%, exhibits better recognition performance, significantly shortens the training time, reduces the cost of model updating and iteration, and is more suitable for practical applications. This study also investigates the use of the incremental learning of action classes, achieving an average recognition rate of 88.6%, which facilitates the supplementation of action types according to the demand, and further improves the application value of the action pattern recognition technology based on sEMG signals. Full article
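
As an illustration of the sample-selection step (the abstract does not spell out the exact rule, so taking the points nearest each cluster centre is an assumption), DBSCAN-based replay selection might look like this sketch:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_replay_samples(features, per_cluster=10, eps=0.5, min_samples=5):
    """Pick representative sample indices per DBSCAN cluster for replay."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    keep = []
    for c in set(labels) - {-1}:                    # -1 marks noise points
        idx = np.where(labels == c)[0]
        centre = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - centre, axis=1)
        keep.extend(idx[np.argsort(dist)[:per_cluster]])
    return np.asarray(keep)
```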

16 pages, 8982 KiB  
Article
A Two-Stream Method for Human Action Recognition Using Facial Action Cues
by Zhimao Lai, Yan Zhang and Xiubo Liang
Sensors 2024, 24(21), 6817; https://doi.org/10.3390/s24216817 - 23 Oct 2024
Viewed by 999
Abstract
Human action recognition (HAR) is a critical area in computer vision with wide-ranging applications, including video surveillance, healthcare monitoring, and abnormal behavior detection. Current HAR methods predominantly rely on full-body data, which can limit their effectiveness in real-world scenarios where occlusion is common. In such situations, the face often remains visible, providing valuable cues for action recognition. This paper introduces Face in Action (FIA), a novel two-stream method that leverages facial action cues for robust action recognition under conditions of significant occlusion. FIA consists of an RGB stream and a landmark stream. The RGB stream processes facial image sequences using a fine-spatio-multitemporal (FSM) 3D convolution module, which employs smaller spatial receptive fields to capture detailed local facial movements and larger temporal receptive fields to model broader temporal dynamics. The landmark stream processes facial landmark sequences using a normalized temporal attention (NTA) module within an NTA-GCN block, enhancing the detection of key facial frames and improving overall recognition accuracy. We validate the effectiveness of FIA using the NTU RGB+D and NTU RGB+D 120 datasets, focusing on action categories related to medical conditions. Our experiments demonstrate that FIA significantly outperforms existing methods in scenarios with extensive occlusion, highlighting its potential for practical applications in surveillance and healthcare settings. Full article

17 pages, 1952 KiB  
Article
Cross-Domain Human Activity Recognition Using Low-Resolution Infrared Sensors
by Guillermo Diaz, Bo Tan, Iker Sobron, Iñaki Eizmendi, Iratxe Landa and Manuel Velez
Sensors 2024, 24(19), 6388; https://doi.org/10.3390/s24196388 - 2 Oct 2024
Cited by 1 | Viewed by 956
Abstract
This paper investigates the feasibility of cross-domain recognition for human activities captured using low-resolution 8 × 8 infrared sensors in indoor environments. To achieve this, a novel prototype recurrent convolutional network (PRCN) was evaluated using a few-shot learning strategy, classifying up to eleven activity classes in scenarios where one or two individuals engaged in daily tasks. The model was tested on two independent datasets, with real-world measurements. Initially, three different networks were compared as feature extractors within the prototype network. Following this, a cross-domain evaluation was conducted between the real datasets. The results demonstrated the model’s effectiveness, showing that it performed well regardless of the diversity of samples in the training dataset. Full article
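
The prototype-network classification rule referenced above is standard and compact enough to sketch: embed the few labelled (support) samples, average them into one prototype per class, and assign each query to the nearest prototype. The feature extractor (here, the paper's recurrent convolutional network) is omitted.

```python
import numpy as np

def prototype_classify(support_emb, support_lbl, query_emb):
    """Nearest-prototype rule at the core of prototypical networks."""
    classes = np.unique(support_lbl)
    protos = np.stack([support_emb[support_lbl == c].mean(axis=0)
                       for c in classes])           # one prototype per class
    dists = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]
```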

20 pages, 6995 KiB  
Article
Research on Human Posture Estimation Algorithm Based on YOLO-Pose
by Jing Ding, Shanwei Niu, Zhigang Nie and Wenyu Zhu
Sensors 2024, 24(10), 3036; https://doi.org/10.3390/s24103036 - 10 May 2024
Cited by 4 | Viewed by 4464
Abstract
In response to the numerous challenges faced by traditional human pose recognition methods in practical applications, such as dense targets, severe edge occlusion, limited application scenarios, complex backgrounds, and poor recognition accuracy when targets are occluded, this paper proposes a YOLO-Pose algorithm for human pose estimation. The specific improvements are divided into four parts. Firstly, in the Backbone section of the YOLO-Pose model, lightweight GhostNet modules are introduced to reduce the model’s parameter count and computational requirements, making it suitable for deployment on unmanned aerial vehicles (UAVs). Secondly, the ACmix attention mechanism is integrated into the Neck section to improve detection speed during object judgment and localization. Furthermore, in the Head section, key points are optimized using coordinate attention mechanisms, significantly enhancing key point localization accuracy. Lastly, the paper improves the loss function and confidence function to enhance the model’s robustness. Experimental results demonstrate that the improved model reaches an mAP50 of 95.58% and an mAP50-95 of 69.54%, with 14.6 M fewer parameters than the original model. The model achieves a detection speed of 19.9 ms per image, optimized by 30% and 39.5% compared to the original model. Comparisons with other algorithms such as Faster R-CNN, SSD, YOLOv4, and YOLOv7 show varying degrees of performance improvement. Full article

21 pages, 2576 KiB  
Article
A Deep Learning Model for Markerless Pose Estimation Based on Keypoint Augmentation: What Factors Influence Errors in Biomechanical Applications?
by Ana V. Ruescas-Nicolau, Enrique Medina-Ripoll, Helios de Rosario, Joaquín Sanchiz Navarro, Eduardo Parrilla and María Carmen Juan Lizandra
Sensors 2024, 24(6), 1923; https://doi.org/10.3390/s24061923 - 17 Mar 2024
Cited by 2 | Viewed by 1869
Abstract
In biomechanics, movement is typically recorded by tracking the trajectories of anatomical landmarks previously marked using passive instrumentation, which entails several inconveniences. To overcome these disadvantages, researchers are exploring different markerless methods, such as pose estimation networks, to capture movement with equivalent accuracy to marker-based photogrammetry. However, pose estimation models usually only provide joint centers, which are incomplete data for calculating joint angles in all anatomical axes. Recently, marker augmentation models based on deep learning have emerged. These models transform pose estimation data into complete anatomical data. Building on this concept, this study presents three marker augmentation models of varying complexity that were compared to a photogrammetry system. The errors in anatomical landmark positions and the derived joint angles were calculated, and a statistical analysis of the errors was performed to identify the factors that most influence their magnitude. The proposed Transformer model improved upon the errors reported in the literature, yielding position errors below 1.5 cm for anatomical landmarks and angle errors below 4.4 degrees across all seven movements evaluated. Anthropometric data did not influence the errors, while anatomical landmarks and movement influenced position errors, and model, rotation axis, and movement influenced joint angle errors. Full article

15 pages, 2806 KiB  
Article
Human Pose Estimation Based on Efficient and Lightweight High-Resolution Network (EL-HRNet)
by Rui Li, An Yan, Shiqiang Yang, Duo He, Xin Zeng and Hongyan Liu
Sensors 2024, 24(2), 396; https://doi.org/10.3390/s24020396 - 9 Jan 2024
Cited by 7 | Viewed by 2883
Abstract
As an important direction in computer vision, human pose estimation has received extensive attention in recent years. A High-Resolution Network (HRNet) can achieve effective estimation results as a classical human pose estimation method. However, the complex structure of the model is not conducive to deployment under limited computer resources. Therefore, an improved Efficient and Lightweight HRNet (EL-HRNet) model is proposed. In detail, point-wise and grouped convolutions were used to construct a lightweight residual module, replacing the original 3 × 3 module to reduce the parameters. To compensate for the information loss caused by the network’s lightweight nature, the Convolutional Block Attention Module (CBAM) is introduced after the new lightweight residual module to construct the Lightweight Attention Basicblock (LA-Basicblock) module to achieve high-precision human pose estimation. To verify the effectiveness of the proposed EL-HRNet, experiments were carried out using the COCO2017 and MPII datasets. The experimental results show that the EL-HRNet model requires only 5 million parameters and 2.0 GFLOPs of computation and achieves an AP score of 67.1% on the COCO2017 validation set. In addition, PCKh@0.5 is 87.7% on the MPII validation set, and EL-HRNet shows a good balance between model complexity and human pose estimation accuracy. Full article
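
The parameter-saving idea behind the lightweight residual module (point-wise plus grouped convolutions in place of a full 3 × 3 block) can be sketched as below; the exact EL-HRNet block, and the CBAM attached after it, will differ in detail.

```python
import torch
import torch.nn as nn

class LightweightResidual(nn.Module):
    """Point-wise + grouped 3x3 convolutions as a cheap residual block."""
    def __init__(self, channels, groups=4):
        super().__init__()
        # channels must be divisible by groups for the grouped convolution.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),  # point-wise
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=groups, bias=False),                      # grouped 3x3
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(x + self.body(x))   # residual connection
```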

25 pages, 1469 KiB  
Article
PosturePose: Optimized Posture Analysis for Semi-Supervised Monocular 3D Human Pose Estimation
by Lawrence Amadi and Gady Agam
Sensors 2023, 23(24), 9749; https://doi.org/10.3390/s23249749 - 11 Dec 2023
Cited by 3 | Viewed by 1615
Abstract
One motivation for studying semi-supervised techniques for human pose estimation is to compensate for the lack of variety in curated 3D human pose datasets by combining labeled 3D pose data with readily available unlabeled video data—effectively, leveraging the annotations of the former and the rich variety of the latter to train more robust pose estimators. In this paper, we propose a novel, fully differentiable posture consistency loss that is unaffected by camera orientation and improves monocular human pose estimators trained with limited labeled 3D pose data. Our semi-supervised monocular 3D pose framework combines biomechanical pose regularization with a multi-view posture (and pose) consistency objective function. We show that posture optimization was effective at decreasing pose estimation errors when applied to a 2D–3D lifting network (VPose3D) and two well-studied datasets (H36M and 3DHP). Specifically, the proposed semi-supervised framework with multi-view posture and pose loss lowered the mean per-joint position error (MPJPE) of leading semi-supervised methods by up to 15% (−7.6 mm) when camera parameters of unlabeled poses were provided. Without camera parameters, our semi-supervised framework with posture loss improved semi-supervised state-of-the-art methods by 17% (−15.6 mm decrease in MPJPE). Overall, our pose models compete favorably with other high-performing pose models trained under similar conditions with limited labeled data. Full article
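
A posture-consistency objective that is "unaffected by camera orientation" can be illustrated with a rotation-aligned comparison of two pose estimates (Kabsch alignment); this is an illustrative analogue, not the paper's differentiable loss.

```python
import numpy as np

def rotation_aligned_error(p, q):
    """Mean joint error between (J, 3) poses after optimal rotation."""
    p = p - p.mean(axis=0)                     # centre both skeletons
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)          # Kabsch via SVD
    d = np.sign(np.linalg.det(u @ vt))         # guard against reflections
    r = u @ np.diag([1.0, 1.0, d]) @ vt
    return np.linalg.norm(p @ r - q, axis=1).mean()
```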

14 pages, 608 KiB  
Article
GaitSG: Gait Recognition with SMPLs in Graph Structure
by Jiayi Yan, Shaohui Wang, Jing Lin, Peihao Li, Ruxin Zhang and Haoqian Wang
Sensors 2023, 23(20), 8627; https://doi.org/10.3390/s23208627 - 22 Oct 2023
Viewed by 1722
Abstract
Gait recognition aims to identify a person based on their unique walking pattern. Compared with silhouettes and skeletons, skinned multi-person linear (SMPL) models can simultaneously provide human pose and shape information and are robust to viewpoint and clothing variances. However, previous approaches have only considered SMPL parameters as a whole and are yet to explore their potential for gait recognition thoroughly. To address this problem, we concentrate on SMPL representations and propose a novel SMPL-based method named GaitSG for gait recognition, which takes SMPL parameters in the graph structure as input. Specifically, we represent the SMPL model as graph nodes and employ graph convolution techniques to effectively model the human model topology and generate discriminative gait features. Further, we utilize prior knowledge of the human body and elaborately design a novel part graph pooling block, PGPB, to encode viewpoint information explicitly. The PGPB also alleviates the physical distance-unaware limitation of the graph structure. Comprehensive experiments on public gait recognition datasets, Gait3D and CASIA-B, demonstrate that GaitSG can achieve better performance and faster convergence than existing model-based approaches. Specifically, compared with the baseline SMPLGait (3D only), our model achieves approximately twice the Rank-1 accuracy and requires three times fewer training iterations on Gait3D. Full article
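
The graph-convolution step applied to the SMPL nodes follows the usual GCN formulation X' = ReLU(Â X W) with a symmetrically normalised adjacency; a minimal sketch (feature sizes and the body-topology adjacency are assumptions):

```python
import numpy as np

def gcn_layer(x, adj, w):
    """One GCN step: x (N, F) node features, adj (N, N) topology, w (F, F')."""
    a = adj + np.eye(adj.shape[0])             # add self-loops
    d = np.diag(1.0 / np.sqrt(a.sum(axis=1)))  # D^-1/2
    return np.maximum(d @ a @ d @ x @ w, 0.0)  # ReLU(Â X W)
```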

19 pages, 4744 KiB  
Article
DUA: A Domain-Unified Approach for Cross-Dataset 3D Human Pose Estimation
by João Renato Ribeiro Manesco, Stefano Berretti and Aparecido Nilceu Marana
Sensors 2023, 23(17), 7312; https://doi.org/10.3390/s23177312 - 22 Aug 2023
Cited by 2 | Viewed by 1557
Abstract
Human pose estimation is an important Computer Vision problem, whose goal is to estimate the human body through joints. Currently, methods that employ deep learning techniques excel in the task of 2D human pose estimation. However, the use of 3D poses can bring more accurate and robust results. Since 3D pose labels can only be acquired in restricted scenarios, fully convolutional methods tend to perform poorly on the task. One strategy to solve this problem is to use 2D pose estimators, to estimate 3D poses in two steps using 2D pose inputs. Due to database acquisition constraints, the performance improvement of this strategy can only be observed in controlled environments, therefore domain adaptation techniques can be used to increase the generalization capability of the system by inserting information from synthetic domains. In this work, we propose a novel method called Domain Unified approach, aimed at solving pose misalignment problems on a cross-dataset scenario, through a combination of three modules on top of the pose estimator: pose converter, uncertainty estimator, and domain classifier. Our method led to a 44.1 mm (29.24%) error reduction, when training with the SURREAL synthetic dataset and evaluating with Human3.6M over a no-adaptation scenario, achieving state-of-the-art performance. Full article

17 pages, 4685 KiB  
Article
Research Method of Discontinuous-Gait Image Recognition Based on Human Skeleton Keypoint Extraction
by Kun Han and Xinyu Li
Sensors 2023, 23(16), 7274; https://doi.org/10.3390/s23167274 - 19 Aug 2023
Cited by 5 | Viewed by 1512
Abstract
As a biological characteristic, gait uses the posture characteristics of human walking for identification, which has the advantages of a long recognition distance and no requirement for the cooperation of subjects. This paper proposes a research method for recognising gait images at the frame level, even in cases of discontinuity, based on human keypoint extraction. In order to reduce the dependence of the network on the temporal characteristics of the image sequence during the training process, a discontinuous frame screening module is added to the front end of the gait feature extraction network, to restrict the image information input to the network. Gait feature extraction adds a cross-stage partial connection (CSP) structure to the spatial–temporal graph convolutional networks’ bottleneck structure in the ResGCN network, to effectively filter interference information. It also inserts XBNBlock, on the basis of the CSP structure, to reduce the estimation bias caused by network layer deepening and small-batch-size training. The experimental results of our model on the gait dataset CASIA-B achieve an average recognition accuracy of 79.5%. The proposed method can also achieve 78.1% accuracy on the CASIA-B sample, after training with a limited number of image frames, which means that the model is more robust. Full article

19 pages, 4363 KiB  
Article
TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition
by Kaixuan Wang and Hongmin Deng
Sensors 2023, 23(12), 5593; https://doi.org/10.3390/s23125593 - 15 Jun 2023
Cited by 6 | Viewed by 2623
Abstract
For skeleton-based action recognition, graph convolutional networks (GCN) have absolute advantages. Existing state-of-the-art (SOTA) methods tended to focus on extracting and identifying features from all bones and joints. However, they ignored many new input features which could be discovered. Moreover, many GCN-based action recognition models did not pay sufficient attention to the extraction of temporal features. In addition, most models had swollen structures due to too many parameters. In order to solve the problems mentioned above, a temporal feature cross-extraction graph convolutional network (TFC-GCN) is proposed, which has a small number of parameters. Firstly, we propose the feature extraction strategy of the relative displacements of joints, which is fitted for the relative displacement between its previous and subsequent frames. Then, TFC-GCN uses a temporal feature cross-extraction block with gated information filtering to excavate high-level representations for human actions. Finally, we propose a stitching spatial–temporal attention (SST-Att) block for different joints to be given different weights so as to obtain favorable results for classification. FLOPs and the number of parameters of TFC-GCN reach 1.90 G and 0.18 M, respectively. The superiority has been verified on three large-scale public datasets, namely NTU RGB + D60, NTU RGB + D120 and UAV-Human. Full article
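
The relative-displacement feature described first can be stated exactly as a central difference over the skeleton sequence; the sketch below assumes a (frames, joints, coordinates) array:

```python
import numpy as np

def relative_displacements(joints):
    """Displacement of each joint between its previous and subsequent frames.

    joints: (T, J, C) sequence; returns (T - 2, J, C) motion features.
    """
    return joints[2:] - joints[:-2]
```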

18 pages, 13751 KiB  
Article
HPnet: Hybrid Parallel Network for Human Pose Estimation
by Haoran Li, Hongxun Yao and Yuxin Hou
Sensors 2023, 23(9), 4425; https://doi.org/10.3390/s23094425 - 30 Apr 2023
Cited by 3 | Viewed by 1908
Abstract
Hybrid models which combine the convolution and transformer model achieve impressive performance on human pose estimation. However, the existing hybrid models on human pose estimation, which typically stack self-attention modules after convolution, are prone to mutual conflict. The mutual conflict enforces one type of module to dominate over these hybrid sequential models. Consequently, the performance of higher-precision keypoints localization is not consistent with overall performance. To alleviate this mutual conflict, we developed a hybrid parallel network by parallelizing the self-attention modules and the convolution modules, which helps leverage the complementary capabilities effectively. The parallel network ensures that the self-attention branch tends to model the long-range dependency to enhance the semantic representation, whereas the local sensitivity of the convolution branch contributes to high-precision localization simultaneously. To further mitigate the conflict, we proposed a cross-branches attention module to gate the features generated by both branches along the channel dimension. The hybrid parallel network achieves 75.6% and 75.4% AP on the COCO validation and test-dev sets and achieves consistent performance on both higher-precision localization and overall performance. The experiments show that our hybrid parallel network is on par with the state-of-the-art human pose estimation models. Full article
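
One plausible reading of the cross-branches attention module (gating the two branches' features along the channel dimension) is a learned channel-wise mixing weight, sketched below; the actual HPnet module may differ.

```python
import torch
import torch.nn as nn

class CrossBranchGate(nn.Module):
    """Blend convolution and self-attention features with channel gates."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, channels), nn.Sigmoid(),
        )

    def forward(self, conv_feat, attn_feat):
        # Gates in [0, 1], one per channel, computed from both branches.
        g = self.fc(torch.cat([conv_feat, attn_feat], dim=1))[:, :, None, None]
        return g * conv_feat + (1.0 - g) * attn_feat
```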

28 pages, 22634 KiB  
Article
WiTransformer: A Novel Robust Gesture Recognition Sensing Model with WiFi
by Mingze Yang, Hai Zhu, Runzhe Zhu, Fei Wu, Ling Yin and Yuncheng Yang
Sensors 2023, 23(5), 2612; https://doi.org/10.3390/s23052612 - 27 Feb 2023
Cited by 9 | Viewed by 3711
Abstract
The past decade has demonstrated the potential of human activity recognition (HAR) with WiFi signals owing to non-invasiveness and ubiquity. Previous research has largely concentrated on enhancing precision through sophisticated models. However, the complexity of recognition tasks has been largely neglected. Thus, the performance of the HAR system is markedly diminished when tasked with increasing complexities, such as a larger classification number, the confusion of similar actions, and signal distortion. To address this issue, we eliminated conventional convolutional and recurrent backbones and proposed WiTransformer, a novel tactic based on pure Transformers. Nevertheless, Transformer-like models are typically suited to large-scale datasets as pretraining models, according to the experience of the Vision Transformer. Therefore, we adopted the Body-coordinate Velocity Profile, a cross-domain WiFi signal feature derived from the channel state information, to reduce the threshold of the Transformers. Based on this, we propose two modified transformer architectures, united spatiotemporal Transformer (UST) and separated spatiotemporal Transformer (SST) to realize WiFi-based human gesture recognition models with task robustness. SST intuitively extracts spatial and temporal data features using two encoders, respectively. By contrast, UST can extract the same three-dimensional features with only a one-dimensional encoder, owing to its well-designed structure. We evaluated SST and UST on four designed task datasets (TDSs) with varying task complexities. The experimental results demonstrate that UST has achieved recognition accuracy of 86.16% on the most complex task dataset TDSs-22, outperforming the other popular backbones. Simultaneously, the accuracy decreases by at most 3.18% when the task complexity increases from TDSs-6 to TDSs-22, which is 0.14–0.2 times that of others. However, as predicted and analyzed, SST fails because of excessive lack of inductive bias and the limited scale of the training data. Full article

14 pages, 2437 KiB  
Article
Pose Mask: A Model-Based Augmentation Method for 2D Pose Estimation in Classroom Scenes Using Surveillance Images
by Shichang Liu, Miao Ma, Haiyang Li, Hanyang Ning and Min Wang
Sensors 2022, 22(21), 8331; https://doi.org/10.3390/s22218331 - 30 Oct 2022
Cited by 1 | Viewed by 2588
Abstract
Solid developments have been seen in deep-learning-based pose estimation, but few works have explored performance in dense crowds, such as a classroom scene; furthermore, no specific knowledge is considered in the design of image augmentation for pose estimation. A masked autoencoder was shown to have a non-negligible capability in image reconstruction, where the masking mechanism that randomly drops patches forces the model to build unknown pixels from known pixels. Inspired by this self-supervised learning method, where the restoration of the feature loss induced by the mask is consistent with tackling the occlusion problem in classroom scenarios, we discovered that the transfer performance of the pre-trained weights could be used as a model-based augmentation to overcome the intractable occlusion in classroom pose estimation. In this study, we proposed a top-down pose estimation method that utilized the natural reconstruction capability of missing information of the MAE as an effective occluded image augmentation in a pose estimation task. The difference with the original MAE was that instead of using a 75% random mask ratio, we regarded the keypoint distribution probabilistic heatmap as a reference for masking, which we named Pose Mask. To test the performance of our method in heavily occluded classroom scenes, we collected a new dataset for pose estimation in classroom scenes named Class Pose and conducted many experiments, the results of which showed promising performance. Full article
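
The Pose Mask idea (masking guided by a keypoint heatmap rather than uniformly at random) can be sketched as a weighted draw of patch indices; grid size, ratio, and the exact weighting are assumptions for illustration.

```python
import numpy as np

def pose_guided_mask_indices(heatmap, grid=14, mask_ratio=0.75, rng=None):
    """Pick MAE patch indices to mask, weighted by a keypoint heatmap.

    heatmap: (H, W) keypoint-probability map (e.g., summed over joints);
    patches carrying more pose mass are masked more often than uniformly.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = heatmap.shape
    ph, pw = h // grid, w // grid
    per_patch = heatmap[:ph * grid, :pw * grid].reshape(grid, ph, grid, pw)
    per_patch = per_patch.mean(axis=(1, 3)).ravel() + 1e-8  # mean heat per patch
    probs = per_patch / per_patch.sum()
    n_mask = int(mask_ratio * grid * grid)
    return rng.choice(grid * grid, size=n_mask, replace=False, p=probs)
```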