1. Summary
Human motion analysis commonly relies on optoelectronic systems that track small retroreflective markers attached to the subject’s body. Although extremely accurate, these systems are expensive and require complex setups, which confines their use to applications carried out in a dedicated laboratory (e.g., clinical analyses or motion capture for the animation industry). However, real-time human pose estimation could benefit a variety of fields, ranging from human–robot interaction and Industry 4.0 to autonomous driving, surveillance, and telerehabilitation. In such contexts, deploying an optoelectronic system is usually not feasible, and markerless analyses are a promising alternative.
Markerless body pose estimation (BPE) has been a topic of intensive research for decades in the computer vision community. Despite the improvements achieved in recent years thanks to data-driven approaches [1,2,3,4], the accurate assessment of human motion without relying on any sensor or marker attached to the body is still an open challenge. Limited camera fields of view, occlusions caused by the environment, and self-occlusions of the human body all limit the accuracy of such systems. One possible solution to reduce the impact of these limitations is to exploit a distributed camera network acquiring data of the same scene from multiple viewpoints. By fusing the partial information obtained from each camera, it is possible to reduce the effect of occlusions and, at the same time, increase the overall accuracy of the system.
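As an illustration of this fusion step, the following minimal Python sketch maps per-camera 3D joint estimates into a common world frame using known extrinsic transforms and combines them with a per-joint median; the function names, array shapes, and the choice of a median are illustrative assumptions rather than a specific method from the works cited above.

import numpy as np

def to_world(joints_cam, T_world_cam):
    """Map J x 3 joint positions from a camera frame to the world frame."""
    homog = np.hstack([joints_cam, np.ones((joints_cam.shape[0], 1))])
    return (T_world_cam @ homog.T).T[:, :3]

def fuse_views(per_camera_joints, extrinsics):
    """Median-fuse world-frame estimates from all cameras.
    Cameras that miss a joint (e.g., due to occlusion) can pass NaNs,
    which are ignored by the per-joint median."""
    world = np.stack([to_world(j, T)
                      for j, T in zip(per_camera_joints, extrinsics)])
    return np.nanmedian(world, axis=0)  # J x 3 fused skeleton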
In recent years, the development of portable and easy-to-use low-cost 3D cameras (e.g., the Microsoft Kinect, Microsoft Corp., Redmond, WA, USA) has further increased the interest in markerless BPE [5,6,7,8]. The main advantage of these devices is the possibility of retrieving real-time synchronized RGB and depth data of the scene at up to 30 fps. However, despite the widespread use of such sensors and the variety of available human motion datasets, only a small number of public datasets include RGB-D data, and even fewer offer multiple calibrated RGB-D views. In fact, to the best of the authors’ knowledge, a comprehensive dataset including complex scenes with multiple people, RGB and depth data from a large calibrated RGB-D camera network, and ground truth body poses for all the recorded sequences is still missing. All the most used markerless motion capture datasets (whether focused on BPE or on action recognition) lack at least one of these features.
HumanEva [9] is one of the first and most used datasets recorded for benchmarking markerless human pose estimation algorithms. It includes 6 actions of daily living (ADLs) performed by 4 different actors and recorded using 4 grayscale cameras, 3 RGB cameras, and a marker-based optoelectronic system as ground truth. No information on the depth of the scene is available, and each sequence involves only a single person.
Human3.6M [10], on the other hand, offers depth data of the scene acquired with a single Time-of-Flight (ToF) sensor. Also in this case, ground truth poses are acquired via marker-based motion capture, while visual data are recorded using 4 RGB cameras. The dataset includes a predefined set of 16 ADLs performed by 11 actors. Again, no interactions among subjects are available.
Our previous work, the IAS-Lab Action Dataset [11], was one of the first to include RGB-D sensors in the acquisition setup. This dataset consists of 15 ADLs performed by 12 people. RGB and depth data are provided, as well as the persons’ body poses estimated by a markerless BPE algorithm. However, data are recorded using a single Kinect v1 camera, and neither ground truth poses nor sequences with multiple people are available.
Berkeley MHAD [12] is one of the first datasets to include accelerometers in the acquisition setup. Eleven ADLs performed by 12 actors are recorded using marker-based motion capture, 12 RGB cameras, 2 Kinect v1 cameras, and 6 accelerometers. However, similarly to the previous works, the focus is on single persons’ actions, and no interactions are taken into account.
TUM Shelf [13] is among the most used datasets for benchmarking markerless BPE algorithms. It includes 5 RGB cameras recording a group of 4 people disassembling a shelf. Severe occlusions and unbounded motion of the persons are the main challenges of this dataset. However, since no other sensing devices are involved, the dataset offers only sparse manually annotated poses as ground truth. The same authors also released the TUM Campus dataset [13], whose particularity is that it was captured outdoors. The recorded scenes depict 3 people interacting on campus grounds. Similarly to TUM Shelf, only RGB cameras are used (3 in this case); thus, the same limitations apply.
CMU Panoptic [14] is a large-scale dataset that includes 480 VGA cameras, 31 HD cameras, and 10 Kinect v2 cameras. A variety of actions (including both single-person and multi-person activities) are recorded inside a custom-built dome accommodating all the hardware. However, since vision is the only sensing modality, the recorded poses are computed by triangulating the output of a 2D BPE algorithm running on each camera, without any external ground truth.
Another public dataset including multiple depth views is the NTU RGB+D dataset [15]. Forty subjects were recorded performing a set of 60 actions that include ADLs, mutual activities, and health-related movements. The persons’ poses were extracted using 3 Kinect v2 cameras. However, since the focus is on validating action recognition algorithms, no ground truth poses are provided, only labels indicating the type of action being performed.
All the aforementioned datasets rely mainly on vision, either through markerless or marker-based motion capture.
UTD-MHAD [16], on the other hand, introduced the use of one inertial measurement unit (IMU) in conjunction with a Kinect v1 camera. Eight subjects were individually recorded while performing a set of 27 predefined actions ranging from sports and hand gestures to ADLs and training exercises. Similarly to the previous work, however, the focus is on action recognition; thus, the available ground truth is limited to manually annotated labels describing the actions being performed.
Total Capture [17] is a widely used dataset and one of the first to introduce the usage of a full-body inertial suit consisting of 13 IMUs, alongside 8 RGB cameras and marker-based motion capture. Five subjects are recorded performing a set of 5 actions selected from range-of-motion activities, walking, acting, running, and freestyle. Ground truth poses are computed via marker-based motion capture. However, the dataset does not include interactions between subjects, and no information on the depth of the scene is available.
AndyData-lab [18], similarly to the previous work, includes data from marker-based motion capture, a full-body inertial suit, and 2 RGB cameras, while also adding finger pressure sensors. Since this work focuses on human motion analysis in industrial settings, 13 subjects are recorded while performing 6 industrial tasks, including screwing at different heights and manipulating loads. As in the previous work, neither interactions among subjects nor information on the depth of the scene are available.
Finally, Human4D [19] includes data from an optoelectronic system and 4 Intel RealSense RGB-D cameras (Intel Corp., Santa Clara, CA, USA). Four actors are recorded, both individually and in pairs, while performing a set of 14 single-person ADLs and 5 two-person activities in a professional motion capture studio. Ground truth poses are collected via marker-based motion capture, and both RGB and depth recordings of the scene are available. However, all actors had to wear a full-body black suit throughout the recordings to accommodate the body markers required by the optoelectronic system. Such artificial clothing does not constitute a realistic scenario and can hinder the performance of RGB-based markerless BPE algorithms, potentially decreasing their accuracy.
This paper presents the University of Padova Body Pose Estimation dataset (UNIPD-BPE), an extensive dataset for multi-sensor BPE containing a large number of single-person and multi-person sequences with up to 4 people interacting. Full-body poses, as well as raw data from each sensor, are recorded both by means of a calibrated network of 5 RGB-D cameras (i.e., Microsoft Azure Kinect, Microsoft Corp., Redmond, WA, USA) and by exploiting up to 2 highly accurate full-body inertial suits (i.e., Xsens MVN Awinda, Xsens Technologies, Enschede, The Netherlands). All recorded data are publicly available under the Creative Commons CC0 license at https://doi.org/10.17605/OSF.IO/YJ9Q4.
The Azure Kinect is the latest RGB-D camera developed by Microsoft, with improved performance compared to the previous model (Kinect v2). As demonstrated in [20], the standard deviation of the Azure Kinect depth measurements is reduced by more than 50% with respect to the Kinect v2, while the depth estimation error remains lower than 11 mm. For these reasons, the Azure Kinect is a promising device for a wide range of uses, including object recognition, people tracking and detection, and human–computer interaction. This dataset is the first to include high-definition RGB, depth, and BPE data from 5 calibrated Azure Kinect cameras. Videos and point clouds are recorded both at a resolution of 1920 × 1080 pixels @ 30 fps and at 640 × 576 pixels @ 30 fps (the native resolution of the depth sensor). Moreover, all subjects’ body poses are estimated via markerless motion capture by exploiting the Azure Kinect Body Tracking SDK [21], offering baseline data to develop and benchmark different BPE and tracking algorithms. The high number of cameras allows us to assess the impact of different camera network configurations on the accuracy achieved by markerless BPE algorithms, while the high-resolution recordings allow us to quantify how different image resolutions impact a specific algorithm.
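For reference, the recorded depth frames and point clouds are related by the standard pinhole back-projection; the short Python sketch below shows this conversion for a single 640 × 576 depth image. The intrinsic parameters fx, fy, cx, and cy are placeholders rather than the calibration values distributed with the dataset, and depth is assumed to be expressed in millimetres.

import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Back-project a depth image (H x W, millimetres) into an N x 3 point cloud."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_mm.astype(np.float32) / 1000.0  # metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # discard pixels with no valid depth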
The UNIPD-BPE dataset also contains full-body inertial motion capture data, collected by up to 2 Xsens MVN Awinda suits. Each suit consists of 17 MTw Awinda trackers, each including a 3-axis gyroscope, a 3-axis accelerometer, and a 3-axis magnetometer. As demonstrated in [22], these sensors are extremely accurate for inertial BPE. Each tracker has a dynamic accuracy of 0.75° RMS for roll and pitch, and 1.5° RMS for the heading estimation, constituting a flexible and reliable tool for capturing human motion [23]. The proposed dataset includes both the raw data from each tracker and detailed data describing each subject’s body kinematics, computed using the MVN Analyze software. This software combines the data of all motion trackers with a biomechanical model of the human body, making it possible to obtain an accurate and drift-free estimate of the body pose [24]. The hardware/software combination used in this work allowed recording raw IMU data (estimated orientations, angular velocities, linear accelerations, and magnetic fields) for all the trackers of each suit @ 60 Hz, as well as the 3D positions, orientations, velocities, and accelerations of the 23 segments defining the Xsens biomechanical model, the anatomical angles of 22 joints plus 6 additional joint angles targeted at ergonomic analyses, and the location of the body center of mass throughout all the sequences.
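As an example of how such kinematic quantities relate to the raw orientations, the following Python sketch derives a joint angle as the relative rotation between two adjacent segment orientations (quaternions in x, y, z, w order), decomposed into an Euler-angle triplet. The Euler sequence and the example values are illustrative only and do not reproduce the exact conventions of the MVN biomechanical model.

from scipy.spatial.transform import Rotation as R

def joint_angles_deg(q_proximal, q_distal, sequence="ZXY"):
    """Express the distal segment orientation in the proximal segment frame
    and decompose it into Euler angles (degrees)."""
    r_rel = R.from_quat(q_proximal).inv() * R.from_quat(q_distal)
    return r_rel.as_euler(sequence, degrees=True)

# Hypothetical example: roughly 30 degrees of flexion between two segments
print(joint_angles_deg([0.0, 0.0, 0.0, 1.0], [0.0, 0.259, 0.0, 0.966]))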
No optoelectronic data are included in this dataset because the markers attached to the body are highly reflective, causing strong distortions in the Kinects’ depth measurements and, consequently, a poor estimation of the body pose. While it is possible to synchronize the two systems to avoid interference, this solution still degrades the Azure Kinect’s performance. Therefore, to ensure the maximum accuracy of the recorded markerless data, we chose to employ an inertial motion capture system in place of an optoelectronic one. The software used to estimate the body poses (Xsens MVN Analyze), coupled with the chosen hardware (Xsens MVN Awinda), allows us to obtain an accuracy comparable to that of state-of-the-art optoelectronic systems, as demonstrated in [24].
All the cameras and inertial suits used in this work are hardware-synchronized, and the relative pose of each camera with respect to the inertial reference frame is calibrated before each sequence to ensure maximum overlap between the outputs of the two sensing systems. The proposed setup allowed us to simultaneously record synchronized 3D poses of the persons in the scene both via Xsens’ inverse kinematics algorithm (inertial motion capture) and via the Azure Kinect Body Tracking SDK (markerless motion capture). The additional raw data (RGB, depth, and camera network configuration) allow the user to assess the performance of any custom markerless motion capture algorithm (based on RGB, depth, or both), and further analyses can be performed by varying the number of cameras being used and/or their resolution and frame rate. Moreover, the raw angular velocities, linear accelerations, magnetic fields, and orientations from each IMU enable the development and testing of multimodal BPE approaches that merge visual and inertial data. Finally, the precise body dimensions of each subject are provided, including body height, weight, and segment lengths measured before the beginning of each recording session. These measurements were used to scale the Xsens biomechanical model and also constitute a ground truth for assessing how accurately markerless BPE estimates each subject’s body dimensions, as sketched below.
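As an example of the latter use, the short Python sketch below estimates a segment length from a markerless skeleton over a sequence and compares it with the corresponding measured dimension; the joint names and the per-frame data layout are assumptions made for illustration and do not reflect the dataset’s file format.

import numpy as np

def mean_segment_length(frames, joint_a, joint_b):
    """Average the distance between two joints over a list of skeleton frames,
    where each frame maps joint names to 3D positions (in metres)."""
    lengths = [np.linalg.norm(np.asarray(f[joint_a]) - np.asarray(f[joint_b]))
               for f in frames]
    return float(np.mean(lengths))

# estimated = mean_segment_length(skeleton_frames, "shoulder_right", "elbow_right")
# error = abs(estimated - measured_upper_arm_length)  # measured value provided with the dataset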
The recorded sequences include 15 participants performing a set of 12 ADLs (e.g., walking, sitting, and jogging). The actions were chosen to present different challenges to BPE algorithms, including different movement speeds, self-occlusions, and complex body poses. Moreover, multi-person sequences with up to 4 people performing a set of 7 different actions are provided. Such sequences offer challenging scenarios where multiple self-occluded persons move and interact in a restricted space. They allow assessing the accuracy of multi-person tracking algorithms focused on maintaining frame-by-frame consistent IDs of each detected person (a minimal sketch of such an association step is shown below). To this end, the proposed dataset has already been used to validate our previous work, describing a real-time open-source framework for multi-camera multi-person tracking [25]. Overall, the dataset contains over 1,400,000 frames of RGB, depth, and markerless BPE data from the 5 RGB-D cameras, while the inertial motion capture system provided over 600,000 frames of human poses, together with the corresponding raw IMU data from all the sensors used in each suit.
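The following Python sketch illustrates the kind of frame-to-frame association that such tracking algorithms perform: each existing track and each new detection is summarised by a reference joint (e.g., the pelvis), and detections are assigned to tracks by solving a minimum-cost matching. This is a generic illustration and not the algorithm proposed in [25].

import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_positions, detection_positions, max_dist=0.5):
    """Assign detections to tracks (T x 3 and D x 3 reference-joint positions)
    via the Hungarian algorithm, keeping pairs closer than max_dist metres."""
    cost = np.linalg.norm(track_positions[:, None, :]
                          - detection_positions[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_dist]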
The remainder of the paper is organized as follows. Section 2 describes the content and organization of the dataset. Section 3 presents the methods applied for data collection and describes how to replicate the setup used for the acquisitions. Finally, Section 4 concludes the article, addressing possible uses of the dataset in different research fields.
4. Conclusions
This paper presented UNIPD-BPE, an extensive dataset for single- and multi-person body pose estimation. Single-person sequences include 15 participants performing a set of 12 activities of daily living, while multi-person sequences include 7 actions with 2 to 4 persons interacting in a confined area.
The dataset includes high-definition RGB and depth data (corresponding to over 1,400,000 frames) recorded by a calibrated RGB-D camera network of 5 synchronized Azure Kinect cameras, as well as each subject’s full-body poses estimated using the Azure Kinect Body Tracking SDK. This makes it possible to assess the impact of exploiting different numbers and/or configurations of cameras on the accuracy achieved by markerless BPE algorithms. The provided markerless body poses can be used as a baseline, while the raw recorded data (RGB, depth, and camera network configuration) allow the dataset user to assess the performance and accuracy of any custom markerless BPE algorithm (based on RGB, depth, or both).
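As an illustration of such an assessment, the following Python sketch computes the Mean Per-Joint Position Error (MPJPE) between a candidate skeleton sequence and a reference one, after root-aligning both at a chosen joint; the array shapes and the root index are assumptions for illustration, and other alignment strategies (e.g., Procrustes) could equally be used.

import numpy as np

def mpjpe(pred, ref, root=0):
    """Mean Per-Joint Position Error between two F x J x 3 pose sequences,
    after subtracting the root joint position in every frame."""
    pred_centered = pred - pred[:, root:root + 1, :]
    ref_centered = ref - ref[:, root:root + 1, :]
    return float(np.mean(np.linalg.norm(pred_centered - ref_centered, axis=-1)))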
Furthermore, inertial motion capture poses were obtained by exploiting highly accurate Xsens MVN Awinda full-body suits, corresponding to a total of over 600,000 frames recorded by each of the 17 IMUs of every suit. All sensors are hardware-synchronized, with the Xsens MVN Awinda system acting as the master triggering the acquisitions. The relative pose of each camera with respect to the inertial reference frame is accurately calibrated before each sequence to ensure the best overlap of the two systems’ outputs. This allows the inertial motion capture estimates to be used to further investigate the accuracy of different markerless BPE algorithms. Since the raw IMU data are also available, the dataset can additionally be used to develop novel sensor fusion algorithms aiming at improving the performance of both markerless motion capture, by increasing the achievable accuracy, and inertial motion capture, by limiting possible drift phenomena.
The multi-person sequences offer challenging scenarios where multiple partially occluded persons move and interact in a restricted space. This allows us to investigate the performance of multi-person tracking algorithms, both regarding the accuracy of the pose estimation in cluttered environments, and the ability to maintain frame-by-frame consistent IDs of each detected person in the scene.
The proposed dataset also presents some limitations. Due to the hardware used in the RGB-D camera network, no optoelectronic data could be included; such data would have offered an additional source of information, also allowing us to assess the accuracy of the inertial motion capture. Moreover, the main focus of the dataset is on the validation of different BPE algorithms. As a result, all recordings were acquired in a laboratory environment, with a limited amount of background clutter, to ensure the best overlap between markerless and inertial body poses.
To conclude, the UNIPD-BPE dataset aims to push forward the development of markerless BPE and tracking algorithms, enabling a variety of applications where unobtrusive and accurate knowledge of human motion is of paramount importance. The dataset in fact includes data for single-person RGB- and depth-based human motion estimation, for multi-person BPE and tracking, and for visual–inertial sensor fusion. The high-definition videos and point clouds, recorded by 5 calibrated and synchronized RGB-D cameras, allow simulating a variety of different scenarios (e.g., a pure RGB camera network, a pure depth camera network, an uncalibrated camera network, etc.). Finally, the included markerless and inertial body poses support the development and testing of different multimodal sensor fusion and people tracking algorithms, without the need for expensive hardware and bulky acquisition setups.