1. Introduction
Robot assistants have recently emerged as a promising solution for elderly care and monitoring in indoor domestic environments. The increasing demand for service robotic platforms for indoor assistance has paved the way for the development of diverse robotic solutions, especially those devoted to elderly care [1,2]. According to the World Population Prospects (2019) provided by the United Nations [3], life expectancy has reached 72.6 years, with a projected value of 77.1 years in 2050. Furthermore, projections reveal that by 2050 there will be more people aged 65 years or over than young people aged 15 to 24 years [3]. Population ageing dramatically impacts the organization of our society, exacerbating delicate issues such as the isolation of numerous vulnerable subjects and elderly people in their homes for most of the day. Moreover, the recent emergency related to the COVID-19 outbreak has further increased the need for reliable and automatic assistance tools in both hospitals and patients' residential environments. In this scenario, robots have been demonstrated to be a key technological ally in fighting the pandemic and its dramatic social effects, such as isolation [4,5]. Indeed, they can offer support to both medical staff and families whenever the services of dedicated assistive operators or volunteers are not available due to the intensive demand generated by the pandemic.
Robotic solutions are often focused on the interactive social aspects with the user [6,7], or, conversely, they try to continuously check the health status of the patient [8,9]. However, a reliable and effective navigation algorithmic stack is a necessary condition to realistically deploy a robotic platform in a cluttered environment designed for humans. The most recent advances in human-aware robot navigation [10] show how planning and control algorithms can be successfully adapted to social circumstances.
For the specific case of a robotic assistant that aims at constantly monitoring, accompanying, and supporting the user within a domestic or medical environment, the ability to follow the person is crucial. Indeed, person-following [11,12] is the primary challenging task that enables any visual or vocal interaction with the robot while the user is moving around. On the other hand, subjects with reduced mobility may also need the robot to accomplish desired services in the room, moving towards different destinations. Keeping an eye on the user during the execution of such secondary functions constitutes a huge benefit for the robot assistant's main goal: monitoring the person's condition. Person-following and goal-based navigation probably represent the two most common navigation behaviors of an indoor robot assistant. However, monitoring while navigating can raise serious difficulties for conventional differential drive platforms, which cannot describe a curved motion without a change in orientation. This limitation often leads differential drive robots to lose the human target while avoiding obstacles or following an occluded path. The same argument does not apply to an omnidirectional platform: in this case, the robot can move along any direction of the horizontal plane without changing its orientation.
In this work, we focus our research on the development of a human-centered autonomous navigation system for a robotic assistant, which aims at fulfilling the user assistance requirement in both of the described scenarios: goal-based navigation (Figure 1) and person-following (Figure 2). Hence, we adopt a small omnidirectional robotic base platform to fully exploit its kinematic advantages and propose an optimized person-following methodology that always guarantees collision-free trajectory planning combined with continuous visual tracking of the user. Moreover, our solution also enables the robot assistant to move toward a desired destination while adjusting the orientation of the platform to keep active visual contact with the user. This results in increased reliability of the robotic assistant, which is able to perform different tasks while continuously checking the status of the person and calling for help if dangerous situations are detected.
We first set up a real-time perception pipeline to identify and track the person's pose. The resulting position is exploited in the case of person-following, where it constitutes the dynamic goal of the navigation. Differently, in the case of a goal-based navigation task, the goal is represented by the desired destination coordinates $(x_G, y_G)$. A local planner generates a collision-free trajectory, handling the linear velocity commands $v_x$ and $v_y$, while an additional module tunes the control of the angular yaw velocity $\omega_z$ in order to constantly maintain the orientation towards the person (Figure 1).
The contribution of this work is threefold:
We identify an omnidirectional motion planning approach as a robust, effective solution to boost the mobility of a robotic assistant during its principal navigation activities (person-following and goal-based navigation);
We set up a real-time, cost-effective perception pipeline to extract the coordinates of the person and visually track their pose;
We effectively integrate a navigation algorithmic stack that separately handles trajectory generation for obstacle avoidance and orientation control for person monitoring.
Moreover, compared to most previous works, we carried out extensive experimentation for both person-following and static goal navigation with the robot. To this end, we set up an innovative experimental framework based on an ultra-wideband (UWB) anchor system to localize both the person and the robot while moving and to measure their relative distance and orientation. Our results validate the performance of our solution and show the competitive advantage and robustness it can provide in visually monitoring the user while avoiding obstacles in a cluttered indoor environment, such as a domestic one.
The article is organized as follows. In Section 2, we discuss related works presented in the literature. In Section 3, we first introduce the human-centered navigation tasks, then we discuss the core methods of our solution, describing the perception and the omnidirectional navigation algorithms. In Section 4, the experimental settings and validation scenarios for both person-following and goal-based navigation are thoroughly presented, discussing the relevance of the obtained results. Section 5 concludes the article and proposes possible future works.
3. Human-Centered Autonomous Navigation
We define human-centered navigation as the service robotic task of autonomously navigating within a domestic environment while maintaining constant track of the subject of interest. On this basis, we propose a novel system to handle human-centered autonomous navigation in cluttered and unstructured environments, using an omnidirectional robotic platform (Figure 3). We define two different use cases: in the first, the rover has to move towards a series of specific destinations, keeping visual contact with the user during the whole operation; in the second, the rover performs a person-following task, where the position of the subject, extracted from the perception system, is used as a dynamic goal for navigation. According to this concept, the autonomous platform should always be aware of the subject's position during its navigation, which means keeping its orientation towards the person and maintaining them in the camera's field of view.
3.1. Perception and Tracking
In this work, we developed a deep learning perception pipeline that allows the robot to visually track the person. The scheme presented in Figure 4 describes the complete perception pipeline used to extract, at each time instant, the coordinates of the person in the robot reference frame from RGB-D images. A RealSense D435i depth camera, mounted on the rover at human height, is used to collect color images of the environment. In a first step, the person's presence is detected with PoseNet [36], a lightweight deep neural network that estimates the pose of humans in images and videos. For each person present in the scene, the network outputs the position of 17 key joints (such as elbows, shoulders, or feet). In our implementation, PoseNet runs on a Google Coral Edge TPU device (https://coral.ai, accessed on 11 July 2020) at 30 frames per second (FPS), which corresponds to the maximum frame rate supported by the RealSense D435i camera. The key points predicted by PoseNet are then translated into a bounding box that localizes the person within the image. The resulting bounding box is tracked with SORT [37], a simple online and real-time tracking algorithm based on the Kalman filter. SORT also keeps track of the subject when they leave the frame for a few moments, and associates an ID with each person in the image. This ID is maintained as long as the person does not leave the frame for several consecutive time instants.
At this point, a depth image extracted from the RealSense camera, aligned with the RGB image, can be used to compute the relative position of the detected individual in the robot reference frame. To do so, it is necessary to identify a precise area, or better, a specific point of the image where we can confidently expect to find the person. For this purpose, the output key joints of PoseNet represent particularly suitable information: in comparison, a conventional person detection approach can only localize the person in an approximate bounding box area. This information is inadequate for the person position tracking task, since the bounding box contains both points associated with the person and points belonging to the background. The risk is that the system could treat a point of the image belonging to the background as a point belonging to the human body, causing an error in the evaluation of the subject's position. A set of particularly reliable key points of the estimated pose is therefore selected to find the person's center point C on the color image. When the neural network identifies both shoulders of a person with high confidence, the point C is selected as the average of these joints. If the shoulders are not recognized but the hips are, then the selected point becomes the midpoint between the two hips. If neither shoulders nor hips are recognized with a sufficient degree of confidence, then the detection of the person is considered invalid. This structure guarantees an estimate of the person's position in the environment reliable enough to be fully usable by the robot navigation system, avoiding the risk of misleading target estimates and, consequently, inaccurate motion planning. The distance of the person from the robot is then extracted from the depth frame as the value corresponding to the point C. At each time instant, the image coordinates of C and the corresponding depth value are translated into the person's position $(x_P, y_P)$ in the robot reference frame with basic reference frame transformations. This position is used by the navigation control stack described in the following section.
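A minimal sketch of this key-point-based localization step is reported below. The joint names, the confidence threshold, and the pinhole back-projection with generic intrinsics are assumptions made for illustration purposes and do not reproduce the exact implementation.

```python
# Sketch of the center-point selection and depth back-projection step, assuming
# PoseNet key joints are given as {name: (u, v, confidence)} in pixel coordinates
# and that the depth frame is aligned with the RGB frame. Names, thresholds, and
# intrinsics are illustrative, not the authors' exact values.
import numpy as np

CONF_THRESH = 0.5  # hypothetical key-joint confidence threshold

def select_center_point(keypoints):
    """Return the pixel (u, v) of the person's center point C, or None if invalid."""
    for left, right in (("left_shoulder", "right_shoulder"), ("left_hip", "right_hip")):
        a, b = keypoints.get(left), keypoints.get(right)
        if a and b and a[2] > CONF_THRESH and b[2] > CONF_THRESH:
            return (a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0
    return None  # neither shoulders nor hips detected confidently

def pixel_to_robot_frame(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel C with its depth into a 3D point (pinhole model).
    The point is expressed in the camera frame; a fixed camera-to-robot
    transform would then map it into the robot reference frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([depth_m, -x, -y])  # example optical-to-robot axis convention
```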
As an interesting point of discussion, we found that simply detecting the people present in the image may not be sufficient to efficiently track a specific human subject. In particular, two well-known problems can arise:
Especially in crowded environments, where multiple people are present in every frame, the subject could be mistaken for another person in the image (or vice versa);
Without a component capable of tracking observations from previous time instants, it can be very difficult to guarantee real-time performance if the detection of the subject is lost for a few consecutive frames. This problem can be particularly critical in all those situations with an occluded view of the subject due to obstacles or other people present in the scene.
Although a person re-identification algorithm could mitigate the first problem by allowing the system to recognize a specific person, at the cost of additional computation, dealing with the second can be much more arduous without a component specifically designed for tracking. Aiming to solve both problems at the most convenient computational cost, we adopt SORT in the person detection pipeline to also exploit predicted estimates of the person's pose, allowing the rover to keep tracking the desired subject as long as necessary and to discriminate them from other people in the scene. Nonetheless, a re-identification neural network could easily be integrated as the first stage of our pipeline if strictly required by a particular case study.
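As an illustration, the detector output can be fed to the tracker as sketched below, assuming the reference SORT implementation's Sort class and a hypothetical keypoints_to_bbox helper; the tracker parameters are purely indicative.

```python
# Sketch of wiring PoseNet detections into SORT, assuming the reference
# implementation's Sort class (sort.py from https://github.com/abewley/sort)
# is available on the Python path. keypoints_to_bbox() is a hypothetical helper
# that wraps the 17 key joints of one person into a box with a detection score.
import numpy as np
from sort import Sort

tracker = Sort(max_age=5, min_hits=2, iou_threshold=0.3)  # illustrative parameters

def keypoints_to_bbox(keypoints, score):
    us = [u for (u, v, c) in keypoints.values()]
    vs = [v for (u, v, c) in keypoints.values()]
    return [min(us), min(vs), max(us), max(vs), score]

def track_people(per_person_keypoints, scores):
    dets = np.array([keypoints_to_bbox(kp, s)
                     for kp, s in zip(per_person_keypoints, scores)])
    if len(dets) == 0:
        dets = np.empty((0, 5))
    # Each returned row is [x1, y1, x2, y2, track_id]; the ID persists across
    # short occlusions thanks to the Kalman-filter prediction step.
    return tracker.update(dets)
```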
3.2. Omnidirectional Motion Planner and Obstacle Avoidance
Typically, a navigation system requires some fundamental components. The first necessity is to localize the robot in the operating environment. In order to compute the trajectory towards a goal, the system needs to acquire the pose (position and orientation) of the rover with respect to a fixed reference frame. This piece of information needs to be retrieved with a sufficient frequency to ensure real-time performance, since we need to keep track of the position of the rover over time as it moves towards a different location. Obviously, the required localization frequency increases as the speed of the platform grows. In our implementation, we exploited a RealSense T265 tracking camera to obtain information about the rover's pose. This camera employs state-of-the-art Visual Inertial Odometry (VIO) algorithms, which use visual and inertial information to provide odometry data at a frequency of roughly 200 Hz, more than sufficient for any indoor autonomous platform.
Second, we need a path planner. If a map of the operative scenario is provided from the beginning, it is possible to compute an optimal trajectory knowing a priori the location of each obstacle (global planner). However, in the majority of service robotics navigation cases, a global planner is not sufficient, since a map is not always available. Moreover, even if a map is given, real-life domestic environments are highly dynamic, with obstacles whose position can change over time (chairs, bins) or that can move on their own (people, animals, other autonomous platforms). In these cases, a real-time perception system together with a local planner is necessary to dynamically re-plan the upcoming commands on the basis of the last perceived data and perform effective obstacle avoidance. The visual perception pipeline described in Section 3.1 is used exclusively to extract the coordinates $(x_P, y_P)$ of the person in the scene. Differently, we use an RPLiDAR A1 to retrieve 2D laser scan distance measurements of the obstacles around the robot at each time instant, which are subsequently used to feed a local path planner.
Information regarding the rover’s and obstacles’ position is passed to the navigation system, which we developed tailoring the Navigation2 navigation stack (
https://navigation.ros.org/ (accessed on 13 July 2022)) for the specific use case of assistance and person monitoring. Nav2 is a highly modular navigation system based on behavior trees, which allows integration with custom plugins adapted for any specific application. It provides default modules for converting laser scan data into cost-map representation, planning a path towards a goal, and controlling the rover along it. Although Nav2 is a very complete system for a conventional navigation application, we needed to modify it extensively to customize the overall algorithmic stack to handle both person-following and goal-based navigation with a unique solution for person-monitoring, integrating new plugins and behavior tree entries.
Since domestic scenarios fall within unstructured environments, for which a map is rarely provided, we decided to rely on a local planner. This option allows the rover to be deployed in unknown scenarios without the need for preliminary information, since the system plans its navigation paths depending only on real-time spatial data derived from the LiDAR sensor. The resulting navigation system consists of a DWB local planner and controller, able to generate an obstacle-free trajectory towards the goal and drive the rover along it. To decouple the control of linear and angular velocities, we forbid DWB from including the yaw velocity in the dynamic path planning, forcing it, instead, to plan a safe trajectory and control the rover using only the two linear velocities $v_x$ and $v_y$, along the x and y axes of the horizontal plane. The goal of the navigation task coincides with the person's position in the specific case of person-following, while it is a separate target point to be reached while monitoring the person in the service navigation scenario.
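As an illustration, the restriction of DWB to the two linear velocities can be expressed through its kinematic parameters. The sketch below mirrors a Nav2 parameter file in Python dictionary form; the parameter names (e.g., max_vel_theta, vtheta_samples) are assumptions based on standard DWB options and should be checked against the installed Nav2 release, and the values are purely indicative.

```python
# Sketch of parameter values that force DWB to plan with linear velocities only,
# written as a Python dict mirroring a Nav2 YAML parameter file. Parameter names
# and values are illustrative assumptions, not the authors' exact configuration.
dwb_params = {
    "FollowPath": {
        "plugin": "dwb_core::DWBLocalPlanner",
        "min_vel_x": -0.3, "max_vel_x": 0.3,   # exploit holonomic motion along x
        "min_vel_y": -0.3, "max_vel_y": 0.3,   # ... and along y
        "max_vel_theta": 0.0,                  # no yaw command from the planner
        "vx_samples": 20, "vy_samples": 20,
        "vtheta_samples": 1,                   # a single (zero) yaw sample
    }
}
```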
3.3. Person-Focused Orientation Control
The angular velocity $\omega_z$ is provided by another system node, which at any instant computes the angular difference $\Delta\theta$ between the orientation of the rover and the orientation of the vector connecting the rover's center of rotation with the person position $(x_P, y_P)$, retrieved from the perception module:

$$\Delta\theta = \operatorname{atan2}\left(y_P, x_P\right),$$

where the person position is expressed in the robot reference frame. The yaw velocity is then calculated as follows:

$$\omega_z = \begin{cases} k_\omega \, \Delta\theta, & \left|k_\omega \, \Delta\theta\right| \leq \omega_{max} \\ \omega_{max} \operatorname{sign}\left(\Delta\theta\right), & \text{otherwise} \end{cases}$$

where $k_\omega$ is the proportional gain and $\omega_{max}$ is the maximum allowed yaw rate. After some tests on our indoor application, we found suitable values for these parameters at 1.3 and 1.5 rad/s, respectively, but they can be changed depending on the specific operating scenario.
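A compact sketch of this yaw controller is given below, assuming the proportional law with saturation described above; the gain and limit values are those reported in the text, and the function should be treated as illustrative rather than as the exact implementation.

```python
# Minimal sketch of the person-focused yaw controller: a proportional law on the
# bearing of the person in the robot frame, saturated at a maximum yaw rate.
import math

K_OMEGA = 1.3      # proportional gain (value reported in the text)
OMEGA_MAX = 1.5    # maximum yaw rate [rad/s] (value reported in the text)

def yaw_command(x_p, y_p):
    """Angular velocity keeping the robot oriented towards the person at (x_p, y_p),
    expressed in the robot reference frame (x forward, y left)."""
    delta_theta = math.atan2(y_p, x_p)             # bearing error w.r.t. robot heading
    omega = K_OMEGA * delta_theta
    return max(-OMEGA_MAX, min(OMEGA_MAX, omega))  # saturation
```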
Figure 5 summarizes the complete proposed human-centered navigation system. The upper blue section of the scheme contains the extraction of the person position $(x_P, y_P)$ in the robot reference frame through the visual perception pipeline (presented in Section 3.1). The yaw controller then processes this position to obtain the angular velocity command $\omega_z$ needed to keep the platform oriented towards the person. In the lower red section of the scheme, the DWB local planner receives the LiDAR range points and the goal coordinates $(x_G, y_G)$ to produce a collision-free trajectory and provide the linear velocities $v_x$ and $v_y$. The full velocity command for the robot is, therefore, obtained by combining linear and angular velocities in the vector $(v_x, v_y, \omega_z)$. Obviously, the view of the subject can be occluded by physical obstacles, but if the RGB camera is mounted on the robot at a height greater than most objects in the operating environment (such as tables, chairs, sofas, and desks), as we did on our platform, the rover can navigate through cluttered spaces while still maintaining its sight centered on the user.
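The combination of the two command sources can be pictured with the following ROS2 node sketch. The topic names (person_position, planner_cmd_vel) are hypothetical placeholders, and the node is only indicative of how the merge could be wired, not the actual implementation.

```python
# Sketch of a ROS2 node combining the DWB linear velocities with the yaw
# controller output into a single velocity command. Topic names are hypothetical.
import math
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PointStamped, Twist

K_OMEGA, OMEGA_MAX = 1.3, 1.5  # gain and yaw-rate limit from the previous section

def yaw_command(x_p, y_p):
    return max(-OMEGA_MAX, min(OMEGA_MAX, K_OMEGA * math.atan2(y_p, x_p)))

class CommandMerger(Node):
    """Merges planner linear velocities (v_x, v_y) with the person-focused yaw rate."""
    def __init__(self):
        super().__init__("command_merger")
        self.omega_z = 0.0
        self.create_subscription(PointStamped, "person_position", self.on_person, 10)
        self.create_subscription(Twist, "planner_cmd_vel", self.on_planner, 10)
        self.pub = self.create_publisher(Twist, "cmd_vel", 10)

    def on_person(self, msg):
        # person position already expressed in the robot reference frame
        self.omega_z = yaw_command(msg.point.x, msg.point.y)

    def on_planner(self, msg):
        cmd = Twist()
        cmd.linear.x, cmd.linear.y = msg.linear.x, msg.linear.y  # v_x, v_y from DWB
        cmd.angular.z = self.omega_z                             # omega_z from yaw control
        self.pub.publish(cmd)

def main():
    rclpy.init()
    rclpy.spin(CommandMerger())

if __name__ == "__main__":
    main()
```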
This intuition was initially conceived for an autonomous indoor assistant addressed to elderly or disabled users who need constant monitoring, even when the platform has to move to another place in the room to carry out a different task. However, it can be adapted to many different applications, for example, to all those situations in which the platform has to perform a specific operation while constantly focusing on another human operator, either for monitoring purposes or to receive new instructions through visual inputs. This application can be particularly useful in the fight against COVID-19 and future pandemics for assisting patients in hospitals and in their houses. The rover can replace the intervention of medical personnel, greatly reducing the risk of contagion and spread of the virus, continue to monitor the patient, and eventually request human help in case of abnormal situations.
4. Experiments and Results
For our experimentation, we used a low-cost omnidirectional robotic platform with four mecanum wheels, presented in [13]. The whole software system is executed on a single Intel NUC11TNHv5 PC, directly integrated within the rover. As stated before, the platform mounts an RPLiDAR A1 sensor, a RealSense D435i camera for person detection, and a RealSense T265 camera for visual odometry. Overall, the platform presents a very basic configuration, easily replicable with simple commercial components on a generic omnidirectional platform. In this sense, our solution is cost-effective, avoiding the need for more complex and expensive sensors and systems for person tracking, such as active gimbals or 360-degree cameras. Furthermore, the software system is lightweight enough to run on integrated hardware at the edge and reach real-time performance.
All the software components and technologies needed to perceive and navigate the environment have to be merged into a single organic system in order to fulfill the different tasks. The most widespread solution in the literature requires the use of a middleware [38], an abstraction layer that resides between the operating system and software applications. In this work, we adopted the Robot Operating System 2 (ROS2) (https://docs.ros.org/en/foxy/index.html, accessed on 18 July 2022), due to the variety of compatible algorithms and the very active community supporting it. It provides several advantages and improvements compared to the original ROS (https://www.ros.org/, accessed on 28 July 2022), since it is more suitable for real-time systems and has access to more advanced applications [39]. ROS2 is based on a Data Distribution Service (DDS) structure, with nodes that publish and subscribe to different topics.
Two different kinds of experiments are conducted:
The first experimental stage aims at demonstrating the efficiency of the person-centered navigation task for monitoring purposes, where the rover has to navigate from a point A to a target point B of coordinates $(x_G, y_G)$, maintaining its focus on the subject located at $(x_P, y_P)$;
The second series of experiments takes into consideration the person-following task, where $(x_G, y_G)$ and $(x_P, y_P)$ coincide and represent the dynamic goal obtained from the visual perception pipeline, which identifies and tracks the person of interest.
For these tests, the system has been integrated with additional functionalities to refine the platform's behavior and further increase its awareness of the person during navigation.
Safety distance module. During rover operation, the user's safety should always be ensured, even if this leads to the failure of the requested task. For this reason, we insert a module able to truncate the navigation path of the rover, which guarantees that a minimum distance of one meter from any person is always maintained.
Recovery policy for person tracking. During the navigation towards a specified goal, the rover may lose track of the person. If the track is not resumed within a certain time interval, a dedicated module we added commands the rover to interrupt the navigation and start rotating towards the direction in which the person was last perceived, in an attempt to regain visual contact with the user.
Recovery policy for person-following. The same problem described above can occur during the person-following task but, in this case, the consequences can be even worse, since the knowledge of the person's position affects not only the yaw but also the linear components of the navigation. To re-establish the track of the person, the rover first heads towards the last known position of the user, maintaining its orientation towards that location. This choice compensates for all those cases in which the person takes a turn behind an obstacle, such as a wall, where simply moving towards the corner of the curve where the user was last seen is enough to regain visual contact. If this proves not to be sufficient, once the robot has reached the last known position, the rover starts rotating as described above.
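The following sketch illustrates this recovery logic. The navigator helpers (set_goal, reached, rotate_towards), the state variables, and the timeout are hypothetical placeholders introduced for illustration, not the actual modules.

```python
# Illustrative sketch of the person-following recovery policy: head towards the
# last known person position while facing it, then rotate in place if visual
# contact is still not regained. Helpers and timeout values are hypothetical.
import time

LOST_TIMEOUT = 2.0  # seconds without a valid detection before recovery (illustrative)

class FollowRecovery:
    def __init__(self, navigator):
        self.nav = navigator            # hypothetical wrapper over the navigation stack
        self.last_seen = None           # last person position in the map frame
        self.last_time = time.monotonic()

    def on_person(self, position):
        self.last_seen = position
        self.last_time = time.monotonic()
        self.nav.set_goal(position)     # normal person-following behavior

    def step(self):
        if self.last_seen is None or time.monotonic() - self.last_time < LOST_TIMEOUT:
            return
        if not self.nav.reached(self.last_seen):
            # Stage 1: move towards the last known position, keeping it in view.
            self.nav.set_goal(self.last_seen)
        else:
            # Stage 2: rotate in place towards the last perceived direction.
            self.nav.rotate_towards(self.last_seen)
```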
For each tested scenario, tests are performed with the same omnidirectional rover in two different configurations. In the first configuration, the rover adopts our novel navigation methodology: it plans collision-free trajectories fully exploiting its omnidirectional kinematics, combining the two linear velocities $v_x$ and $v_y$. The angular yaw velocity $\omega_z$ is controlled by the person tracking module to always maintain visual contact with the followed person. In the second configuration, the rover behaves like a differential drive platform. This means it can only exploit the linear velocity $v_x$, while control of the lateral velocity $v_y$ is denied, and the angular yaw velocity $\omega_z$ is dedicated solely to navigation purposes. This procedure allows the comparison of performance between our solution and a generic differential drive platform in tracking the user.
4.1. Person-Centered Navigation
Tests are performed in two different scenarios, both depicting a hallway characterized by low walls, which represent potential obstacles present in a realistic domestic scene. The rover camera can see over the walls, but the platform is forced to avoid them in order to reach its goal. The starting point and the destination $(x_G, y_G)$ are the same in the two scenarios. What changes is the position of the person $(x_P, y_P)$: near the destination point in the first scenario (Figure 6a), and in the corner of the hallway in the second (Figure 6b). In these preliminary trials, the person maintains their position for the whole extent of the test. The rover odometry data are acquired at a frequency of 5 Hz.
Seven tests are performed for each scenario and both configurations, omnidirectional and differential. The error term $\Delta\theta$ is the angular difference between the orientation of the rover and the orientation of the vector connecting the rover's center of rotation with the person's position. The horizontal FOV of the RealSense D435i (RGB stream) is equal to 69°. The angular difference should never exceed half this angle, approximately 35°, to constantly keep track of the person's position.
The metrics considered for each test are the average angular difference $\overline{\Delta\theta}$ with its standard deviation, the root mean square error (RMSE), and the mean absolute error (MAE) maintained along the whole path, considering $\Delta\theta = 0$ as the optimal value. Table 1 reports, for each scenario and each metric, the average value computed over all the different tests, and the percentage improvement introduced by our methodology.
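For clarity, these metrics can be computed from the angular error series logged along a run as in the following minimal sketch.

```python
# Sketch of the evaluation metrics on the angular error series recorded along a
# run, with 0 rad as the optimal value: mean, standard deviation, RMSE, and MAE.
import numpy as np

def tracking_metrics(delta_theta):
    """delta_theta: array of angular differences [rad] sampled along the path."""
    e = np.asarray(delta_theta)
    return {
        "mean": e.mean(),
        "std": e.std(),
        "rmse": np.sqrt(np.mean(e ** 2)),
        "mae": np.abs(e).mean(),
    }
```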
As can be seen from the results and Figure 6, the omnidirectional system is able to efficiently navigate towards the goal, constantly maintaining its orientation towards the person. The angular error $\Delta\theta$ is kept at extremely low average values in both scenarios (Table 1). Furthermore, the maximum recorded value of $\Delta\theta$ remains well below the limit of approximately 35° imposed by the camera's FOV. This means the system can keep tracking the person for the whole extent of the navigation. Moreover, from the data collected during the experimentation, the perception and tracking system described in Section 3.1 was able to correctly recognize and localize the followed person within the environment on average 29 times per second, while velocity commands are provided at frequencies above 15 Hz at all times.
For comparison, we also report the results obtained with the differential drive configuration. However, this comparison is inherently uneven: as explained before, a differential drive platform has to choose whether to navigate towards the goal or remain oriented towards the person. This is particularly evident in the second scenario, where the person and the goal are in two completely different positions.
4.2. Person Following
For the person-following task, tests are performed in four different scenarios, whose geometric configuration can be seen in Figure 7. Similar to the previous test stage, obstacles consist of low walls, except in the fourth scenario, where they are full-height walls. Contrary to the previous case, the person to be followed moves for the whole extent of the test. The rover has to follow the person, using the position $(x_P, y_P)$ extracted from the visual perception pipeline as a dynamic goal of the navigation. For this reason, to ensure accurate ground truth data collection, we set up a localization system based on four ultra-wideband anchors placed in the testing area. One additional anchor is placed on the rover, and the followed person holds another one. The rover's orientation is also aligned with the reference frame used by the ultra-wideband system. In such a way, it is possible to know the actual relative position between the rover and the followed person, which allows us to correctly compute the angular difference $\Delta\theta$ at any time instant. To our knowledge, this experimental setting is the first attempt in the literature to quantitatively measure the performance of a person-following system, going beyond the typical qualitative evaluation.
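As a reference, the ground-truth angular difference can be obtained from the ultra-wideband data as in the following sketch, where variable names are illustrative.

```python
# Sketch of the ground-truth angular difference computed from UWB localization
# data: rover and person positions in the UWB frame plus the rover yaw, which is
# aligned with the same frame. Variable names are illustrative.
import math

def ground_truth_delta_theta(rover_xy, person_xy, rover_yaw):
    """Angle [rad] between the rover heading and the rover-to-person direction."""
    bearing = math.atan2(person_xy[1] - rover_xy[1], person_xy[0] - rover_xy[0])
    delta = bearing - rover_yaw
    return math.atan2(math.sin(delta), math.cos(delta))  # wrap to (-pi, pi]
```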
As already done for the first test stage, seven validation runs are performed for each rover configuration in every scenario. The same error term $\Delta\theta$ and metrics discussed in the previous section are used to evaluate the person-following performance. Results can be consulted in Table 2. Furthermore, Figure 8 and Figure 9 report, for each scenario and each configuration, a visualization of the performed tests. The gridmaps reported in the figures are directly obtained from the rover during the navigation, while the rover and person poses are obtained from the ultra-wideband system. As can be seen, our methodology proves to track the followed person more robustly and effectively than a traditional differential drive navigation in all the considered scenarios. In the omnidirectional configuration (Figure 8a,c and Figure 9a,c), the rover manages to always maintain the user within the camera's view, contrary to the differential drive case, where visual contact is instead lost several times. This generally leads to higher performance in following the user, with the rover planning more optimal collision-free trajectories while fully satisfying the person monitoring requirement. The obtained values of $\Delta\theta$ clearly show the performance gap in all scenarios, demonstrating the successful behavior in monitoring the person provided by our solution. Additionally, in the fourth scenario (Figure 9c,d), where the wall obstructs the rover's view of the user after the curve, it appears clear that the ability to remain facing the human dynamic goal allows for a more accurate re-acquisition of tracking as soon as the obstacle is passed. In this last scenario, the differential drive system registers the highest orientation error, with a substantial average gap from our solution (Table 2).
5. Conclusions and Future Works
In this work, we propose a novel, cost-effective approach for human-centered autonomous navigation in the context of domestic robotic assistance. In particular, we devote great focus to developing a robust solution to visually monitor the user in two different case studies, which we consider the most relevant and common for a robot assistant: person monitoring during navigation to a target goal and person-following. Unlike previous works, the core of our assistive solution relies on the idea that keeping the platform oriented towards the subject allows us to continuously check their status, even when the robot is moving and avoiding obstacles typically present in a realistic indoor environment. To this end, we first set up a real-time visual perception pipeline that reliably provides the coordinates of the person in the robot reference frame using a cheap RGB-D camera. Then, adopting a generic omnidirectional platform, we propose a navigation system that treats orientation control and dynamic trajectory planning separately to fulfill both the monitoring and the obstacle avoidance objectives of the robotic assistive task. Our extensive experimentation, conducted for both the considered use cases in realistic settings, demonstrates the competitive advantages and the robustness of our solution compared to a common differential drive navigation. Moreover, it also advances the typical experimental framework for person-following, quantitatively evaluating the physical tracking of the person with an ultra-wideband localization system. To our knowledge, this is the first study to investigate the omnidirectional capability of a robotic platform to enable true human-centered navigation, where the care and attention for the user's health are considered the main focus of the navigation task. Future works may investigate the integration of a person re-identification deep neural network in the visual perception pipeline to recognize a specific user, which would contribute significantly to a real application.