Reinforcement Learning of Bipedal Walking Using a Simple Reference Motion
Abstract
1. Introduction
2. Related Work
3. Proposed Method
3.1. Reference Motion
3.2. Action and Target Joint Angle
3.3. State
3.4. Reward
3.4.1. Orientation Reward
3.4.2. Velocity Reward
3.4.3. Symmetry Reward
3.4.4. Survival Reward
3.4.5. Torque Penalty
3.5. Inverse Kinematics
3.6. Optimizing the Parameters of the Reference Motion
3.7. Optimizing
4. Numerical Experiments
4.1. Simulation Setups
4.2. Methods for Comparison
4.2.1. RL Only
4.2.2. Imitation Learning 1
4.2.3. Imitation Learning 2
4.3. Experimental Results
5. Real-Robot Experiments
5.1. Real-Robot System
5.2. Domain Randomization
5.3. Experimental Results
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Joint | a [mm] | α [rad] | d [mm] | θ [rad] |
---|---|---|---|---|
1 | 0 | | 23 | * |
2 | 41 | | 0 | |
3 | 65 | 0 | 0 | |
4 | 65 | 0 | 0 | |
5 | 41 | | 0 | |
6 | 25 | 0 | 0 | |
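The column units (length, angle, length, angle) follow the pattern of standard Denavit–Hartenberg link parameters, which is the convention assumed in the reconstructed header; the blank cells were equation images lost in extraction. Under that convention, joint i contributes the homogeneous transform below, and the foot pose is the product T_1 T_2 ⋯ T_6 that an inverse-kinematics routine (Section 3.5) inverts to reach a target foot position.

```latex
T_i
= \mathrm{Rot}_z(\theta_i)\,\mathrm{Trans}_z(d_i)\,\mathrm{Trans}_x(a_i)\,\mathrm{Rot}_x(\alpha_i)
= \begin{pmatrix}
\cos\theta_i & -\sin\theta_i\cos\alpha_i & \sin\theta_i\sin\alpha_i & a_i\cos\theta_i \\
\sin\theta_i & \cos\theta_i\cos\alpha_i & -\cos\theta_i\sin\alpha_i & a_i\sin\theta_i \\
0 & \sin\alpha_i & \cos\alpha_i & d_i \\
0 & 0 & 0 & 1
\end{pmatrix}
```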
Hyperparameter | Value |
---|---|
Time steps per actor batch | 2048 |
Batch size | 128 |
Optimization epochs | 10 |
Learning rate | |
Discount factor | |
GAE parameter λ |
Clip parameter | |
Entropy coefficient |
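For concreteness, the sketch below shows how a configuration like the one above might be passed to a PPO implementation. Stable-Baselines3 and the stand-in environment are assumptions on our part, and the rows whose values are blank in the table (learning rate, discount factor, GAE parameter, clip parameter, entropy coefficient) are filled with the library's common defaults purely as placeholders, not the paper's settings.

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Stand-in environment for illustration; the paper trains a simulated
# humanoid robot, not this Box2D walker.
env = gym.make("BipedalWalker-v3")

model = PPO(
    "MlpPolicy",
    env,
    n_steps=2048,        # time steps per actor batch (from the table)
    batch_size=128,      # batch size (from the table)
    n_epochs=10,         # optimization epochs (from the table)
    learning_rate=3e-4,  # placeholder; value missing from the extracted table
    gamma=0.99,          # placeholder discount factor
    gae_lambda=0.95,     # placeholder GAE parameter
    clip_range=0.2,      # placeholder clip parameter
    ent_coef=0.0,        # placeholder entropy coefficient
)
model.learn(total_timesteps=1_000_000)
```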
Method | Walking Distance in 750 Steps (i.e., 30 s) [cm] | Speed [cm/s] |
---|---|---|
Proposed method | ||
Proposed method w/o Bayesian optimization | ||
RL only | - | - |
Imitation Learning 1 | - | - |
Imitation Learning 2 | - | - |
Parameter | Value |
---|---|
Applied force | |
Direction of the applied force | |
Interval between force applications | |
Motor noise | |
Motor offset | |
Gyro noise | |
Gyro offset | |
Accelerometer noise (x and y axis) | |
Accelerometer noise (z axis) | |
Accelerometer offset |
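The table lists which quantities are randomized, but the sampled ranges were lost in extraction. As an illustration of the general technique, domain randomization of this kind typically draws fresh constant offsets each episode, adds white noise to every sensor reading, and applies pushes of random magnitude and direction at random intervals. All ranges and helper names below are placeholders.

```python
import numpy as np

rng = np.random.default_rng()

# Placeholder ranges, NOT the paper's values (those are missing above).
FORCE_MAG = (0.0, 5.0)        # applied force [N]
PUSH_INTERVAL = (2.0, 5.0)    # interval between force applications [s]
MOTOR_NOISE_STD = 0.01        # motor noise [rad]
GYRO_NOISE_STD = 0.005        # gyro noise [rad/s]
ACC_NOISE_XY = 0.05           # accelerometer noise, x/y axes [m/s^2]
ACC_NOISE_Z = 0.10            # accelerometer noise, z axis [m/s^2]
OFFSET_RANGE = (-0.01, 0.01)  # per-episode constant offsets


def sample_episode_offsets():
    """Draw the constant motor/sensor biases held fixed for one episode."""
    return {
        "motor": rng.uniform(*OFFSET_RANGE),
        "gyro": rng.uniform(*OFFSET_RANGE, size=3),
        "acc": rng.uniform(*OFFSET_RANGE, size=3),
    }


def corrupt_sensors(gyro, acc, offsets):
    """Add white noise plus the episode's constant offsets to raw readings."""
    gyro = gyro + offsets["gyro"] + rng.normal(0.0, GYRO_NOISE_STD, size=3)
    acc_std = np.array([ACC_NOISE_XY, ACC_NOISE_XY, ACC_NOISE_Z])
    acc = acc + offsets["acc"] + rng.normal(0.0, acc_std)
    return gyro, acc


def sample_push():
    """Random horizontal push applied at random intervals during training."""
    magnitude = rng.uniform(*FORCE_MAG)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return magnitude * np.array([np.cos(angle), np.sin(angle), 0.0])
```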
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Itahashi, N.; Itoh, H.; Fukumoto, H.; Wakuya, H. Reinforcement Learning of Bipedal Walking Using a Simple Reference Motion. Appl. Sci. 2024, 14, 1803. https://doi.org/10.3390/app14051803