The main aim of using DRL for solving guidance and control problems is to attain the desired motion state of the flying object by learning the guidance commands through trial and error. There are two main approaches to the design of guidance and control systems: one is to design the guidance loop and the control loop separately and independently, and the other is to integrate the guidance and control loops into an overall control system. Therefore, this study applies the DRL method to design two distinct guidance and control strategies with miss-distance and impact-angle constraints.
3.1. DRL Model
The flight states of a flying object at two adjacent moments can be approximately regarded as a state transition under a given control command, and the flying object’s state at the next moment depends only on its state at the current moment; thus, the flight state of the flying object has the Markov property. By discretizing the flight process in the time domain, the guidance process of a flying object can be approximately modeled as a discrete-time Markov chain (DTMC). Designing the guidance process amounts to adding a decision-making command to this Markov chain; therefore, the guidance design process can be modeled as a Markov decision process (MDP).
The MDP is a sequential decision process that can be described by a five-tuple denoted by $(S, A, P, R, \gamma)$. The specific model of the MDP for the guidance and control problem is as follows [13]:
The interaction process between the agent and the environment in the MDP is depicted in Figure 5. The environment produces information describing the state $s_t$, and the agent interacts with the environment by observing the state $s_t$ and choosing an action $a_t$. The environment accepts the action $a_t$ and transitions to the next state $s_{t+1}$, then returns the next state $s_{t+1}$ and a reward $r_t$ to the agent. In this study, the agent cannot directly access the transition function $P$ or the reward function $R$; it can obtain only specific information about its state $s_t$, action $a_t$, and reward $r_t$ by interacting with the surrounding environment.
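As a concrete illustration of this interaction loop, the following Python sketch collects one rollout; the `env` and `policy` objects and their method names (`reset`, `step`, `sample_action`) are illustrative assumptions about a gym-style interface, not the paper's implementation.

```python
# Minimal sketch of the agent-environment interaction loop described above.
# The environment and policy interfaces are assumed, not taken from the paper.
def rollout(env, policy, max_steps=1000):
    trajectory = []
    s = env.reset()                      # initial state s_0
    for t in range(max_steps):
        a = policy.sample_action(s)      # a_t sampled from pi(.|s_t)
        s_next, r, done = env.step(a)    # environment applies a_t, returns s_{t+1}, r_t
        trajectory.append((s, a, r))     # the agent only observes (s_t, a_t, r_t)
        s = s_next
        if done:
            break
    return trajectory
```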
The guidance control problem is a continuous control problem. In each round, the agent observes its state at time $t$ and decides on an action to be taken according to the current policy. The policy $\pi$ defines the mapping from a particular agent state to the corresponding action ($a_t \sim \pi(\cdot \mid s_t)$). The guidance control model determines the next state under the performed action and obtains a reward from the environment. The trajectory generated by the control loop from the initial state $s_0$ to the final state $s_T$ is expressed as follows: $\tau = \{s_0, a_0, s_1, a_1, \cdots, s_T\}$.
According to the optimization objective, the return is defined as a weighted sum of all rewards on trajectory $\tau$:

$$R(\tau) = \sum_{t=0}^{T} \gamma^{t} r_t$$

where $\gamma \in [0, 1]$ represents the discount factor and $T$ represents the number of states in a scene data trajectory.

The objective function $J(\pi)$ is defined as the expectation of the trajectory return, and it is expressed as follows:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau)\right]$$
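For clarity, the discounted return and a Monte Carlo estimate of the objective can be computed as in the short sketch below; the discount factor value is a placeholder.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # R(tau): discounted sum of the rewards collected along one trajectory
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def estimate_objective(reward_trajectories, gamma=0.99):
    # J(pi): expectation of the trajectory return, approximated by an average
    return np.mean([discounted_return(r, gamma) for r in reward_trajectories])
```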
Before implementing the DRL algorithm for model training, it is necessary to define the amount of information required for the interaction between an agent and the environment; namely, it is necessary to define the state $s_t$, action $a_t$, and reward $r_t$.
In the guidance and control problems with impact-angle constraints, the state vector must fully describe the flying object’s motion and the flying object–target geometry while accounting for the guidance constraints. Therefore, the state vector is defined as follows:

where the angle-error component represents the error between the actual and expected values of the impact angle.

In this study, one of the state components is designed as an angle error instead of a specified impact angle. The idea behind this modification is to encourage the RL agent to focus on the angle error rather than on the precise impact angle. In this way, the guidance model becomes suitable for a wider range of impact-angle requirements.
Standardizing the state vector is necessary to eliminate the impact of dimensionality on neural network training. In this study, the mean and standard deviation of each state component are used to standardize the state vector. The state vector normalization is performed using the following formula:

$$\hat{s}_i = \frac{s_i - \mu_i}{\sigma_i}$$

where $s_i$ represents a component of the state vector before normalization; $\hat{s}_i$ represents the corresponding component after normalization; and $\mu_i$ and $\sigma_i$ are the mean and standard deviation of $s_i$, respectively. The $\mu_i$ and $\sigma_i$ values are updated during training through the single-step updating of the sampled data. Initially, their values may deviate significantly, but as more samples are collected, they become increasingly close to their true values.
The values of $\mu$ and $\sigma$ are updated as follows:

where $\mu_{n-1}$ and $\sigma_{n-1}$ represent the expected value and standard deviation of the state variables before the update, respectively; $n$ is the current number of sampling steps; and $s_n$ represents the state variables at the $n$th step of sampling.
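A minimal sketch of such a single-step running update is given below, using a Welford-style recursion; the paper's exact update formula may differ in detail, so this should be read as one common way to realize the described scheme.

```python
import numpy as np

# Sketch of a per-component running mean / standard deviation used for
# state normalization, updated one sample at a time (Welford-style).
class RunningNormalizer:
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)           # running sum of squared deviations

    def update(self, s):                  # s: state sampled at step n
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n       # incremental mean update
        self.m2 += delta * (s - self.mean)

    def normalize(self, s, eps=1e-8):
        std = np.sqrt(self.m2 / max(self.n, 1)) + eps
        return (s - self.mean) / std
```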
In the DLGCIAC-DRL strategy, the agent obtains an acceleration command $a_c$ by sampling the probability distribution selected by the policy $\pi$. As a result, the action $a_t$ is configured as the acceleration command $a_c$. Therefore, to ensure safety, it is imperative to constrain the acceleration command as follows:

$$\left| a_c \right| \leq a_{c,\max}$$

where $a_{c,\max}$ represents the upper limit of the acceleration command.
Similarly, in the IGCIAC-DRL strategy, the action can be designed as the elevator deflection angle command $\delta_z$, constrained as follows:

$$\left| \delta_z \right| \leq \delta_{z,\max}$$

where $\delta_{z,\max}$ represents the upper limit of the elevator deflection angle command.
In this study, the policy $\pi$ adopts the Beta distribution and normalizes actions to the range of [0, 1] during the sampling process. Therefore, it is necessary to reverse the normalization of the action values during the interaction between the agent and its environment. The reverse normalization formula for the action values in the DLGCIAC-DRL strategy is given by:

$$a_c = \left(2x - 1\right) a_{c,\max}$$

where $x \in [0, 1]$ is the sampled value. Similarly, the reverse normalization formula for the action values in the IGCIAC-DRL strategy is given by:

$$\delta_z = \left(2x - 1\right) \delta_{z,\max}$$
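The mapping from a Beta-distributed sample to a bounded command can be sketched as follows; the linear form and the limit arguments are illustrative assumptions consistent with the symmetric bounds above.

```python
def denormalize_acceleration(x, a_max):
    # DLGCIAC-DRL: map a Beta sample x in [0, 1] to [-a_max, a_max]
    return (2.0 * x - 1.0) * a_max

def denormalize_deflection(x, delta_max):
    # IGCIAC-DRL: map a Beta sample x in [0, 1] to [-delta_max, delta_max]
    return (2.0 * x - 1.0) * delta_max
```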
The reward signal represents an objective that the agent needs to maximize. Therefore, using a reasonable reward mechanism is crucial to achieving optimal training effects, as this mechanism directly affects the convergence rate and even the feasibility of the RL algorithms. In addition to constraining the impact angle, the guidance process also needs to constrain the off-target deviation. This study designs the reward function as a function of the state components as follows:

where $x$ represents the value of the current state component, $x^{*}$ is the target value of the state component of the intelligent agent, and $k$ is the scaling coefficient of the intelligent agent.
To describe the advantages of this reward function design in more detail, a specific example is given below. For instance, for a scaling coefficient value of 0.02, the reward function is shown in Figure 6.
As shown in Figure 6, the closer the state component is to the target value, the greater the reward value. Moreover, the reward value changes sharply near the target value, which ensures that the agent obtains a large reward difference even when making decisions near the target value, making it easier for the network to converge to the optimal solution. In addition, the proposed reward function normalizes the reward values to the range of [−1, 1], which helps reduce the impact of differently scaled reward values on the overall reward and alleviates the problem of sparse rewards. The total reward value is a weighted sum of the rewards of all relevant state components, and it is defined by:

$$r = \sum_{i} \omega_i r_i$$

where $r_i$ is the reward associated with the $i$th state component, $\omega_i$ represents the weight coefficient indicating the significance level of the $i$th state component to the intelligent agent, and it satisfies the condition $\sum_{i} \omega_i = 1$.
In practical engineering, the main focus is on the impact-angle error and the distance $R$ between the flying object and the target, so the reward value can be calculated by:

Further, to increase the convergence speed of the training process, sparse rewards are introduced in addition to the aforementioned reward. Therefore, the final form of the reward can be obtained as follows:
According to Equation (16), there is an additional reward of $20 - R$ when $R$ is less than 20 m, an additional penalty of −50 when the altitude $H$ exceeds 6000 m, and an additional penalty of −20 when $H$ equals 0 while the flying object is still far from the target, i.e., when $R$ is greater than 100 m. These extra rewards and penalties contribute to improving the training efficiency. The values of the relevant parameters of the reward function are listed in Table 1.
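The sparse terms described for Equation (16) can be sketched as follows; the thresholds (20 m, 6000 m, 100 m) follow the text, while the function signature, the interpretation of $H$ as altitude, and the way the bonus is added to the shaped reward are illustrative assumptions.

```python
def sparse_bonus(R, H):
    # Extra rewards / penalties described in the text for Equation (16).
    # R: flying-object-to-target distance [m]; H: altitude [m] (assumed meaning).
    bonus = 0.0
    if R < 20.0:
        bonus += 20.0 - R        # near-miss bonus
    if H > 6000.0:
        bonus += -50.0           # altitude-limit penalty
    if H <= 0.0 and R > 100.0:
        bonus += -20.0           # ground impact while still far from the target
    return bonus
```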
3.2. PPO Algorithm and Enhancement
The PPO algorithm is an on-policy RL method that uses an actor–critic framework. The PPO employs a stochastic policy, denoted by $\pi_{\theta}(a_t \mid s_t)$, in which each action is considered a random variable that follows a predefined probability distribution; to implement the policy, the probability distribution must be parameterized. The PPO uses the output of the actor network as the distribution parameters and selects the corresponding actions by sampling from the parameterized distribution [22].
The PPO typically uses a Gaussian distribution as the probability distribution for policy sampling. However, because the Gaussian distribution has an infinite range, sampled actions often have to be restricted to the acceptable action boundaries, which can affect the algorithm’s efficiency. In view of that, this study adopts an alternative bounded probability distribution, the Beta distribution, to address this limitation. The Beta distribution is a probability distribution determined by two parameters $a$ and $b$, and its domain is [0, 1]. A sample drawn from the Beta distribution therefore always lies in the interval [0, 1], so the sampled value can be mapped to any desired action interval. The Beta distribution can be expressed as follows:

$$f(x; a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} \left(1 - x\right)^{b-1}, \quad x \in [0, 1]$$

where $\Gamma(\cdot)$ is the Gamma function.
The probability density function (PDF) and the cumulative distribution function (CDF) of the Beta distribution are presented in Figure 7.
As shown in Figure 7, using the Beta distribution for sampling restricts actions to the interval [0, 1], so the sampled values can be mapped to any desired action range, which eliminates the detrimental effect of an unbounded distribution on a bounded action space.
The PPO algorithm stores the trajectories obtained after sampling in a replay buffer. The policy used in the sampling process is referred to as the old policy, denoted by $\pi_{\theta_{old}}$, and it differs from the updated policy. To obtain the total environmental rewards at each state, the expected return is typically expressed in the form of value functions. Particularly, the state value function $V^{\pi}(s_t)$ and the action value function $Q^{\pi}(s_t, a_t)$ are defined as follows:

$$V^{\pi}(s_t) = \mathbb{E}_{\tau \sim \pi}\!\left[\left.\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\,\right|\, s_t \right]$$

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\tau \sim \pi}\!\left[\left.\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}\,\right|\, s_t, a_t \right]$$
The value of $V^{\pi}(s_t)$ is strongly related to the value of $Q^{\pi}(s_t, a_t)$, which represents the anticipated $Q$-value of a feasible action $a_t$ for a policy $\pi$ and a particular state $s_t$, and it is calculated by:

$$V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)}\!\left[\, Q^{\pi}(s_t, a_t)\, \right]$$
The critic network of the PPO learns the mapping between a state $s_t$ and the value function $V^{\pi}(s_t)$. Namely, the critic network takes a state $s_t$ as input and generates the value function $V^{\pi}(s_t)$ as output. The PPO employs the advantage function $A^{\pi}(s_t, a_t)$ as a reinforcement signal, which measures the superiority of a specific action in a particular state compared to the policy’s average behavior. The advantage function is defined by:

$$A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$$
This paper applies the generalized advantage estimation (GAE) method to estimate the advantage function and reduce the variance of the estimates while minimizing bias. Specifically, GAE calculates the exponentially weighted average of all $n$-step return advantage functions:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} \left(\gamma \lambda\right)^{l} \delta_{t+l}$$

where $\lambda$ represents the GAE parameter, and the temporal-difference residual $\delta_t$ can be obtained by computing the advantage function using the old critic network as follows:

$$\delta_t = r_t + \gamma V_{\phi_{old}}(s_{t+1}) - V_{\phi_{old}}(s_t)$$
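A standard implementation of this GAE recursion is sketched below; terminal-state handling is omitted, and the parameter values are placeholders rather than the paper's settings.

```python
import numpy as np

# Generalized advantage estimation over one rollout.
# rewards[t] = r_t, values[t] = V_old(s_t), last_value = V_old(s_T).
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    advantages = np.zeros(len(rewards))
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual delta_t
        gae = delta + gamma * lam * gae                        # exponentially weighted sum
        advantages[t] = gae
        next_value = values[t]
    return advantages
```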
During the training process, the critic network’s parameters $\phi$ are adjusted to ensure a close approximation of the output $V_{\phi}(s_t)$ to the desired target value $V_t^{target}$. Consequently, the loss function of the critic network can be formulated as follows:

$$L(\phi) = \frac{1}{M} \sum_{t=1}^{M} \left( V_{\phi}(s_t) - V_t^{target} \right)^{2}$$

where $M$ represents the size of a single batch update.
Further, the critic network’s parameters are updated as follows:

$$\phi \leftarrow \phi - \alpha_{c}\, \nabla_{\phi} L(\phi)$$

where $\alpha_{c}$ represents the learning rate of the critic network.
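One critic update step can be sketched as follows; the value target used here (GAE advantage plus old value estimate) is a common choice and an assumption, since the paper's exact target is not reproduced above, and `critic` and the tensors are assumed to come from the surrounding training loop.

```python
import torch
import torch.nn.functional as F

# Single critic update: regress V_phi(s_t) toward the precomputed target values.
def update_critic(critic, optimizer, states, value_targets):
    values = critic(states).squeeze(-1)          # V_phi(s_t)
    loss = F.mse_loss(values, value_targets)     # (1/M) * sum (V_phi - V_target)^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # gradient step with learning rate alpha_c
    return loss.item()
```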
The actor network of the PPO algorithm takes the state $s_t$ as input and determines the values of the parameters $a$ and $b$ of the Beta distribution used by the policy $\pi_{\theta}$. The PPO algorithm combines the benefits of the policy gradient (PG) and trust region policy optimization (TRPO) methods. By leveraging the concept of importance sampling, the TRPO treats every policy update step as an optimization problem. The TRPO introduces a modified surrogate objective function (SOF), which is expressed as follows:

$$L^{TRPO}(\theta) = \mathbb{E}_t\!\left[\frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] - \beta\, \mathbb{E}_t\!\left[ D_{KL}\!\left( \pi_{\theta_{old}}(\cdot \mid s_t)\, \big\| \, \pi_{\theta}(\cdot \mid s_t) \right) \right]$$

where $\beta$ is a constant, and $D_{KL}$ is the Kullback–Leibler (KL) divergence between the old and new policies, which measures the difference between their probability distributions.
Therefore, at each policy update, the SOF can be optimized using samples collected by the old policy and the advantage function estimated by the old critic network.
The TRPO uses the constant $\beta$ to constrain the magnitude of policy updates, but computing the KL divergence can be relatively complex in practice and, thus, difficult to implement. In contrast, the PPO directly modifies the SOF via a clipping function to restrict the magnitude of policy updates, which simplifies the computation. To encourage exploration over a variety of actions, in this study, the PPO objective function includes an entropy regularization term, and it can be expressed as follows:

$$L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_t \right)\right] + c\, \mathbb{E}_t\!\left[ S\!\left[\pi_{\theta}\right]\!(s_t) \right]$$

where $\varepsilon$ is the clipping parameter used to constrain the policy update; $c$ is the entropy regularization factor; $S[\pi_{\theta}](s_t)$ is the policy entropy; $\mathrm{clip}(\cdot)$ is the trimming function; and $r_t(\theta)$ is the ratio function, defined as follows:

$$r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$$

Further, $\mathrm{clip}(\cdot)$ is defined as follows:

$$\mathrm{clip}\!\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) = \begin{cases} 1-\varepsilon, & r_t(\theta) < 1-\varepsilon \\ r_t(\theta), & 1-\varepsilon \le r_t(\theta) \le 1+\varepsilon \\ 1+\varepsilon, & r_t(\theta) > 1+\varepsilon \end{cases}$$
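The clipped surrogate objective with the entropy bonus can be sketched as follows for a Beta policy; `eps` and `ent_coef` are illustrative hyperparameter values, and the tensors are assumed to come from the rollout and the old policy.

```python
import torch

# Clipped PPO actor loss with entropy regularization (negated for minimization).
def ppo_actor_loss(dist, actions, old_log_probs, advantages, eps=0.2, ent_coef=0.01):
    log_probs = dist.log_prob(actions).sum(-1)            # log pi_theta(a_t | s_t)
    ratio = torch.exp(log_probs - old_log_probs)          # r_t(theta)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    entropy = dist.entropy().sum(-1).mean()               # encourages exploration
    # negative sign: optimizers minimize, while the objective is maximized
    return -(torch.min(surr1, surr2).mean() + ent_coef * entropy)
```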
The relationship between the objective function and the ratio function $r_t(\theta)$ is presented in Figure 8, where it can be seen that increasing the probability of the corresponding action $\pi_{\theta}(a_t \mid s_t)$ increases the objective function’s value when $\hat{A}_t > 0$. Nevertheless, if $r_t(\theta)$ surpasses $1+\varepsilon$, the objective function is truncated and calculated as $(1+\varepsilon)\hat{A}_t$, which prevents $\pi_{\theta}$ from deviating excessively from $\pi_{\theta_{old}}$. Analogously, when $\hat{A}_t < 0$, a similar trend of the objective function can be observed.
The updating formula for the parameters $\theta$ of the actor network is as follows:

$$\theta \leftarrow \theta + \alpha_{a}\, \nabla_{\theta} L^{CLIP}(\theta)$$

where $\alpha_{a}$ represents the learning rate of the actor network.
To ensure training stability in the later stages, this study decreases the learning rate during the training process. The learning rate updating method is defined as follows:

where $\kappa$ represents the decay factor of the learning rate, $n_{\max}$ is the total number of training iterations, and $n$ denotes the current number of training iterations.
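Since the exact decay law is not reproduced here, the sketch below shows one possible schedule, a simple linear decay toward a fixed fraction of the initial learning rate, as an assumption rather than the paper's formula.

```python
# Illustrative learning-rate decay schedule (linear decay toward a floor).
def decayed_lr(lr_init, iteration, total_iterations, decay_factor=0.1):
    frac = iteration / float(total_iterations)
    # lr goes from lr_init at iteration 0 to decay_factor * lr_init at the end
    return lr_init * (1.0 - (1.0 - decay_factor) * frac)
```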
3.3. Network Structure and Learning Process
In this study, both the actor and critic networks of the PPO algorithm have a five-layer structure consisting of an input layer, three hidden layers, and an output layer. The specific parameters of the network structure were determined through repeated trials, as shown in Table 2. The numbers in the table represent the numbers of neurons in the layers.
In every network layer except the output layer, the hyperbolic tangent (Tanh) activation function is used to activate the layer and prevent input saturation; it is defined as follows:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Since the parameters of the Beta distribution are required to be larger than one, the Softplus function is used in the output layer of the actor network. Namely, adding one to the output of the Softplus function guarantees that the Beta distribution parameters meet this requirement, as the Softplus output is always larger than zero. The Softplus function is defined as follows:

$$\mathrm{Softplus}(x) = \ln\!\left(1 + e^{x}\right)$$
The output layer of the critic network does not have an activation function. The specific structures of the actor and critic networks are shown in Figure 9.
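An illustrative PyTorch version of this structure is given below; the hidden-layer sizes are placeholders for the values listed in Table 2, and the two-headed Softplus(+1) actor output is one way to realize the described Beta parameterization, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn as nn

# Actor: three Tanh hidden layers; output heads produce Beta parameters a, b > 1.
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.a_head = nn.Sequential(nn.Linear(hidden, action_dim), nn.Softplus())
        self.b_head = nn.Sequential(nn.Linear(hidden, action_dim), nn.Softplus())

    def forward(self, s):
        h = self.body(s)
        a = self.a_head(h) + 1.0          # Softplus output + 1 keeps a > 1
        b = self.b_head(h) + 1.0          # Softplus output + 1 keeps b > 1
        return torch.distributions.Beta(a, b)

# Critic: same hidden structure, linear output for V(s_t).
class Critic(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),         # no activation on the critic output
        )

    def forward(self, s):
        return self.net(s)
```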
As mentioned before, RL is a machine learning technique in which an agent interacts with the surrounding environment to maximize its reward by continuously learning and optimizing its policy. During the agent–environment interaction, RL updates the agent state $s_t$ based on the motion equations and obtains an action $a_t$ by sampling the probability distribution of the policy. Then, it calculates the real-time reward based on the state components. In a single learning process, first, the RL-based algorithm interacts $N$ times with the environment using the old policy to generate a trajectory sequence denoted by $\tau$, where $N$ is the length of the replay buffer used in the PPO algorithm. Next, it stores the generated trajectory in the replay buffer. To improve the training efficiency, this study performs batch training on the acquired data by sequentially processing $M$-length segments of the trajectory in the replay buffer.
In the actor network updating process, first, the advantage values are estimated using Equation (23). Then, the action probability is calculated for the already taken actions using the probability density function of the Beta distribution obtained from the old policy $\pi_{\theta_{old}}$. Next, the action probability is calculated based on the updated policy $\pi_{\theta}$. Then, the ratio of the action probabilities is obtained to compute the objective function using Equation (28). Finally, the Adam optimizer is employed to update the actor network parameters and maximize the objective function value.
In the critic network updating process, the state value function is calculated with the updated critic network, and the loss is computed using Equation (25). Then, the Adam optimizer is employed to update the critic network parameters and minimize the loss. During a single learning process, the network parameters are updated five times based on the data stored in the replay buffer.
The PPO is a typical on-policy algorithm in which the trajectory sequence $\tau$ generated from the agent–environment interaction cannot be reused. After each learning iteration, the replay buffer is cleared, and the learning process is repeated until the training process is completed. The learning process of the proposed guidance and control strategy based on the PPO algorithm is illustrated in Figure 10.