In this paper, obstacles are treated as reinforcement learning agents that learn, from data gathered by interacting with the UAV in the environment, to control the size of their artificial potential fields; the obstacles then collaborate to generate a superimposed potential field that optimizes the UAV's route. Combining deep reinforcement learning with the artificial potential field algorithm, this paper proposes SAPF. In SAPF, the repulsive potential field changes with the position of the UAV, and the local minimum point moves along with it. During training, a penalty mechanism is triggered whenever the UAV falls into a local minimum, which effectively avoids the local minimum problem of the traditional potential field algorithm.
4.2. Decision-Making Process Design
(1) State Space
In the framework of reinforcement learning applied to UAV path planning, the state space forms the basis of the model’s policy decisions. In order to fully describe the environmental information of the UAV, the state space designed in this paper is given as follows:
3D coordinates of the UAV: the state variables $(x_u, y_u, z_u)$ represent the 3D position of the UAV at the current moment. These coordinates determine the exact position of the UAV in space and are the basic information for path planning.

Distance between the UAV and the nearest obstacle: the state variable $d_{obs}$ represents the distance between the UAV and the nearest obstacle and is calculated as follows:

$$d_{obs} = \sqrt{(x_u - x_o)^2 + (y_u - y_o)^2 + (z_u - z_o)^2}$$

Here, $(x_o, y_o, z_o)$ are the 3D coordinates of the nearest obstacle. This variable is used to evaluate the risk of collision between the UAV and the obstacle and is a key input for the obstacle avoidance decision.

Distance between the UAV and the target point: the state variable $d_{goal}$ represents the distance between the UAV and the target point, calculated as follows:

$$d_{goal} = \sqrt{\Delta x_g^2 + \Delta y_g^2 + \Delta z_g^2}$$

Here, $(\Delta x_g, \Delta y_g, \Delta z_g)$ is the relative position of the target point with respect to the UAV. This variable is used to evaluate the proximity of the UAV to the target point and is an important basis for path optimization.

With the above state variables, the state space $s$ can be represented as follows:

$$s = \left[ x_u, \; y_u, \; z_u, \; d_{obs}, \; d_{goal} \right]$$

The state $s$ thus combines the UAV's position in 3D space, its position relative to the nearest obstacle, and its position relative to the target point, giving the policy a comprehensive view of its inputs. Since these components correspond one-to-one with the UAV's position in space, no duplicate states arise; thus, the selection of $s$ is reasonable.
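To make the state construction concrete, the following Python sketch assembles the state vector from a UAV position, a set of obstacle core points, and a target position. The names (build_state, uav_pos, obstacle_points, target_pos) and the exact ordering of the components are illustrative assumptions consistent with the description above, not the paper's implementation.

```python
import numpy as np

def build_state(uav_pos, obstacle_points, target_pos):
    """Assemble the state s = [x_u, y_u, z_u, d_obs, d_goal].

    uav_pos, target_pos: length-3 arrays (x, y, z).
    obstacle_points: (N, 3) array of obstacle core points.
    Layout and names are illustrative; the paper's code may differ.
    """
    uav_pos = np.asarray(uav_pos, dtype=float)
    target_pos = np.asarray(target_pos, dtype=float)
    obstacle_points = np.asarray(obstacle_points, dtype=float)

    # Distance to the nearest obstacle core point.
    d_obs = np.min(np.linalg.norm(obstacle_points - uav_pos, axis=1))

    # Distance to the target, computed from the relative position.
    d_goal = np.linalg.norm(target_pos - uav_pos)

    return np.concatenate([uav_pos, [d_obs, d_goal]])

# Example usage with arbitrary coordinates.
s = build_state([0.0, 0.0, 10.0],
                [[5.0, 2.0, 10.0], [20.0, 0.0, 12.0]],
                [30.0, 5.0, 15.0])
print(s)  # -> [ 0.  0. 10.  d_obs  d_goal ]
```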
(2) Action Space
The action space is the set of all possible flight actions. At each step, the UAV selects an action from this set to interact with the environment (i.e., the flight space) in order to achieve its path planning goal. The design of the action space should be based on the motion characteristics of the UAV in 3D space, ensuring smooth and dynamically feasible motion while keeping the design simple and efficient to meet real-time requirements. In this paper, the action space is defined as the displacement increment of the UAV in 3D space, denoted as follows:

$$a = (\Delta x, \; \Delta y, \; \Delta z)$$

where each component $\Delta x$, $\Delta y$, and $\Delta z$ is in the range $[-d_{max}, d_{max}]$, and $d_{max}$ is the maximum displacement per step.
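As an illustration of how such an action could be applied, the sketch below clips a raw policy output to the $[-d_{max}, d_{max}]$ box and advances the UAV position. The symmetric per-axis clipping and the constant D_MAX are assumptions consistent with the description above, not a reproduction of the paper's controller.

```python
import numpy as np

D_MAX = 1.0  # assumed maximum displacement per step (units arbitrary)

def apply_action(uav_pos, raw_action, d_max=D_MAX):
    """Clip the displacement increment to [-d_max, d_max] per axis
    and return the next UAV position."""
    delta = np.clip(np.asarray(raw_action, dtype=float), -d_max, d_max)
    return np.asarray(uav_pos, dtype=float) + delta

next_pos = apply_action([0.0, 0.0, 10.0], [0.4, -1.7, 0.2])
print(next_pos)  # the y-component is clipped to -1.0
```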
(3) Reward Mechanism
The reward mechanism has a decisive impact on the learning effectiveness of a reinforcement learning algorithm, and its design should be closely centered on the core features of the problem to be solved. To ensure the stability and efficiency of the learning process, each reward term needs to stay within a consistent numerical range, and the variance of the reward should be minimized to avoid drastic fluctuations in the cumulative reward. A high-performance trajectory is usually required to be short, to avoid intersecting obstacles, and to maintain a certain degree of smoothness. However, given that the underlying APF algorithm already ensures smoothness to a certain extent, and that the smoothness requirement is usually satisfied when the trajectory is short, the reward mechanism in this study was not designed with a special smoothness constraint. Therefore, the following reward and penalty mechanisms were designed in this paper:
If the next track point $p_{t+1}$ intersects an obstacle, a penalty $r_1$ is applied. In this penalty, the intersection point of the trajectory and the obstacle is denoted as $p_c$. For a spherical obstacle, the core point is the center of the sphere; for cylindrical and conical obstacles, the core point is the center of the horizontal cross-section circle. Likewise, the radius parameter $R$ of an obstacle is the actual radius for a sphere, while for cylinders and cones the radius of the horizontal cross-section circle is used. In addition, $\lambda$ is a constant in the penalty term.
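The exact penalty formula is given by the paper's equation; the sketch below only illustrates one plausible realization: a segment-versus-sphere intersection test against an obstacle's core point and radius, followed by a fixed negative penalty. Approximating cylinders and cones by their cross-section circle treated as a sphere, and the constant PENALTY with its fixed-magnitude form, are assumptions, not the paper's definition.

```python
import numpy as np

PENALTY = -10.0  # assumed constant penalty magnitude

def segment_hits_obstacle(p0, p1, core, radius):
    """Check whether the straight track segment p0 -> p1 comes within
    `radius` of the obstacle core point (sphere approximation)."""
    p0, p1, core = (np.asarray(v, dtype=float) for v in (p0, p1, core))
    d = p1 - p0
    seg_len2 = float(d @ d)
    if seg_len2 == 0.0:
        return np.linalg.norm(p0 - core) <= radius
    # Closest point of the segment to the core point.
    t = np.clip((core - p0) @ d / seg_len2, 0.0, 1.0)
    closest = p0 + t * d
    return np.linalg.norm(closest - core) <= radius

def collision_penalty(p0, p1, obstacles):
    """Return the assumed fixed penalty if the segment hits any obstacle.

    obstacles: iterable of (core_point, radius) pairs."""
    for core, radius in obstacles:
        if segment_hits_obstacle(p0, p1, core, radius):
            return PENALTY
    return 0.0
```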
If the next track point $p_{t+1}$ has no intersection with an obstacle, the reward $r_2$ is designed to encourage shortening the route. In this reward, $d_g$ denotes the distance between the next track point and the target point; $d_s$ denotes the distance from the starting point to the target point; and $\varepsilon$ is a constant used to determine whether the track point has arrived near the target point. The overall reward function is then given as follows:

$$r = \omega_1 r_1 + \omega_2 r_2$$

Here, $\omega_1$ and $\omega_2$ are weighting factors.
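The precise form of the distance-shaping term is defined by the paper's equations; as a hedged illustration only, the sketch below uses a normalized progress term $(d_s - d_g)/d_s$, an arrival bonus when $d_g < \varepsilon$, and a weighted sum of the penalty and shaping terms. All constants and the specific functional forms here are assumptions, not the paper's exact formula.

```python
def shaped_reward(d_g, d_s, collided,
                  eps=1.0, w1=1.0, w2=1.0,
                  penalty=-10.0, arrival_bonus=10.0):
    """One plausible realization of the reward described above.

    d_g: distance from the next track point to the target.
    d_s: distance from the starting point to the target.
    collided: whether the next track segment intersects an obstacle.
    The normalized progress term, the arrival bonus, and all constants
    are illustrative assumptions.
    """
    r1 = penalty if collided else 0.0
    r2 = 0.0
    if not collided:
        r2 = (d_s - d_g) / d_s  # closer to the target -> larger reward
        if d_g < eps:           # considered "arrived" near the target
            r2 += arrival_bonus
    return w1 * r1 + w2 * r2

print(shaped_reward(d_g=2.0, d_s=40.0, collided=False))
```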