The success of reinforcement learning training is closely tied to the design of rewards and punishments. Therefore, this study proposes three reward design methods that take the drone's position into account, namely the random initialization method, the random center method, and the continuous round method, and details how the boundary conditions and rewards are modified for the experiments. To bridge the gap between the simulated environment and the real world, this research introduces YOLOv7-tiny object detection technology and employs two methods to synchronize the input states of the two environments. This chapter focuses on the reward and punishment design methods, the various input states, and the experiments involving the different reinforcement learning algorithms.
4.2.1. Random Initialization Method (RIM)
The initial condition design for the RIM involves randomly placing the drone at the center of any target at the start of each training episode, facing the next target. The drone then begins its exploration training. The termination conditions are set as follows: the episode ends when the drone passes through the next target, collides with environmental obstacles, or incurs excessive time penalties. To determine whether the drone has crossed the target, this study records the exact center coordinates of all the targets and sets a spherical region as the basis for determining whether the drone has passed through a target. A conceptual design diagram is shown in Figure 10, where the gray area represents the spherical region used for this judgment.
As shown in Figure 10, during each training episode, the environment continuously records the drone's current coordinates and calculates the absolute distance between the drone and the center of the target, as given in Equation (16). When this distance is less than 40 cm, the environment grants a reward of +100 and ends the training episode. By subtracting the current distance from the distance at the previous time step, the difference d is obtained, as shown in Equation (17). This difference indicates whether the drone's current action is moving it closer to or farther from the target. The environment uses d as the basis for reward or penalty, enabling the drone to learn to fly toward the center of the target by observing the score change at each step.
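The distance-based shaping described above can be summarized in a short sketch. This is illustrative only: the function signature, coordinate convention, and the use of the raw difference d as the per-step reward are assumptions, since the extract does not show the exact forms of Equations (16) and (17).

```python
import numpy as np

def rim_step_reward(drone_pos, prev_distance, target_center, pass_radius=40.0):
    """Distance-based shaping for the RIM (illustrative sketch).

    drone_pos, target_center : 3D coordinates (units assumed to match the 40 cm threshold).
    prev_distance            : distance to the target center at the previous time step.
    pass_radius              : 40 cm sphere used to decide that the target was passed.
    """
    # Equation (16): absolute distance between the drone and the target center.
    distance = np.linalg.norm(np.asarray(target_center) - np.asarray(drone_pos))

    # Passing condition: entering the 40 cm sphere ends the episode with +100.
    if distance < pass_radius:
        return 100.0, distance, True

    # Equation (17): difference d between the previous and current distance.
    # d > 0 means the last action moved the drone closer to the target.
    d = prev_distance - distance
    return d, distance, False
```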
Using a spherical shape as the judgment area introduces a margin of error when determining whether the drone has passed through the target: when the drone crosses near the four corners of the target, it may be falsely judged as not having passed through. To address this issue, the RIM incorporates a time penalty term into the training episode, as shown in Equation (18). As the duration of the episode increases, this penalty term gradually grows. If the drone crosses the target without being recognized as having done so, training continues; once the environment records that the penalty term exceeds 0.2, indicating that the drone has made 40 decisions, the training episode is forcibly terminated to prevent defective training. Additionally, when the drone touches the target or an invisible obstacle, the environment assigns a penalty of −100. The boundary conditions and rewards for the RIM are shown in Table 3.
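The time penalty and forced termination can likewise be sketched as below. The per-step increment of 0.005 is inferred from the description (40 decisions reaching the 0.2 threshold); the exact form of Equation (18) may differ.

```python
def rim_time_penalty(step_count, per_step=0.005, cutoff=0.2):
    """Cumulative time penalty for the RIM (illustrative sketch).

    per_step is chosen so that 40 decisions reach the 0.2 cutoff described
    in the text; the actual Equation (18) may use a different form.
    """
    penalty = per_step * step_count      # Equation (18): grows with episode length
    force_terminate = penalty > cutoff   # roughly 40 decisions -> end the episode
    return penalty, force_terminate
```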
Using the RIM as the reward method, training was conducted with three reinforcement learning algorithms: DQN, A2C, and PPO. The mean episode length and mean reward graphs for the three training results are shown in Figure 11. From the mean episode length results (bottom of Figure 11), it can be observed that DQN, A2C, and PPO converged at 31.0, 33.4, and 32.1, respectively. Compared to A2C, both PPO and DQN exhibited less fluctuation upon convergence, with PPO converging earlier than DQN. Specifically, PPO began showing an upward convergence trend around 20,000 time steps, while DQN did not display a similar trend until approximately 50,000 time steps.
Regarding the mean reward (top of Figure 11), DQN, A2C, and PPO converged at 121.9, 118.1, and 142.8, respectively. The degree of convergence fluctuation, from smallest to largest, was PPO, DQN, and A2C. Although all three methods demonstrated an upward trend in average episode reward, the amplitude of variation exceeded 50, indicating that the rewards obtained by the agent in each episode were unstable. This instability complicates the predictability and control of the model's performance, reducing the reliability of the overall training process.
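For reference, the comparison above corresponds to training runs along the following lines. This is a minimal sketch assuming a Gymnasium-compatible drone environment and the Stable-Baselines3 implementations of DQN, A2C, and PPO; the environment constructor, policy type, and step budget are placeholders rather than values taken from the paper.

```python
# Hypothetical training setup; the environment factory and hyperparameters are
# placeholders, assuming Stable-Baselines3 and a Gymnasium-compatible simulator.
from stable_baselines3 import DQN, A2C, PPO

def train_all(make_env, total_timesteps=100_000):
    results = {}
    for name, algo in (("DQN", DQN), ("A2C", A2C), ("PPO", PPO)):
        env = make_env()                           # custom drone-gate environment (assumed)
        model = algo("CnnPolicy", env, verbose=0)  # image observations -> CNN policy
        model.learn(total_timesteps=total_timesteps)
        model.save(f"rim_{name.lower()}")          # weights later reused for testing
        results[name] = model
    return results
```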
The trained weights of DQN, A2C, and PPO were applied to the testing environment to evaluate their decision-making outcomes. The test results using the RIM are shown in Table 4. From the test results, it can be observed that the target traversal rates for all three reinforcement learning algorithms were below 25%, indicating that they failed to effectively pass through the targets consecutively.
4.2.2. Random Center Method (RCM)
Because the random initialization method tends to cause the UAV to collide with the edges of the target, the reward for passing through the target must be modified by introducing a reward and penalty design related to the target's center.
Thus, the new RCM retains the initial and termination conditions, as well as most of the reward and penalty conditions, from the RIM. The only modification is the reward for passing through the target, which is now based on the drone's position relative to the center of the target: the closer the drone passes to the center, the higher the reward. The specific design process is as follows. When defining the spherical area for each target, in addition to recording the target's center point, its normal vector is also recorded. When the drone touches the spherical area, the environment calculates the vector from the drone to the center of the target by subtracting the drone's coordinates from the target center coordinates, as shown in Equation (19). Then, using the dot product formula, the angle between this vector and the normal vector is calculated, as shown in Equation (20). The closer the drone passes to the center of the target, the smaller this angle; conversely, the closer the drone passes to the edge of the target, the larger the angle. The environment uses this characteristic to assign a reward score ranging from 0 to +100. A conceptual design diagram is shown in Figure 10. The boundary conditions and rewards for the RCM are presented in Table 5.
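A sketch of this center-proximity reward is given below. The linear mapping from the angle to a 0-to-+100 score is an assumption; the extract only states that the reward grows as the crossing point approaches the center (Equations (19) and (20)).

```python
import numpy as np

def rcm_pass_reward(drone_pos, target_center, target_normal, max_angle=np.pi / 2):
    """Center-proximity reward for the RCM (illustrative sketch)."""
    # Equation (19): vector from the drone to the target center.
    v = np.asarray(target_center, dtype=float) - np.asarray(drone_pos, dtype=float)

    # Equation (20): angle between v and the target's normal vector via the dot product.
    n = np.asarray(target_normal, dtype=float)
    cos_theta = np.dot(v, n) / (np.linalg.norm(v) * np.linalg.norm(n))
    # abs() makes the result independent of which way the normal points.
    theta = np.arccos(np.clip(np.abs(cos_theta), 0.0, 1.0))

    # Smaller angle -> crossing closer to the center -> larger reward (0 to +100).
    # The linear scaling below is assumed, not taken from the paper.
    return 100.0 * (1.0 - theta / max_angle)
```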
Using the RCM as the reward method, training was conducted with three reinforcement learning algorithms: DQN, A2C, and PPO. The mean episode length and mean reward graphs for the three training results are shown in Figure 12. From the plot of mean episode length (bottom of Figure 12), it can be observed that DQN converged at 30.6 and PPO at 31.0, while A2C failed to demonstrate effective improvement. PPO achieved convergence earlier than DQN and rose toward its converged value more steeply. In terms of the plot of mean reward (top of Figure 12), DQN converged at 170.5, A2C again showed no effective improvement, and PPO converged earlier than DQN, at 143.7.
The training result graphs indicate that, compared to the RIM, the RCM exhibited smaller fluctuations during training, suggesting that the average steps and rewards per episode became more stable. Although the mean episode length remained essentially unchanged, the mean reward increased by 48.6 for DQN and 0.9 for PPO. This demonstrates that the additional reward mechanism introduced by the RCM effectively enhanced the training performance.
The trained weights of DQN, A2C, and PPO were applied to the testing environment to evaluate their decision-making outcomes. The test results using the RCM are shown in Table 6. From the test results, it can be observed that the weights trained by A2C failed to control the drone to pass through the target, while the target traversal rates of DQN and PPO improved by 1.7% and 5.9%, respectively, compared to the results from the RIM.
4.2.3. Continuous Round Method (CRM)
In deep reinforcement learning algorithms, there must be a balance between the episode length and the reward to ensure effective convergence toward the desired results. The CRM adjusts the termination conditions by limiting each training episode to the space between two targets while training the drone to pass through targets consecutively. The termination conditions for the CRM are the same as those in the RIM, but the initial conditions are adjusted: the drone is no longer regenerated at a random target at the start of every episode; instead, it is regenerated in a random area only if it collides with the environment's boundary or the edge of a target. If the drone successfully passes through a target, the episode ends and restarts, but the drone is not regenerated randomly; it continues its exploration from its current position. A conceptual design diagram is shown in Figure 13. The boundary conditions and rewards for the CRM are presented in Table 7.
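The modified reset behaviour of the CRM can be sketched as follows. The class, method, and attribute names are illustrative assumptions for a Gym-style environment, not the paper's implementation.

```python
import random

class CRMResetMixin:
    """Illustrative sketch of the CRM reset rule (names are assumptions)."""

    def reset_episode(self, last_outcome):
        # Respawn at a random gate only after a crash with the boundary or a gate edge.
        if last_outcome in ("hit_boundary", "hit_gate_edge"):
            self.drone_pose = random.choice(self.gate_centers)
        # After a successful pass the episode restarts, but the drone keeps
        # exploring from its current position toward the next gate.
        elif last_outcome == "passed_gate":
            pass  # keep self.drone_pose unchanged
        self.step_count = 0
        return self.get_observation()
```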
Using the CRM as the reward method, training was conducted with the DQN, A2C, and PPO reinforcement learning algorithms. The mean episode length and mean reward graphs are shown in Figure 14. It is evident that A2C still fails to show effective improvement in either metric. From the mean episode length graph (bottom of Figure 14), DQN converged at 25.8 and PPO at 26.4, with PPO achieving convergence earlier than DQN. Compared to the training results of the RCM, these values decreased by 4.8 and 4.6, respectively, indicating that the CRM can complete the traversal of the target frame with fewer steps, thereby enhancing flight efficiency.
In the mean reward graph (top of Figure 14), it can be observed that PPO began to improve from the early stages of training, ultimately converging at 105, whereas DQN did not start to improve until 52,000 time steps, finally converging at 74.4. This indicates that both PPO and DQN can effectively train the drone to traverse the target frame, but PPO achieves better reward scores with a similar number of steps. Furthermore, from the value changes observed in Figure 14, the fluctuations in the converged values are approximately 10 and 50, respectively, demonstrating better convergence stability compared to the RCM.
The trained weights of DQN, A2C, and PPO were applied to the testing environment to evaluate their decision-making outcomes. The test results using the CRM are shown in Table 8. It can be observed that A2C, due to its inability to improve its scores during training, still failed to effectively pass through the target. In contrast, DQN and PPO both demonstrated improved target traversal rates, with one and five full-field flights, respectively. Therefore, this study selects the CRM as the final reward design method and continues using the stably converging DQN and PPO algorithms for the subsequent experiments involving changes to the state inputs.
4.2.4. Maximum Single Target Method (MSTM)
To apply the simulation results to a real-world environment, it is essential to unify the drone's image reception in both the simulated and physical environments. This study proposes using YOLOv7-tiny combined with TensorRT object detection technology to recognize targets in both environments and then mapping the detection results onto a black screen. This approach unifies the image input for both the simulation and real-world environments. In the simulation training, the target the drone is expected to pass through is always the largest target in the image. Therefore, the MSTM uses object detection to identify the largest target, which is then rendered on a black screen and used as the state input for training. The state design for the MSTM is illustrated in Figure 15.
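The MSTM state construction can be sketched with OpenCV as below. The detection format (a list of (x1, y1, x2, y2) boxes from YOLOv7-tiny), the canvas size, and the fill colour are assumptions; the paper only specifies that the largest target is rendered on a black screen.

```python
import numpy as np
import cv2

def mstm_state(detections, frame_shape=(240, 320)):
    """Render only the largest detected gate on a black canvas (sketch).

    detections: list of (x1, y1, x2, y2) boxes from YOLOv7-tiny (assumed format).
    """
    canvas = np.zeros((*frame_shape, 3), dtype=np.uint8)   # black screen
    if len(detections) == 0:
        return canvas
    # Pick the largest box by area: the gate the drone should pass next.
    x1, y1, x2, y2 = max(detections, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))
    cv2.rectangle(canvas, (int(x1), int(y1)), (int(x2), int(y2)), (255, 255, 255), -1)
    return canvas
```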
Using the CRM as the reward method and the MSTM as the state input, the DQN and PPO reinforcement learning algorithms were trained. The mean episode length and mean reward graphs are shown in Figure 16. In the mean episode length graph, DQN converged at 26.3 and PPO at 26.5, while in the mean reward graph, DQN converged at 39.8 and PPO at 76. Compared to the CRM training results from the simulation phase, these rewards represent decreases of 34.6 and 29, respectively.
From the two training result graphs, it can be observed that PPO began to improve its values early in the training process, whereas DQN did not start to show improvement until 51,000 time steps. Although PPO exhibited a downward trend in mean reward during the later stages of training, when comparing the final convergence scores, PPO still demonstrated superior training performance relative to DQN.
The trained weights of DQN and PPO were applied to the testing environment to evaluate their decision-making outcomes. The test results using the MSTM combined with the CRM are shown in Table 9. Because the state input is less rich than in the previous simulation phase, the target traversal rates of both DQN and PPO decreased compared to the best results obtained with the CRM alone.
4.2.5. Maximum Two Target Method (MTTM)
Since the primary and subsequent targets for the drone are the two largest targets in the frame, this study introduces the MTTM. In this method, the detection results displayed on the black screen are modified to include the two largest targets, with different colors distinguishing their sizes: the largest target is marked in blue, and the second largest in pink. If there is only one target in the frame, it is displayed in blue. This ensures that the input state for training includes information on the next two targets. A conceptual design diagram for the MTTM is shown in Figure 17.
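Extending the previous sketch, the MTTM state might be built as follows. The detection format and canvas size are again assumptions, while the blue/pink colour coding follows the description above.

```python
import numpy as np
import cv2

def mttm_state(detections, frame_shape=(240, 320)):
    """Render the two largest detected gates on a black canvas (sketch)."""
    canvas = np.zeros((*frame_shape, 3), dtype=np.uint8)
    # Sort boxes by area, largest first.
    boxes = sorted(detections, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]), reverse=True)
    # BGR colours: largest gate in blue, second largest in pink (per the text).
    colors = [(255, 0, 0), (203, 192, 255)]
    for box, color in zip(boxes[:2], colors):
        x1, y1, x2, y2 = map(int, box)
        cv2.rectangle(canvas, (x1, y1), (x2, y2), color, -1)
    return canvas
```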
Using the CRM as the reward method and the MTTM as the state input, the DQN and PPO reinforcement learning algorithms were trained. The mean episode length and mean reward graphs are shown in Figure 18. In the mean episode length graph, DQN converged at 26.1 and PPO at 28.1. In the mean reward graph, DQN converged at −19.5, with a fluctuation of over 50 in the later stages of training, indicating instability during DQN training. In contrast, PPO converged at 106.1, an improvement of 30.1 compared to using the MSTM as the state input. Moreover, the difference from the CRM training results in the simulation phase was only 1.1, indicating that PPO, trained with the MTTM as the state input, can achieve performance comparable to that of the CRM during the simulation phase.
Therefore, it can be inferred that PPO is more suitable than DQN for training with the MTTM as the state input and the CRM as the reward mechanism.
The trained weights of DQN and PPO were applied to the testing environment to evaluate their decision-making outcomes. The test results using the MTTM combined with the CRM are shown in Table 10. In the testing results, PPO achieved a target traversal rate of 52.1% and completed two full-field flights, maintaining strong control performance in both the traversal rate and the number of completed circuits. The look-ahead information provided by the two-target state gives the drone more agile motion during flight while keeping the learning dynamics stable throughout the flight mission. Although this performance is not as high as the results from the CRM during the simulation phase, it still represents a success rate exceeding 50%. This indicates that using YOLOv7-tiny object detection to unify the state inputs of the simulation and physical environments is a feasible approach for training drones to pass through targets with reinforcement learning.