1. Introduction
At the beginning of 2020, a pneumonia epidemic caused by a novel coronavirus broke out globally. Because infections can be latent and asymptomatic, the epidemic spread widely around the world. The outbreak of COVID-19 severely disrupted people's daily travel: before traveling, people must weigh the epidemic risk and travel as healthily and safely as possible. During COVID-19, the Chinese government resolutely adopted travel control measures, such as traffic bans and travel restrictions, to stop the spread of the epidemic. However, necessary travel still plays an irreplaceable role in maintaining residents' normal lives. Taking Beijing, China, as an example (Figure 1), the travel intensity of urban residents dropped sharply during the outbreak. As epidemic prevention and control measures and hospital treatment took effect in China, the epidemic was effectively contained, and residents' travel intensity recovered in the later stage of the epidemic. For residents' travel, however, undifferentiated restrictions and prohibitions not only cause huge losses to daily life but are also detrimental to the commuting needs of front-line staff fighting the epidemic. Therefore, studying risk-averse travel path planning algorithms during COVID-19 is of great significance for people's travel.
The spread of COVID-19 has strongly affected people's travel: travelers must account for epidemic risk and travel as healthily and safely as possible, while necessary travel remains irreplaceable for maintaining residents' normal lives. The study of risk-averse travel path planning algorithms during the epidemic is therefore of great significance for residents' travel. In this regard, some Chinese scholars have proposed safe path planning methods based on the epidemic situation and related indicators [1,2]. Ma Changxi [3] constructed an emergency customized bus route optimization model under the epidemic and designed a genetic algorithm to solve it. Tu Qiang et al. [4] proposed a label-correcting algorithm that can compute path choices under different risk attitudes. Jia Fuqiang et al. [5] described travelers' risk-averse attitudes based on cumulative prospect theory and established a multi-objective decision-making model for risk-averse travel paths. Incorporating decision theory into path planning, Subramani et al. [6] proposed a step-by-step process for computing risk-optimal paths. To solve the dynamic stochastic shortest path problem, Liping Fu [7] proposed a heuristic algorithm based on the K-shortest path algorithm. A. Khani [8] proposed an iterative labeling algorithm for finding non-additive reliable paths from one origin to all destinations, combining decision theory with basic stochastic time-optimal path planning.
Reinforcement learning is one of the paradigms and methodologies of machine learning; it describes and solves problems in which an agent learns a policy that maximizes reward, or achieves a specific goal, while interacting with its environment. In recent years, scholars at home and abroad have studied the use of reinforcement learning for path planning. Some Chinese scholars [9,10,11,12] have improved how the agent is guided to choose actions and how the reward function is set; although their results outperform traditional methods, there is still room for improvement. A. Wang et al. [13] proposed a cross-regional customized bus route planning algorithm based on improved Q-learning during the epidemic. Sharon Levy [14] of the United States introduced a new path generation method based on deep reinforcement learning that successfully optimizes multi-criteria path selection. Cian Ryan et al. [15] proposed a proactive approach to assessing autonomous driving risk based on driving behavior and safety-critical events, processing behavioral data through telematics for risk-aware path planning in autonomous driving. Bdeir A. [16] proposed an autoregressive strategy that sequentially inserts nodes to construct solutions to the capacitated vehicle routing problem (CVRP). Xu [17] built an MRAM model with multiple relationships on top of the AM model [18] to better capture the dynamic characteristics of vehicle routing problems. Zhang Ke [19] proposed an encoder-decoder framework with an attention layer that iteratively generates multiple vehicle tours.
At present, the classification of epidemic risk levels in China is mainly determined by information released by the National Health Commission. When planning travel routes during COVID-19, existing studies at home and abroad consider only the current epidemic indicators. Related research [20] has shown that in confined spaces the novel coronavirus can indeed be transmitted through aerosols, over distances of up to 4.5 m. Since the virus can spread through the air, the areas surrounding epidemic sites must also be treated as risky. When reinforcement learning is used for path planning, some scholars optimize action selection; although this improves convergence speed, the limitations of the reward function mean the resulting path may still pass through risk-related areas. Reinforcement learning is a method in which an agent adapts to its environment through continuous interaction, but its current applications to path planning suffer from slow convergence and unstable path results.
Table 1 compares the limitations of related work.
Therefore, we use the SUMO simulator to build a model of the actual road network and design a method for extracting the road network impedance matrix, which greatly improves the efficiency and accuracy of road network modeling. We establish a reinforcement learning path planning model in an urban traffic context and design a search mechanism that avoids COVID-19 risk-related areas. We design a restrictive search mechanism to initialize the Q table, improving the agent's exploration efficiency in the early stage of learning. We also introduce an artificial potential field function to guide the agent's direction during learning with different probabilities, and explore how this probability affects the convergence results. We use the improved reinforcement learning algorithm to solve residents' travel path planning problems during the epidemic and compare it with other algorithms. The results show that the algorithm accounts for both the rapidity and the safety of the travel path, and that it performs better in terms of convergence speed and convergence stability.
3. Algorithm Design
3.1. Impedance Matrix Generation Method
Since we combine a model-free reinforcement learning algorithm with a road network model, the road network must first be modeled physically. Because manual modeling is too cumbersome, this paper generates the road network impedance matrix from a SUMO simulation. SUMO is an open-source microscopic traffic simulator built on real-world road data, with strong portability and good support for secondary development. The impedance matrix is generated from the SUMO simulation as follows:
Step 1: Use the SUMO simulator to build a road network consisting of nodes and edges;
Step 2: Extract the number and coordinates of nodes in the road network;
Step 3: Parse the road network configuration file and extract the list combination of [edge, start point, end point] of the road network;
Step 4: Calculate the length of each edge from the node coordinates and assign it as the corresponding matrix impedance;
Step 5: Set the impedance of nonexistent roads to infinity.
The structure of the generated matrix is shown in Figure 2.
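As an illustration, the sketch below builds such an impedance matrix with sumolib, SUMO's Python network-parsing library. The file name net.sumo.net.xml and the computation of edge lengths from node coordinates (Step 4) are illustrative assumptions, not part of the original toolchain description.

```python
import numpy as np
import sumolib  # ships with SUMO; parses .net.xml road networks

# Hypothetical network file exported from the SUMO/OSM workflow described above.
net = sumolib.net.readNet("net.sumo.net.xml")

nodes = net.getNodes()
index = {n.getID(): i for i, n in enumerate(nodes)}          # Step 2: node numbering
coords = {n.getID(): np.array(n.getCoord()) for n in nodes}  # Step 2: node coordinates

# Step 5: initialize every entry to infinity (nonexistent road),
# with zero impedance on the diagonal (a node to itself).
n = len(nodes)
impedance = np.full((n, n), np.inf)
np.fill_diagonal(impedance, 0.0)

# Steps 3-4: for each [edge, start, end] triple, compute the edge length
# from the node coordinates and assign it as the impedance in both directions
# (the paper assumes two-way roads of equal length).
for edge in net.getEdges():
    u, v = edge.getFromNode().getID(), edge.getToNode().getID()
    length = np.linalg.norm(coords[u] - coords[v])
    i, j = index[u], index[v]
    impedance[i, j] = impedance[j, i] = length
```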
When we build the road network model in the SUMO simulator, the OSM open-source map is used to define the location of each intersection and the connections between intersections. The locations of the risk areas are recorded in the real epidemic data; we map each actual risk area onto the SUMO network at the same scale as the map, thus defining the risk areas in the model.
3.2. Restrictive Search Mechanism for Initializing Q Table
In order to improve the exploration efficiency of the agent in the early stage of reinforcement learning and accelerate the convergence of the algorithm, we design a restricted search mechanism to initialize the Q table.
Before computing a path, the start and end points must be set. We treat the starting point and the end point as the foci of an ellipse whose major axis has a given length. As shown in Figure 3, with point A as the starting point and point B as the end point, reinforcement learning path planning is carried out. An ellipse with A and B as foci is constructed from the chosen focal length, and all directed edge vectors inside the ellipse are collected. For every such vector whose angle with the start-to-end direction is less than or equal to 90 degrees, the Q value of the corresponding action is initialized to a positive constant instead of the traditional value of 0.
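A minimal sketch of this initialization is shown below. It assumes the impedance matrix and node coordinates from Section 3.1; the major-axis factor of sqrt(2), the initial value of 100, and the helper names are illustrative choices borrowed from the experimental settings later in the paper, not a definitive implementation.

```python
import numpy as np

def init_q_table(impedance, coords, start, target,
                 axis_factor=np.sqrt(2.0), init_value=100.0):
    """Initialize the Q table with the restrictive search mechanism.

    impedance : (n, n) array, np.inf where no road exists
    coords    : (n, 2) array of node coordinates
    start, target : node indices used as the foci of the ellipse
    """
    n = impedance.shape[0]
    q_table = np.zeros((n, n))

    focal_dist = np.linalg.norm(coords[target] - coords[start])  # distance between foci
    major_axis = axis_factor * focal_dist                         # major-axis length
    direction = coords[target] - coords[start]                    # start-to-end baseline

    for i in range(n):
        for j in range(n):
            if i == j or not np.isfinite(impedance[i, j]):
                continue                                           # no such road
            # A directed edge is taken to lie inside the ellipse if both of its
            # endpoints do: sum of distances to the two foci <= major axis.
            inside = all(
                np.linalg.norm(coords[k] - coords[start])
                + np.linalg.norm(coords[k] - coords[target]) <= major_axis
                for k in (i, j)
            )
            # Keep only edges whose angle with the start-to-end vector is <= 90 deg,
            # i.e. whose dot product with the baseline is non-negative.
            if inside and np.dot(coords[j] - coords[i], direction) >= 0:
                q_table[i, j] = init_value
    return q_table
```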
3.3. Improved Artificial Potential Field Method to Optimize Action Selection Strategy
To speed up exploration during the agent's learning, we optimize the agent's action selection strategy with an improved Artificial Potential Field (APF) method. The artificial potential field method was originally proposed by Khatib [26]. Its basic idea is that the target point exerts an attractive ("gravitational") force on the agent, while obstacles, roads, or environmental boundaries exert a repulsive force; under the combined action of attraction and repulsion, the agent is guided in the direction of the resultant force. Since the risk factor is already embedded in the agent's reward when the reward function is defined in this paper, only an improved gravitational field function is embedded in the reinforcement learning algorithm. The commonly used gravitational field function is as follows:
$$U_{att}(S) = \frac{1}{2}\,\xi\,\rho^{2}(S, S_t)$$

In the formula, $\xi$ is the scale factor and $\rho(S, S_t)$ is the distance between the current state of the object and the target. Referring to this definition, the gravitational field function in our scenario is as follows:

$$\vec{U}_{att}(S) = \xi \sqrt{(x_t - x_s)^{2} + (y_t - y_s)^{2}}\;\vec{e}$$

In the formula, $\vec{U}_{att}(S)$ represents the gravitational potential field of the agent in state $S$; $(x_s, y_s)$ and $(x_t, y_t)$ are the coordinates of the agent's current position and of the target position, respectively; $\vec{e}$ is the unit vector of the gravitational field, pointing from the agent's current state toward the target node; and $p$ is the probability of using the artificial potential field function to guide the decision. The strength of the gravitational potential field is measured by its 2-norm, $\left\| \vec{U}_{att}(S) \right\|_2$.
Figure 4 is a schematic diagram of how the improved artificial potential field method guides the agent's decision. In the figure, {o→a, o→b, o→c, o→d} is the set of actions the agent can choose in the current state. The gravitational field points from the current position o to the target position t, and its magnitude is the 2-norm of the gravitational potential field, $\left\|\vec{U}_{att}(o)\right\|_2$. Taking the direction of the gravitational field as the baseline, we draw rays at 45° and 90° on each side of it. The improved APF method then guides the agent's action choice as follows: $a$ denotes the action selected in the current state, and $o \to x$ denotes the movement from point $o$ to point $x$. The 2-norm $\left\|\vec{U}_{att}(o)\right\|_2$ expresses the value of the agent's gravitational potential field at point o, and half of this value is taken as the boundary for action screening. This leaves the agent more freedom of choice in the early stage of the path calculation and steers it toward better actions in the later stage, improving exploration efficiency as it approaches the target.
To make the artificial potential field method more effective, and without loss of generality, we apply the artificial potential field guidance according to a probability: whenever the agent would otherwise choose an action at random, the improved APF method is used instead with a bootstrap probability $p \in [0, 1]$. The larger $p$ is, the more often the APF method is used, and vice versa.
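The sketch below illustrates this probabilistic guidance. It is a minimal interpretation, assuming node coordinates and a candidate action list; scoring candidate actions by their alignment with the field direction is our reading of the screening described above, not the authors' exact formula.

```python
import numpy as np

def choose_action(state, candidates, coords, target, q_table,
                  epsilon, p, rng=np.random.default_rng()):
    """Epsilon-greedy action choice with APF guidance on the exploratory branch.

    state      : current node index
    candidates : list of neighboring node indices reachable from `state`
    coords     : (n, 2) array of node coordinates
    target     : target node index
    """
    if rng.random() < epsilon:
        # Exploitation: pick the candidate with the largest Q value.
        return max(candidates, key=lambda nxt: q_table[state, nxt])

    if rng.random() < p:
        # Guided exploration: the attractive field points from the current
        # node toward the target; prefer the action best aligned with it.
        field = coords[target] - coords[state]
        unit = field / (np.linalg.norm(field) + 1e-12)
        return max(candidates,
                   key=lambda nxt: np.dot(coords[nxt] - coords[state], unit))

    # Plain random exploration.
    return candidates[rng.integers(len(candidates))]
```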
3.4. Dynamic Greedy Strategy
Throughout the reinforcement learning process, balancing exploration and exploitation is crucial to the agent's exploration efficiency and to the convergence speed of the entire learning process. Traditional methods usually adopt a staged-growth ε-greedy strategy to balance exploration and exploitation, but its convergence stability is poor. To improve the smoothness and stability of convergence, we design a dynamic ε-greedy strategy and define the greedy rate as follows:
$$\varepsilon(x) = \varepsilon_{\max}\left(1 - b^{-x}\right)$$

In the formula, $x$ is the number of learning episodes the agent has completed, $b$ is the coefficient of variation, and $\varepsilon$ is the probability that the agent selects the action with the maximum action value, bounded above by the maximum greedy coefficient $\varepsilon_{\max}$. Because the agent must follow the basic law of exploring first and exploiting later, experiments show that the value of $b$ should be greater than 1.
With the other parameters identical, the convergence of the reward value before and after this adjustment is shown in Figure 5. The dynamic greedy strategy clearly converges more stably than the traditional staged-growth greedy strategy.
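A small sketch of such a schedule is given below; the exponential form is an assumption consistent with the description above (growth with the number of episodes, controlled by b > 1, capped at the maximum greedy coefficient), not necessarily the authors' exact expression.

```python
def dynamic_epsilon(episode, b=1.02, eps_max=0.9):
    """Greedy rate that grows smoothly with the episode count.

    episode : number of completed learning episodes (x in the text)
    b       : coefficient of variation, must be > 1 so that epsilon
              starts low (explore first) and rises toward eps_max (exploit later)
    """
    return eps_max * (1.0 - b ** (-episode))

# Example: epsilon rises from 0 toward 0.9 over the 300 training episodes.
print([round(dynamic_epsilon(x), 3) for x in (0, 50, 150, 300)])
```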
3.5. Algorithm Process
Based on the above methods, we propose the Restricted Reinforcement Learning-Artificial Potential Field (RRL-APF) algorithm; its process is shown in Figure 6. Before the agent starts reinforcement learning, the relevant information about the road network and the epidemic risk areas is extracted. The initial reward matrix is established from the road network model: the reward for a nonexistent road is set to a negative constant with a large absolute value and, according to the chosen safety distance, the reward for roads inside epidemic risk-related areas is set to a negative constant with a small absolute value. At the same time, the set of optional actions for the agent in each node state is constructed. The restrictive search mechanism is then used to initialize the Q table, setting to a positive constant the Q values of all directed edges inside the ellipse whose angle with the start-to-end direction is less than 90 degrees. After the start and end point information is set, the reward of the last step before reaching the end point can be enlarged to improve the convergence speed. Reinforcement learning then begins: the agent's current state is initialized to the starting point, and the set of optional actions is filtered according to the current state. Because the agent explores heavily and exploits little in the early stage, whenever it would select an action at random it instead uses the improved APF method with probability p to guide the choice; it then executes the action, updates its state, calculates the reward, and updates the Q table.
Based on experimental comparison, we use the Sarsa algorithm, which is similar to the Q-Learning algorithm [27], to update the Q table. The update rule is as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$

In the formula, $Q(s_t, a_t)$ is the Q value of the current state and the corresponding action, $Q(s_{t+1}, a_{t+1})$ is the Q value of the next state and the action actually taken there, $\alpha$ is the learning rate (learning efficiency), and $\gamma$ is the attenuation (discount) factor.
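For concreteness, a minimal Python sketch of this update step is shown below; the variable names and the terminal-state handling (no bootstrapping once the target is reached) are illustrative assumptions consistent with the pseudocode in Algorithm 1.

```python
def sarsa_update(q_table, s_now, a_now, reward, s_next, a_next,
                 s_target, alpha=0.01, gamma=0.9):
    """One Sarsa update of Q(s_now, a_now).

    If the next state is the target, the episode ends and the update
    uses the immediate reward only; otherwise it bootstraps on the
    Q value of the action actually taken in the next state.
    """
    if s_next == s_target:
        td_target = reward
    else:
        td_target = reward + gamma * q_table[s_next, a_next]
    q_table[s_now, a_now] += alpha * (td_target - q_table[s_now, a_now])
    return q_table
```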
During learning, if the current state is the target state, the current round of training ends; otherwise, the agent continues to select actions from the current state within this cycle. After each training round, that is, once the agent has reached the target from the starting point, the relevant results of the round are recorded, the parameters of the greedy strategy are updated according to the conditions, and the next round begins. The detailed flow of the algorithm is shown in Algorithm 1.
Algorithm 1. Pseudocode of the RRL-APF algorithm.

1. Determine the epidemic risk location information from the road network model extracted with SUMO and set the safety distance.
2. Initialize the reward matrix, build the optional action set, and initialize the Q table with the restrictive search mechanism.
3. Initialize the parameters: learning rate α, attenuation factor γ, number of training episodes N, maximum exploration probability ε_max.
4. Initialize the start and end points: S_start, S_target.
5. For episode = 1 to N do:
6.   Update the parameter ε of the greedy policy.
7.   Build the initial state from S_start: S_now = S_start.
8.   While True:
9.     Filter the action set A = (a_1, a_2, a_3, …) according to the current state.
10.    If random number < ε:
11.      Select the action a_i with the largest Q value in A.
12.    Else:
13.      If random number < p:
14.        Select action a_i according to the improved APF method.
15.      Else:
16.        Select action a_i at random.
17.    Execute the action and obtain the next state S_next.
18.    Calculate the distances of S_now and S_next from the target, observe the locations of the epidemic risk areas in the environment, and compute the reward r of the action from the reward function.
19.    If S_next = S_target:
20.      Q(S_now, a_i) ← Q(S_now, a_i) + α[r − Q(S_now, a_i)].
21.    Else:
22.      Q(S_now, a_i) ← Q(S_now, a_i) + α[r + γ·Q(S_next, a_next) − Q(S_now, a_i)].
23.    Update the state: S_now = S_next.
24.    If S_now = S_target:
25.      Break.
26.  Record the cumulative reward, path length, distance from the risk areas, and number of learning iterations of this training round.
27. End.
28. Output the optimal path scheme and the convergence diagrams for the whole training process.
4. Algorithm Verification
We take the area surrounding Xinfadi, Fengtai District, Beijing, as the actual road network and use the residential areas and places where COVID-19 cases were active in 2020 as the actual data to verify the RRL-APF algorithm. To verify the superiority of the algorithm, we compare it with the traditional Q-learning algorithm, the Sarsa algorithm, and the reinforcement learning algorithm based on the artificial potential field (RLAPF), analyzing path length and travel risk. The experimental environment is the Windows 10 operating system with an AMD Ryzen 7-4800H 2.90 GHz processor and 16 GB of RAM; the programming language is Python 3.8.2, and the IDE is PyCharm 2021 (x64).
The data are provided by the Beijing Municipal Health Commission. The actual road network of the study area is shown in Figure 7. The network has 249 intersections, 770 directed roads, and seven epidemic risk areas and activity venues. The road network model and risk area locations constructed through the SUMO simulation are shown in Figure 8.
The impedance matrix generation algorithm is used for the calculation; the results are shown in Table 2.
In the white area of Table 2, each number represents the distance between two nodes. The diagonal entries are 0 because the distance from a node to itself is zero. The many "inf" entries represent infinity, meaning that the corresponding road does not exist. For example, the distance between node 0 and node 2 is inf, so node 0 and node 2 are not directly connected; the agent cannot choose the action from 0 to 2 because that road does not exist. At the same time, to simplify the calculation while preserving accuracy, this paper assumes that each road is two-way with the same length in both directions; for example, the distances from 0 to 1 and from 1 to 0 are equal.
The extracted epidemic risk areas and the activity places of infected persons are shown in Table 3. The coordinates of each risk area are obtained by mapping the centroid of the COVID-19 risk area onto the SUMO simulator.
Since the road network has 249 nodes, the reward matrix is defined as a 249 × 249 two-dimensional matrix. When initializing the reward matrix, we set the reward corresponding to nonexistent roads and repeated nodes to a large negative number and build the optional action set on this basis, as shown in Table 4. Assuming three different travel scenarios, the coordinates of the different start and end points in the SUMO simulator are extracted, as shown in Table 5.
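As a sketch, reward-matrix initialization under these rules might look as follows. The specific penalty values, the midpoint-based risk test, and the length-based reward for ordinary roads are illustrative assumptions: the paper only states the signs and relative magnitudes of the penalties and a 200 m safety distance.

```python
import numpy as np

def build_reward_matrix(impedance, coords, risk_centers, safety_dist=200.0,
                        no_road_penalty=-1000.0, risk_penalty=-10.0):
    """Reward matrix for the 249-node network.

    Nonexistent roads (and self-loops) get a negative reward with a large
    absolute value; roads whose midpoint lies within the safety distance of
    any risk area get a negative reward with a small absolute value; other
    roads get a reward that favors shorter edges.
    """
    n = impedance.shape[0]
    reward = np.full((n, n), no_road_penalty)
    for i in range(n):
        for j in range(n):
            if i == j or not np.isfinite(impedance[i, j]):
                continue
            midpoint = (coords[i] + coords[j]) / 2.0
            near_risk = any(np.linalg.norm(midpoint - c) <= safety_dist
                            for c in risk_centers)
            reward[i, j] = risk_penalty if near_risk else -impedance[i, j] / 1000.0
    return reward
```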
After the start and end point information is determined, a Q table with the same dimensions as the reward matrix is established and initialized according to the restricted search mechanism. We take the starting point and the end point as the foci, set the major axis to $\sqrt{2}$ times the focal length to establish the ellipse, and filter out the directed edge vectors inside it. In each scenario, we find the vectors whose angle with the vector from the start point to the end point is less than 90 degrees and set the corresponding Q values to 100.
We used the RRL-APF algorithm to run simulation experiments of up to 300 learning episodes, where one complete episode is the agent's journey from the starting point to the end point through exploration. To verify the superiority of the RRL-APF algorithm in terms of convergence speed and other aspects, it was compared with the Q-Learning [27], Sarsa [28], and RLAPF [9,12] algorithms under the same start and end points and the same epidemic risk location information.
Based on experimental comparison, we set the learning rate α = 0.01, the attenuation factor γ = 0.9, the coefficient of variation b = 1.02, and the maximum greedy coefficient to 0.9. For the RRL-APF method, we set the probability of APF-guided action selection to p = 0.5, the safety distance to 200 m, and the reward transfer coefficient to 0.1.
The specific parameter settings of each algorithm are shown in Table 6. Among them, b is the coefficient of variation in the dynamic greedy strategy proposed in this paper, the vector filter range is the angular range of the "good" vectors selected when initializing the Q table, and p is the probability that the agent uses the improved artificial potential field method when it would otherwise select an action at random.
Because the reinforcement learning algorithms select actions randomly under certain conditions, every learning run involves some randomness, and a single run cannot establish the merits of an algorithm. Therefore, with the learning rate, attenuation factor, and other parameters kept the same, we ran the four algorithms ten times each and averaged the reward value, total path length, and training time over each learning process, as shown in Figure 9, Figure 10 and Figure 11.
Taking Scenario 1 as an example, Figure 9 shows that the RRL-APF algorithm proposed in this paper outperforms the other three algorithms in the convergence speed of all three indicators.
- (1)
In the reward convergence graph, the RRL-APF and RLAPF algorithms show the same convergence trend and are significantly higher than the Q-Learning and Sarsa algorithms in the initial stage. This indicates that adding the artificial potential field method yields an obvious advantage over the traditional algorithms: by reducing early trial-and-error behavior, the reward value is increased by approximately 1167%.
- (2)
In terms of path length, compared with the other three algorithms, the path length of the Q-Learning algorithm increases significantly when the number of learning episodes is between 80 and 170, and its convergence fluctuates greatly in the later stage of training. This shows that the Q-Learning algorithm is more volatile than the other three algorithms, so, for simplicity, the comparative analysis below covers the other three algorithms.
Taking a total path length of 10,596 m as the benchmark, Sarsa needs 100 training episodes and RLAPF needs 82 to reach it, while our algorithm needs only 45, a reduction of 55% and 45%, respectively. This is because the RRL-APF algorithm initializes the Q table with the restricted search method before the agent starts training, and during training the improved artificial potential field method guides the agent toward more favorable actions, which effectively shortens the convergence time of the whole training process.
- (3)
According to the single-training-time results, our algorithm reduces the time per training episode by 39% and 26% on average compared with the Sarsa and RLAPF algorithms, respectively, indicating higher computational efficiency and faster convergence.
Part of the Q table after reinforcement learning in Scenario 1 is shown in Table 7. Each value in the table represents the value of the corresponding action of the agent after the maximum number of learning episodes has been reached.
Figure 12, Figure 13 and Figure 14 show the path lengths and the average distances from the risk areas computed by the four algorithms in each scenario. Taking Scenario 1 as an example, the corresponding travel route map is shown in Figure 15. The "risk areas" are those announced by the National Health Commission of China. China implemented closed-off management of high-risk areas during the epidemic, specifically the measures of "no entry, no exit" or "entry only, no exit". Taking Beijing residents as an example, people who merely passed through or near such an area could find their "Beijing Health Kit" flagged as abnormal, and during the epidemic, citizens in Beijing had to show the "green code" (an indication of normal personal health status) to enter any public place, including shopping malls and supermarkets. The results show that the algorithms that do not consider epidemic risk still produce paths that pass through epidemic-related areas, which may expose travelers to the risk of infection and to abnormal personal health codes. At the same time, compared with the other three algorithms, the average total path length of the RRL-APF algorithm under the same start and end points decreased by 3.58%, 1.82%, and 7.51%, respectively, while the average distance from the risk areas increased by 7.77%, 4.33%, and 10.32%. The RRL-APF algorithm proposed in this paper therefore accounts for both travel distance and travel risk, keeping the travel distance low while ensuring sufficient safety.
Figure 16 and Figure 17 show the path calculation results in Scenario 2 and Scenario 3. As the figures show, when the distance between the start and end points is short or few alternative paths are available, our algorithm and the other three algorithms may produce the same path. For example, in Scenario 3 the four algorithms find the same optimal path, yet according to Table 8 the average path length and the average distance from the risk areas still differ. Compared with the other three algorithms, the average total path length of the RRL-APF algorithm under the same start and end points decreased by 7.93%, 6.52%, and 7.33%, respectively, and the average distance from the risk areas increased by 19.9%, 11.8%, and 9.62%. Even so, the path results still differ markedly from those of the algorithms that do not consider the epidemic risk or only avoid the designated risk areas, showing that our algorithm helps travelers avoid epidemic risk as much as possible and maximizes their safety.
The RRL-APF method designed in this paper mainly comprises the restricted search mechanism and the improved artificial potential field method. To verify the effectiveness of these two components, we set up ablation experiments.
Figure 18 shows the ablation results in Scenario 1. RRL-APF is the full algorithm designed in this paper, RR is the variant with the artificial potential field method removed, and APF is the variant without the Q table initialized by the restrictive search mechanism. The convergence curves show that the variants using the artificial potential field method converge significantly faster than the one without it: in terms of reward value, path length, and single training time, the convergence time is reduced by 71%, 52%, and 60%, respectively, which proves the effectiveness of the artificial potential field method. In terms of reward value and training time, the convergence trends of APF and RRL-APF are roughly the same; in terms of path length, however, the green curve lies almost entirely below the orange curve, and the convergence time with the restrictive search mechanism is approximately 38% shorter than without it, which proves the effectiveness of the restrictive search mechanism.
Under the premise of random action selection, in order to explore how the probability with which the agent uses the artificial potential field method affects the convergence results, we added a set of controlled experiments with different probabilities. To make the results clearer and more intuitive, we fitted the convergence curves; the fitted curves are shown in Figure 19. The reward convergence graph shows that the early-stage reward with a probability of 0.9 is significantly lower than with probabilities of 0.5 and 0.7. From the middle stage onward, however, especially after the agent has learned 25 times, the reward values are ordered roughly as (p = 0.9) > (p = 0.7) > (p = 0.5), showing that in the middle and late stages of learning the reward value increases with the probability of using the APF. In terms of path length and single-training counts, the results for p = 0.9 are larger than in the other two cases in the early stage of learning, but the convergence trends of all three are roughly the same in the middle and late stages.