1. Introduction
Public bicycles (docked) and shared bicycles (dockless) are low-carbon, green, short-range means of transportation and an important part of an urban green transportation system, helping to solve the "first kilometer" and "last kilometer" problems [1]. Since 2016, over 1000 shared bike systems have been deployed in more than 60 countries worldwide to promote green transportation and healthy lifestyles [2]. However, users are often dissatisfied when they cannot rent a bike at a station that has run out of bikes or return one to a station that has no free space. This reduces bicycle usage and leads to issues such as customer loss and poor management [1,3,4]. To resolve this dissatisfaction and remain viable, the Public Bike Company (PBC) must rebalance user demand against station supply, a task known as the Bike Rebalancing Problem (BRP). Before solving the BRP, it is worth identifying the three main causes of the imbalance. First, the travel pattern of bike users is unidirectional rather than bidirectional, which is what creates imbalances. Second, the difference in the functional areas surrounding stations is an inherent factor: cyclists travel from residential stations to office stations in the morning, and the reverse occurs in the evening. Finally, user behavior exhibits morning and evening peaks in the time domain. These factors intensify the imbalance between supply and demand at bicycle stations. In summary, the mismatch between the demand a station can serve and the actual demand from users urgently needs to be resolved. Since these causes cannot be removed at their origin, the BRP must be addressed directly to guarantee people's travel requirements and the quality of bicycle service. In this article, we propose a new method for solving the BRP based on Reinforcement Learning (RL) theory [5].
According to when the rebalancing task is performed, the BRP is divided into the Static Bike Rebalancing Problem (SBRP) and the Dynamic Bike Rebalancing Problem (DBRP). In the first type, during off-peak hours of daytime operation or at night, trucks transport excess bicycles to stations that lack bikes. To minimize balancing time or improve user satisfaction, an optimal truck route is found and the bicycles are redistributed among the stations. This type of method requires that the nodes visited by the truck are not duplicated, which is generally a variant of the TSP [6]. The second approach is to hire bicycle users to balance the number of bicycles at certain stations (or areas). Users can see the tasks published by the system in a smart app, where each task is represented by a red envelope icon; clicking the icon shows the task content and the compensation. Stations or bicycles with a red envelope attached are often jokingly referred to by users as red envelope stations or red envelope bikes. Current research on the DBRP is much scarcer than that on the SBRP. However, the DBRP has an apparent advantage: the unit cost of truck-based rebalancing is much higher than the unit cost of user-based rebalancing. Adopting the user-based approach to reduce rebalancing costs, we establish the Bike Rebalancing Incentive System (BRIS) to address the BRP.
The Bike Rebalancing Incentive System (BRIS) involves two adversarial parties: bicycle operators and bicycle users. Operators release balancing tasks and present them as red envelope stations or red envelope bikes. After completing a task, a bike user receives compensation from the red envelope station or bike, while the bike sharing system (BSS) has its rebalancing work performed. Setting red envelopes requires us to consider their impact on the cycling environment. Firstly, the red envelope system can quickly accumulate bicycles in, or evacuate them from, a certain area; Fricker and Gast found that after users participate in rebalancing work, the performance of the BSS improves significantly. Secondly, the red envelope is essentially a commodity [7]: its price fluctuates and is not fixed. When considering whether to accept a rebalancing task, a user weighs it against a personal cost [8]. When the red envelope price is higher than the personal cost, the user accepts the task; otherwise, the user refuses it and the supply-demand conflict in the bicycle system persists. The price is thus related to task completion. Moreover, the price of a red envelope task can be associated with the degree of urgency reflected by the state of the bicycle environment; when releasing tasks, it is necessary to consider the current situation and solve the BRP from a long-term perspective [9]. Finally, bike stations are closely connected, which can lead to cutthroat competition among the users of red envelope stations. This article develops a pricing strategy for red envelope station users, called an Incentive Strategy (IS). We build incentive strategies on Markov decision process (MDP) theory [5] and focus on the following three issues.
Question 1. How to evaluate an incentive strategy?
Ghosh et al. proposed an online scenario generation method that formulates rebalancing strategies against virtual worst-case demand scenarios [10]. In this article, we establish a simulation of the bicycle environment, called Bike Gym, which contains the environmental information and the operational rules of the entire bicycle network. Each incentive strategy produces a scenario, from which we can obtain user satisfaction with bicycle journeys and user acceptance of rebalancing tasks, and we evaluate the effectiveness of an incentive strategy based on this user feedback. We establish Bike Gym because, in the real world, an incorrectly formulated incentive strategy can cause irreversible losses. In addition, each incentive strategy and its feedback are valuable experience to learn from, and we develop the optimal incentive strategy from extensive accumulated experience.
Question 2. How to improve incentive strategies?
Firstly, the flow at self-operated stations, the current inventory and maximum storage capacity of each station, and nearby environmental and historical information all directly affect the formulation of strategies. We use a multidimensional vector to represent the feature information related to the rebalancing task. The temporal information of stations is extracted by a GRU [11], a variant of the long short-term memory (LSTM) network [12]. In addition, the interaction between nodes, represented by the neighbor matrix A, also affects the formulation of strategies. The spatial information of stations is extracted by a graph convolutional network (GCN) [13,14] using matrix A; GCNs have also been applied to prediction problems in bike networks [15,16,17]. Finally, when formulating a strategy, we should consider its long-term future benefits rather than just the current feedback. The advantage of planning ahead is that the tide phenomenon of bicycles can be handled more calmly.
Question 3. How to deal with the curse of dimensionality that may arise in incentive strategies?
The high-dimensional node features and the large number of nodes together result in a very large state space and numerous incentive strategies in MDP theory [5]. Even with the help of computers, it is difficult to obtain the optimal solution. Seijen et al. proposed and verified that a deep Q network (DQN) [18] can decompose the overall reward function into multiple components [19]. Based on this hierarchical idea, we use a combination of low-dimensional functions to approximate our model's action and value functions, which prevents the curse of dimensionality.
In BRIS, the most crucial task in solving the bike rebalancing problem is formulating the optimal incentive strategy, which is one of the core contributions of this article. We therefore propose a hierarchical reinforcement learning structure, the spatiotemporal rebalancing pricing algorithm (STRPA), that extracts spatiotemporal features, as shown in Figure 1. This model extracts the spatiotemporal features relevant to rebalancing and optimizes the incentives. The rest of the article is organized as follows: related work is reviewed in Section 2; Bike Gym, which generates interactive experience, is explained in Section 3; the STRPA is explained in Section 4; Section 5 describes the experiments; Section 6 presents the results; and Section 7 concludes.
The main contributions of this article are as follows:
Two sub-modules, Bike Gym and STRPA, are established in BRIS.
From the perspective of allocating user demand based on the spatiotemporal characteristics of bicycles, the bike rebalancing problem is solved within the MDP framework.
The hierarchical principle is applied to overcome the curse of dimensionality caused by the complex state space.
A matrix representing the complex node relationships is introduced. This broadens the scope of the algorithm, allowing it to be applied to non-Euclidean structured networks such as complex networks. As an experiment, BRIS is applied to the Nanjing public bike data set in this article.
3. Bike Gym
Here, we design a general environment for the bicycle network, called Bike Gym. Why do we name the bicycle environment Bike Gym? When solving RL problems with Python, the gym package is imported; it provides ready-made environments and supports creating new ones. We therefore name the part that imitates the environment Bike Gym. Bike Gym is a program that creates a computer simulation of the system, and many related definitions and theoretical foundations are needed in the process. We therefore use the functions of the program as partitions and introduce the corresponding concepts under each of them. The running framework of Bike Gym is demonstrated by Algorithm 1. It mainly provides initialization (the initialize function), execution (the step function), resetting (the reset function), storage (the memory pool), and other abilities. Initialization corresponds to the initialize function. It defines the structure of variables, for example, the bike trajectory (see Definition 1). It defines the bike context environment, containing the locations of bikes, weather data, day-of-week data, and so on. It defines the meaning of variables, for example, the time distance (see Definition 2). It defines the values of variables, for example, the delta in Equation (1). It is worth mentioning that we import a large amount of real bicycle trajectory data during this process, which allows filtering and constructing new bicycle trajectory data based on parameters. Execution corresponds to the step function: the current state s becomes the next state s' after action a, and the system provides feedback r. This is the core of the gym, including the operation rules of the bicycle network described in Section 3.2 (see Algorithm 1). Resetting corresponds to the reset function. The chain of states, actions, and rewards (see Figure 2) is not infinite in the real world, and we need different values of the initial state. The current episode can therefore be interrupted, and the initial state and the context information can be reset, for example, to the data pertaining to a given week. Storage corresponds to the memory pool. An experience tuple (s, a, r, s', done) is built and stored in the memory pool. The memory pool has a fixed capacity: when it is full, the oldest experience tuple is discarded as new experience is added. This rule also ensures that the model keeps learning from new experience. Refer to Section 3.4 for details.
Algorithm 1 Bike Gym.
Input: current state s, action a
Output: next state s', reward r
1: Initialize the bike temporal network and set the context features
2: Initialize the user-choice variables to 0
3: Hyperparameters: node number N, time period, penalty
4: Extract the bicycle usage status from the bicycle trajectories and obtain the time behavior sequence, whose elements record that a user borrowed or returned a bike at a station at time T; M represents the behavior of borrowing or returning
5: for each time step do
6:   Match a user u and the user's location to a node of V
7:   Obtain the user's generalized cost according to Equation (2)
8:   Recommend suitable stations to user u based on the user's location and bike usage pattern
9:   Calculate the user's utility cost and obtain the user's choice according to Equation (3) or Equation (4)
10:  Obtain the current reward based on user satisfaction
end for
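To make the structure of Algorithm 1 concrete, the following minimal Python sketch mirrors the initialize/step/reset/memory-pool roles described above. The class name BikeGym, the half-full initial inventories, the 144 ten-minute slices per day, and the ±1 reward are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

class BikeGym:
    """Minimal sketch of the simulation environment of Algorithm 1 (not the authors' code)."""

    def __init__(self, n_stations, capacity, trajectories, memory_size=10000):
        self.N = n_stations                      # number of stations
        self.capacity = capacity                 # max bikes per station (length-N list)
        self.trajectories = trajectories         # historical (origin, dest, t_start, t_end) tuples
        self.memory = deque(maxlen=memory_size)  # fixed-size memory pool (oldest experience dropped)
        self.reset()

    def reset(self, context=None):
        """Reset inventories and context (weather, day of week, ...) to start a new episode."""
        self.t = 0
        self.bikes = [c // 2 for c in self.capacity]   # assumption: stations start half full
        self.context = context or {}
        return self._state()

    def _state(self):
        # State: time slice, inventories, capacities, context features (Definition 5).
        return (self.t, tuple(self.bikes), tuple(self.capacity), dict(self.context))

    def step(self, action):
        """Apply one incentive strategy (per-station bonuses) for one time slice.
        In the full system `action` would modify origins/destinations via the user-choice
        rule of Section 3.2; this sketch only simulates plain borrow/return feedback."""
        reward = 0.0
        for origin, dest, t_s, t_e in self._requests_at(self.t):
            if self.bikes[origin] > 0 and self.bikes[dest] < self.capacity[dest]:
                self.bikes[origin] -= 1
                self.bikes[dest] += 1
                reward += 1.0        # successful borrow and return
            else:
                reward -= 1.0        # failed borrow or return (negative feedback)
        self.t += 1
        done = self.t >= 144         # assumption: 144 ten-minute slices per day
        return self._state(), reward, done

    def remember(self, s, a, r, s_next, done):
        """Store an experience tuple in the memory pool (Section 3.4)."""
        self.memory.append((s, a, r, s_next, done))

    def _requests_at(self, t):
        # Historical trajectories whose start slice equals t (heavily simplified sampling).
        return [tr for tr in self.trajectories if tr[2] == t]
```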
3.1. Gym Initialization
Popular bikes are divided into public bikes (docked) and shared bikes (dockless). A public bike system takes the station as the node of the network, while a shared bike system often takes an artificially divided grid cell as the node, following complex network theory [30]. The matrix of a complex network can uniformly express the relationships between nodes. For Bike Gym to imitate the environment as it usually runs, it must be ensured that the data are actual data and that the operation rules of the environment conform to reality. Next, the relevant information used in Bike Gym is defined.
Definition 1 (Bike trajectory). Take the bicycle station as the node of the network, and let the set V denote all nodes in the network. Each bicycle trip can be represented by a four-tuple (v_s, v_d, t_s, t_e), where v_s indicates the source station, v_d the destination station, t_s the start time, and t_e the termination time.
A temporal network [31] contains a node set V and an edge set E. The nodes represent the bike stations, and the edges represent the bike trajectories. Notably, each edge carries time information.
Definition 2 (Time distance). The average riding time between two nodes is their time distance; d(v_i, v_j) represents the time distance between node v_i and node v_j.
The smart system recommends nearby nodes to users based on the time distance between nodes, and a new bike path is generated in the bike environment if a user accepts the recommendation. The geographical distance between nodes is not the straight-line distance on the map but the length of the actual cycling path. Because cycling speed is stable, cycling distance and cycling time are directly proportional, so we use cycling time to represent the distance between nodes. Each node has a numeric label, a name, longitude and latitude coordinates, and other information, so the Baidu Maps app can also be used to select an appropriate route when calculating the time distance between nodes. The resulting matrix represents the time distances between nodes. In a complex network, if node v_i and node v_j are not related, their relationship is indicated by 0; otherwise, it is indicated by a nonzero value. Here, the relationship between nodes is determined by their time distance: if node v_i can affect node v_j, there is a connection between the two nodes, and node v_j is called a neighbor of node v_i. The formal definition is as follows.
Definition 3 (Neighbors of nodes). If the time distance between node v_i and node v_j is less than a given threshold, then v_i and v_j have a neighbor relationship; otherwise, they do not.

For docked public bikes, the station's capacity is the maximum number of bikes it can store; for dockless shared bikes, the actual geographical area of the storage location determines this maximum. C_i indicates the maximum number of bicycles that station v_i can store, and c_i^t indicates the number of bicycles that station v_i owns at time t; therefore, 0 ≤ c_i^t ≤ C_i. Finally, e_t represents the external environment characteristics of the entire bicycle network at time t.
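Definition 3 amounts to thresholding the time-distance matrix. The helper below is one hedged way to build the neighbor matrix A (and the weighted variant used later in Section 5.4); the function name, the min-max weighting, and the added self-edges are assumptions for illustration.

```python
import numpy as np

def neighbor_matrix(time_distance, threshold, weighted=False):
    """Build the node-relation matrix A from pairwise time distances (Definition 3).

    time_distance: (N, N) array of average riding times between stations.
    threshold:     maximum time distance for two stations to be neighbors.
    weighted:      False -> 0/1 matrix; True -> closer neighbors get larger weights.
    """
    A = np.zeros_like(time_distance, dtype=float)
    close = (time_distance > 0) & (time_distance <= threshold)
    if weighted and close.any():
        d = time_distance[close]
        # min-max scale so the closest pair gets weight ~1 and the farthest neighbor ~0
        A[close] = 1.0 - (d - d.min()) / (d.max() - d.min() + 1e-9)
    else:
        A[close] = 1.0
    np.fill_diagonal(A, 1.0)   # self-edges: A + I, as used by the GCN later
    return A
```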
Our Bike Gym stores a large amount of historical bike trajectory data and basic information about the bicycle environment. The simulation system can therefore extract historical bike paths and generate new ones according to the current environment.
3.2. Operation Rules
For a user, every trajectory contains an origin node and a destination node, called an OD pair. The method in this paper adjusts OD pairs by giving users a bonus. As shown in Figure 3, node v is the user's original origin node, and there are two candidate nodes; the user deliberately selects whichever candidate is in their best interest. Satisfying user demand, or reassigning the number of bikes at nodes, can thus be achieved by having users detour through particular nodes. In this process, Bike Gym receives the incentive strategy and determines the price for every node, which is shown to users. Whether a user accepts a task depends on the user's private cost. Since private costs are not observable, we make the following assumptions.
Assumption 1. We extend the convex user cost structure proposed in [4] and introduce the variable of user age. The cost to a user of walking a distance x can be expressed as in Equation (2), where α is a coefficient and the resulting value is the user's generalized cost. If the distance between a user and a red envelope bike is below a lower threshold, the user's private cost is taken to be zero; if the distance exceeds the user's endurance distance, the user does not choose to detour to the node.

Definition 4 (Action). a_t represents the action that Bike Gym implements at time t; its components represent, for each station v_i, the bonus given to users borrowing bikes at v_i and the bonus given to users returning bikes at v_i.
The users in the neighborhood of station v_i can look up the incentive information in the smart application. Driven by greedy psychology, users choose the station they benefit most from, and the new station can replace the old one. The effective benefit to a user is the difference between the money given by the operator and the task cost. The scheme by which a user selects a new station, detouring at either the source or the destination to maximize their benefit, is expressed by Equations (3) and (4), where the argmax operator (the inverse of the maximum function) selects the station and the quantities compared are the users' utility costs. Equations (3) and (4) are the rules by which users choose candidate nodes.
Assumption 2. Because our goal is to reduce the cost of bicycle rebalancing by incentivizing users, there is a ceiling B on the daily balancing cost, and B is lower than the cost of rebalancing with trucks. The total balancing cost of all nodes in the network over one day is therefore constrained by B, as expressed in Equation (5). When Bike Gym receives an incentive strategy, each user makes the greedy optimal choice. However, not every user can execute this optimal selection, because users are constrained by the total rebalancing budget: when the remaining budget in Bike Gym is insufficient, rewards cannot be promised to users who perform tasks, and the agreement between users and the agent is broken. Bike Gym also incorporates contextual information such as weather and temperature; if it suddenly rains while a user is performing a rebalancing task, the user fails to complete it. In addition, a user may be unable to complete a task when there is no bike available at the pickup station or no parking space at the return station. Therefore, the bike environment needs to provide different feedback depending on whether the task is completed.
Each user provides feedback on a node. The basic standard is whether the user can successfully borrow or return a bike. Users participate in the rebalancing work by detouring through nodes; successful feedback is represented by 1, and if a user fails to borrow or return a bike, the feedback is represented by a negative value.
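The operation rules above reduce, for each user, to a greedy comparison of the offered bonus against the private cost under the daily budget of Assumption 2. The sketch below is one possible reading; the function name, the dictionary-based inputs, and the simple tie-breaking are assumptions rather than the paper's exact rule.

```python
def user_choice(old_station, candidates, bonus, walk_cost, budget_left):
    """Greedy detour decision for one user (Assumptions 1 and 2, Equations (3)/(4), simplified).

    candidates:  neighbor stations the app can recommend.
    bonus:       dict station -> red-envelope price offered by the operator.
    walk_cost:   dict station -> the user's private (generalized) cost of walking there.
    budget_left: remaining daily rebalancing budget B.
    Returns the chosen station and the bonus actually paid.
    """
    best_station, best_utility = old_station, 0.0
    for v in candidates:
        utility = bonus.get(v, 0.0) - walk_cost.get(v, float("inf"))
        if utility > best_utility:              # switch only when the bonus beats the private cost
            best_station, best_utility = v, utility
    paid = bonus.get(best_station, 0.0) if best_station != old_station else 0.0
    if paid > budget_left:
        # Budget exhausted: the promised reward cannot be honored, so the user keeps the
        # original station and the system receives negative feedback.
        return old_station, 0.0
    return best_station, paid
```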
3.3. System Dynamics
O_i^t denotes the number of users picking up bikes at node v_i during time slice t, and I_i^t denotes the number of users returning bikes at node v_i during time slice t. O_{ij}^t denotes the pickup requests moved from station v_i to station v_j, and the adjusted pickup demand of a station after the price incentive is the new request value; similarly, I_{ij}^t denotes the returns moved from station v_i to station v_j, and the adjusted return demand after the price incentive is the new return value. Considering the bicycle infrastructure, actual road conditions, and user psychology, Bike Gym simulates the bicycle trips between the stations of the city. The request behavior, the return behavior, and the number of bicycles stored at each station evolve as described by Equations (7)-(9).
The sum of requests redirected from node v_i to other nodes during time slice t, and the sum of requests redirected from other nodes to node v_i, both enter Equation (7): its left-hand side represents the new user demand at the node during slice t. The meaning of Equation (8) is analogous for returns. Moreover, the time behavior sequence of node v_i during slice t records all borrowing and returning events, for example, one bike borrowed at one moment, another borrowed later, and one returned after that; the total number of bikes borrowed at node v_i during slice t is the sum of the borrow events in this sequence. Equation (9) represents the change in the number of bikes stored at node v_i.
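One plausible way to write the conservation relations described here, under assumed notation (O_i^t and I_i^t for pickups and returns, O_{ij}^t and I_{ij}^t for redirected requests, hats for post-incentive quantities, and c_i^t for the inventory with capacity C_i), is:

$$\hat{O}_i^t \;=\; O_i^t \;-\; \sum_{j \neq i} O_{ij}^t \;+\; \sum_{j \neq i} O_{ji}^t,$$

$$\hat{I}_i^t \;=\; I_i^t \;-\; \sum_{j \neq i} I_{ij}^t \;+\; \sum_{j \neq i} I_{ji}^t,$$

$$c_i^{t+1} \;=\; c_i^t \;-\; \hat{O}_i^t \;+\; \hat{I}_i^t, \qquad 0 \le c_i^{t+1} \le C_i .$$

The first two relations would play the role attributed to Equations (7) and (8), and the last that of the inventory update in Equation (9).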
Definition 5 (State). We discretize time evenly into T parts, and all periods are represented by the set {1, 2, ..., T}. The state of the environment at time t is characterized by a tuple containing the station inventories c_i^t, the maximum capacities C_i (which are constants), and e_t, which represents the weather and other context environments. The context environment can be considered invariant as long as the overall conditions are suitable for riding bikes.
Definition 6 (Reward). Based on the current state s_t and the action a_t, Bike Gym returns the reward r_t.
Based on the observed state s_t, a set of actions is passed to the step function of Bike Gym. According to the operation rules of the environment, the step function outputs the reward r_t and updates the system state to s_{t+1}. It is worth noting that the next state here is obtained not by computing a transition probability but by system simulation. We put the interactive experience into the memory pool, including the current state, action, reward, next state, and a stage flag; the flag is 0 if there is a next state and 1 if the next state is the termination stage.
3.4. Memory Pool Initialization
Our goal is to propose a new pricing algorithm for the bicycle rebalancing incentive system (BRIS) that promotes the rebalancing of bicycle supply and demand under a limited budget and with long-term interests in mind. The optimization objective is therefore as follows:
Facing a city-scale number of stations and a fine time division, the critic network eventually guides us to an improved incentive strategy, but the process involves many trial-and-error tests and consumes a lot of memory. Efficiency is improved if a good direction is provided at the start. We therefore carry out the following operations.
- step 1: The range of incentive prices is limited by a maximum value and a minimum value. For convenience of calculation, the incentive action of a station is discrete, taking values on a fixed grid within this range.
- step 2: Bike trajectories are not affected by the size of the time interval, so the time interval is enlarged during the initial exploration of actions. The bike operator classifies stations into four categories according to the frequency of borrowing and returning and the differences in traffic flow; we select the key-protection stations and their neighbors as the subset of stations for testing (see Section 5.1).
- step 3: On this screened subset, we randomly test a finite set of actions and store the valid action sets according to the feedback from Bike Gym.
- step 4: The neighborhood search algorithm [27] is used to expand the time domain, vertex domain, and action domain; a larger and finer screen is used to select actions with good feedback at the intersection, and the filtering results are saved.
- step 5: For all stations, the action values at the screened nodes are those produced by the previous steps and remain unchanged, while the action values at the other stations are generated randomly. The actions of all stations are passed to the step function, and the generated interactive data are stored in the memory pool (a simplified sketch of this seeding procedure is given after this list).
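As a rough illustration of the seeding referenced in step 5, the sketch below discretizes the price range and warms up the memory pool by random screening. It reuses the BikeGym sketch from Section 3, omits the per-type station screening of step 2 and the neighborhood search refinement of step 4, and its names and the trials parameter are illustrative.

```python
import random

def seed_memory_pool(gym, a_min, a_max, step, trials=200):
    """Coarse action screening to warm up the memory pool (steps 1-5, heavily simplified)."""
    grid = [a_min + k * step for k in range(int(round((a_max - a_min) / step)) + 1)]
    promising = []
    for _ in range(trials):
        state = gym.reset()
        action = [random.choice(grid) for _ in range(gym.N)]   # random grid price per station
        next_state, reward, done = gym.step(action)
        gym.remember(state, action, reward, next_state, done)  # every experience enters the pool
        if reward > 0:                                         # keep promising (state, action) pairs
            promising.append((state, action, reward))
    return promising
```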
4. Spatiotemporal Rebalancing Pricing Algorithm
In this section, we mainly introduce how the agent in the MDP determines incentive strategies. The basic principles of the MDP are presented in Section 4.1. We propose a deep reinforcement learning algorithm, the spatiotemporal rebalancing pricing algorithm (STRPA), based on an actor-critic framework [5,27,32]. STRPA plays the role of the agent in the MDP and includes two sub-modules: the actor module and the critic module. The actor module extracts the spatiotemporal characteristics of the bicycle network to determine the current incentive strategy (also called the action in this article), while the critic module evaluates the long-term benefit of the current incentive strategy. Finding the best incentive strategy is the goal of our model. The optimization methods for the two sub-modules are described in Section 4.4. After extensive interaction experience and model optimization, the incentive strategy determined by the improved model better solves the problem of bicycle rebalancing.
4.1. Markov Chain
A four-tuple (S, A, R, γ) represents a Markov decision process [5], where S represents the state space, A the action space, R the immediate reward space, and γ the discount factor. As shown in Figure 3, after the current state s_t receives the action a_t, the environment provides the feedback r_t, and the current state transfers to s_{t+1}. As time goes on, a Markov chain is obtained. Our bicycle rebalancing aims to consider the long-term benefit of performing an action in the current state. In time series prediction, the further into the future we predict, the worse the accuracy; in addition, a gain available now [5] is more important than a future gain. Therefore, γ acts as a gain-loss factor relative to future time.
Lemma 1. The future gain of any Markov chain can be expressed by a formula, and the gain also satisfies a recursive relation. If we can determine the future gain of a Markov chain, we can choose the optimal incentive strategy (action) at each time stage to obtain the Markov chain with the largest gain; from this chain we can read off the incentive strategy for the whole horizon, thus completing the rebalancing task. However, determining the future benefit of every possible Markov chain is a huge project, so we instead work with the expected gain over possible future states.
Lemma 2. The state-action value function [5] of the Markov chain can likewise be expressed by a formula together with its recursive relation. Note that it is a conditional expectation given the state s and the action a.
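Lemmas 1 and 2 presumably refer to the standard return and state-action value recursions of [5]; in the usual notation with discount factor γ they read:

$$G_t \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}, \qquad G_t \;=\; R_{t+1} + \gamma\, G_{t+1},$$

$$Q_{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\,G_t \mid S_t = s,\, A_t = a\,\right] \;=\; \mathbb{E}_{\pi}\!\left[\,R_{t+1} + \gamma\, Q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a\,\right].$$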
The state space and action space in our MDP are continuous, high-dimensional spaces. Even after the numerical discretization method turns the infinite action space into a finite set, and even if the optimal action were selected by the instantaneous reward rather than by the state-action value function, selecting the optimal action would still require a great deal of computation and storage space. Furthermore, the action cannot be determined from the transition probability function, because that function is unknown; estimating the probability of selecting an action through random strategies requires many samples, which is an enormous amount of work with random results. We therefore use a deterministic strategy to describe the relationship between the current state and the action, that is, a = μ(s). This part is defined as the actor module (as shown in Figure 1). To obtain the optimal strategy, we use the state-action value function to guide the improvement of the strategy; this part is defined as the critic module.
4.2. Actor Module
Figure 1b shows the framework of the actor module. The actor module is no longer a traditional MLP; it introduces a GRU [11], an embedding model, and GCN modules [13,14]. The input of the actor module is the state of the environment, defined as a multidimensional vector in Section 3.3. In addition, we found that the more temporal information is included, the more beneficial it is for the model when developing incentive strategies. Mapped to the real world, this reflects the fact that state changes in the bike network have a certain delay because bike journeys span time; they are not instantaneous. Facing the historical state sequence, we use the GRU module to extract the relevant temporal characteristics, where the length of the sequence is the number of historical stages.
In addition, the context information in the bicycle environment, such as weather and temperature, is also related to time. We encode the context information with the embedding model E and add it to the features after the temporal information has been extracted.
The attributes of the nodes that constitute the state vector change along with the relationships between nodes. The environment part of Figure 1a shows a partial node relationship diagram, from which the first- and second-order neighbor relations of a node can be read. The incentive policy of a node is influenced by its neighbors and then, as a secondary effect, by its neighbors' neighbors. Therefore, the graph structure of the entire network ultimately affects the incentive action for each station. We define the matrix A to represent the relationships between the nodes of the network, and the relational matrix with self-edges is A + I, where I is the identity matrix. In graph theory, A represents the relationships between first-order neighbors, and A^2 represents the relationships between second-order neighbors. The graph convolutional network incorporates the information of matrix A, so it can combine the characteristics of nearby nodes. The graph convolutional layer can be written as

$$H^{(l+1)} = \sigma\big(L\, H^{(l)}\, W^{(l)}\big),$$

where L is the Laplacian matrix of A, H^{(l)} is the input of the l-th graph convolutional layer, H^{(l+1)} is its output, W^{(l)} are learnable parameters, and σ is the activation function, for example, tanh; H^{(0)} is the input feature matrix. Not only does the graph convolutional network extract the spatial characteristics of the nodes, it also reduces the difficulties caused by dimensionality: owing to the restriction imposed by matrix A, the price (action) of each node is related only to several adjacent nodes rather than to all nodes.
The actor module thus contains the GRU, the embedding model, and the GCN, and its output is the incentive strategy (action). When the difference between two incentive strategies is very small, the feedback gap from Bike Gym during the model's learning process is also small. Therefore, we add a small perturbation function after the action value to prevent Bike Gym from falling into a dead cycle. The incentive strategy is then passed to the step function of Bike Gym.
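A compact PyTorch sketch of this actor pipeline (GRU for per-node history, an embedding for categorical context, and two graph-convolution layers over a normalized relation matrix L) is given below. The layer sizes, the sigmoid output scaled by a maximum price, and the class and parameter names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Sketch of the actor module in Figure 1b (not the authors' implementation)."""

    def __init__(self, n_nodes, node_feat, hidden, n_context, ctx_dim, a_max):
        super().__init__()
        self.gru = nn.GRU(node_feat, hidden, batch_first=True)   # per-node history -> temporal feature
        self.ctx_emb = nn.Embedding(n_context, ctx_dim)          # categorical context -> dense vector
        self.w1 = nn.Linear(hidden + ctx_dim, hidden)            # weights of graph convolution layer 1
        self.w2 = nn.Linear(hidden, 1)                           # weights of graph convolution layer 2
        self.a_max = a_max                                       # upper bound of the incentive price

    def forward(self, x_hist, context_id, L):
        # x_hist: (batch, n_nodes, T_hist, node_feat); context_id: (batch,); L: (n_nodes, n_nodes)
        b, n, t, f = x_hist.shape
        _, h = self.gru(x_hist.reshape(b * n, t, f))              # h: (1, b*n, hidden)
        h = h.squeeze(0).reshape(b, n, -1)                        # temporal feature per node
        ctx = self.ctx_emb(context_id).unsqueeze(1).expand(-1, n, -1)
        z = torch.cat([h, ctx], dim=-1)                           # append encoded context
        z = torch.tanh(L @ self.w1(z))                            # graph convolution layer 1
        price = torch.sigmoid(L @ self.w2(z)).squeeze(-1)         # graph convolution layer 2
        return price * self.a_max                                 # incentive per node in [0, a_max]
```

In training, a small random perturbation (exploration noise) would be added to the returned prices before they are passed to the step function, as described above.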
4.3. Critic Network
As mentioned in Section 4.1, we need a state-action value function to evaluate the current state and action (incentive strategy) and thereby guide the optimization of the actor module. In the critic module, we establish a deep network to approximate the real state-action value function. According to Lemmas 1 and 2, this deep network should also satisfy the corresponding recursive relationship (Equation (16)).
Next, the state-action value function needs to be computed. As before, facing the state and action spaces of the bicycle environment, the dimensionality problem remains the focus of the solution. Seijen et al. proposed the HRA structure based on DQN [18], which decomposes the overall reward function into multiple components and uses low-dimensional functions to approximate the overall value function [5]. The value function of each component depends on a feature subset, which makes the overall value function smoother. Pan et al. proposed the HRP structure [9] based on the DDPG framework [28], combining the factors of time and space. Adopting this hierarchical structure, in which low-dimensional functions approximate the high-dimensional value function, we propose the general formula of Equation (17) to approximate the whole state-action value function. The left side of Equation (17) represents the overall state-action value function [5]; the first term on the right represents the sum of the value functions of the individual nodes, and the second term represents the sum of the value functions of local nodes. The formula shows that optimizing the overall value function should focus on the sum of the value functions of the local nodes.

In the bicycle network, the cross-neighbors of stations are defined, and the value function of the feature subset of the neighbors is used to smooth the overall value function. Based on the above theory and practice, we apply the general Equation (17) to approximate the state-action value function; the concrete formula used in this module is Equation (18), in which the parameters are learnable and the parallel symbol indicates that the state and action variables are passed in together. The first term on the right side represents the sum of the independent value functions of all nodes; the second term represents, for every node, the value function built from the combined features of its local neighborhood. The processing idea of the critic module in this paper is somewhat similar to that of HRP, but our formula is more widely applicable: when matrix A represents the cross neighborhood relationship, our model can solve the problems that HRP can solve.
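Reading this description literally, the hierarchical approximation of Equations (17) and (18) would take a form like the following, where N(i) denotes the neighbors of node i given by matrix A, the parallel symbol denotes concatenation of state and action features, and the exact parameterization is our assumption:

$$Q(s, a \mid \theta) \;\approx\; \sum_{i=1}^{N} Q_i\big(s_i \,\|\, a_i \mid \theta_i\big) \;+\; \sum_{i=1}^{N} Q_i^{\mathcal{N}}\big(s_{\mathcal{N}(i)} \,\|\, a_{\mathcal{N}(i)} \mid \theta_i^{\mathcal{N}}\big).$$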
4.4. Optimization Objective
We can temporarily adopt a set of evaluation criteria, although they are not good criteria at the beginning. The more accurate the evaluation, the better the action produced by the actor module under the critic's guidance. Therefore, the closer the learned function approaches the real state-action value function, the better the evaluation criteria. According to Equation (16) and the actor module, we obtain the objective function to optimize, in which the immediate reward appears together with the evaluation of the next state and next action; the next action is determined on the next state by the incentive strategy provided by the actor module, and the evaluation of the next state and next action is computed with the evaluation criteria.
The loss function between the real gain value and the estimated state-action value is constructed, and the critic module is optimized through the backpropagation mechanism. The incentive strategy generated under good evaluation criteria is itself good. The critic loss is the MSE (mean squared error) between the target gain and the estimated value. As mentioned above, the higher the evaluation, the better the action, and the loss function of the actor module is defined accordingly. Therefore, the model is divided into four parts: actor, actor-target, critic, and critic-target. The actor-target and critic-target do not have loss functions; their parameters are regularly duplicated from the actor and critic parts. A relatively smooth soft update method is adopted for these parameters, where τ is a hyperparameter.
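For reference, the standard actor-critic (DDPG-style) objectives consistent with this description, with target networks Q' and μ', learnable parameters θ, and soft-update rate τ, are usually written as:

$$y_t \;=\; r_t + \gamma\, Q'\!\big(s_{t+1}, \mu'(s_{t+1}) \mid \theta^{Q'}\big), \qquad L(\theta^{Q}) \;=\; \mathrm{MSE}\big(y_t,\; Q(s_t, a_t \mid \theta^{Q})\big),$$

$$L(\theta^{\mu}) \;=\; -\,\mathbb{E}\big[\,Q\big(s_t, \mu(s_t \mid \theta^{\mu}) \mid \theta^{Q}\big)\big], \qquad \theta' \;\leftarrow\; \tau\,\theta + (1-\tau)\,\theta'.$$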
5. Experiment
5.1. Data Analysis
The data set used in the experiment was obtained from the Nanjing public bicycle company (PBC) and covers 20 September to 26 September 2016. We obtained the user trajectories, including origin and destination stations and spatiotemporal information. After deleting records with inconsistent times or blank stations, there were 980,036 valid records and 1089 actual stations. Each travel record includes the user ID, travel start (end) time, travel start (end) station, station name, station classification, station longitude and latitude, and other information. The bicycle system has 34,729 bicycles (about 35,000), 227,835 users, and 41,672 docking piles, with an average loading rate of 0.83.
Figure 4 shows the spatial locations of the bicycle stations. Stations are divided into four types, indicated by color: small margin type (SD), dispatching demand type (DB), dynamic balance type (SM), and key protection type (KG). The operating company determines the classification standard according to the frequency of borrowing and returning at the station and the difference between inflow and outflow. As seen in the figure, the sets of different station types overlap and cannot be completely separated by geographical location. In the ellipse, the distribution of stations near Zhongshan Road is enlarged; the bubbles have different sizes, and the bubble size indicates the maximum number of bikes stored at the station. This implies that the nodes of the bicycle network are strongly heterogeneous. In addition, bicycle stations are mostly distributed along main traffic roads, and the bicycle journey trajectories (such as the black dotted line in the ellipse) are closely related to the traffic routes. Using the time distance between stations is therefore more reasonable than using the straight-line distance. Since cycling speed is relatively stable, we use the average cycling time between stations to represent their distance, which avoids a large number of track-length calculations.
Next, we examine the temporal characteristics of the borrowing and returning behavior in Figure 5. Figure 5a,b show the borrowing frequency of bicycle stations (the returning frequency diagrams are similar and not shown). The figure shows two peaks, in the morning (7:00-9:00) and evening (17:00-19:00), on every weekday except Thursday; this is called the tide phenomenon. Checking the weather for that Thursday, we find that it rained in the afternoon, which reduced bicycle travel. On weekends, the tide phenomenon is not obvious. This shows that travel demand is strongly affected by commuting. The direct cause of the BRP is the imbalance between user demand and the supply at bicycle stations. The change in the net flow of bicycle stations over time is shown in Figure 5c, and the change in the difference between inflow and outflow is shown in Figure 5d. The morning and evening peaks remain the periods with the largest fluctuation in net flow, and the imbalance on working days is much more serious than on weekends. Figure 5a-d shows the overall behavior of all stations, meaning that most nodes share these characteristics. Figure 4 and Figure 5 show the spatial and temporal characteristics of the bicycle network, respectively. Processing spatiotemporal data is one of the important features of our model and its original motivation. Because our model is based on long-term interests, it can prepare in advance for tidal phenomena and alleviate the imbalance between supply and demand in the bicycle environment.
5.2. Experimental Setup
To solve the bicycle rebalancing problem with deep reinforcement learning, we adopt three existing baselines:
DDPG [28]: the Deep Deterministic Policy Gradient algorithm, which can solve small problems with low-dimensional spaces.
HRA [19]: Hybrid Reward Architecture, which decomposes the whole reward function into many components, each depending only on a subset of all features, so that a deep network can handle the high-dimensional representation.
HRP [9]: the Hierarchical Reinforcement Pricing algorithm, a deep reinforcement learning algorithm that builds upon DDPG and uses a divide-and-conquer structure with an embedded localized module to capture spatial and temporal dependencies.
BRIS rebalances bicycles based on users' greedy behavior. In theory, as long as the incentive is large enough, all users can be motivated to complete all rebalancing work, which shows that our method has feasible solutions. Considering the BRP from the user's perspective, we then adopt the reinforcement learning method to reduce the cost of bicycle rebalancing. Bike Gym should provide strong negative feedback if this purpose is violated. Therefore, a check according to Equation (5) is added to the step function. If the check fails, Bike Gym assigns a large negative value to the exported reward; if the check fails while the memory pool is being initialized in Bike Gym, the memory pool is cleared. During resetting, the threshold range of the incentive strategy (action) is narrowed until the check passes. This forces costs down and steers the optimization of the model in the right direction from the beginning.
The memory pool is a space for storing interaction experience tuples, and the number of tuples it can store is limited. It has a queue structure: when storage is full, the oldest experience tuple is discarded and a new one is added. This ensures that the memory pool always holds the most recent experience for improving the STRPA model. Each experience tuple has the form (s, a, r, s', d), where d indicates whether the episode has terminated. It is worth mentioning that our memory pool differs from the temporary buffer built when initializing actions (as described in Section 3.4).
In the general training process of reinforcement learning, a small exploration probability ε is set: with probability ε the action values are generated randomly, and with probability 1 − ε they are determined by the actor module. This setting increases the possibility of exploration.
At the beginning, therefore, the actor module generates actions randomly without any criterion, and the critic module evaluates them randomly without any target. After many experiments, the actor module and the critic module get on the right track, but for our large state and action spaces this requires a great deal of wasted work. Firstly, the critic module loads experience tuples in batches from the memory pool that Bike Gym effectively initialized, which improves the critic's evaluation criteria. Then, the method of exploring actions is changed: with one probability actions are generated randomly, with another probability actions are obtained by the neighborhood search algorithm, and with the remaining probability actions are determined by the actor module. Random generation expands the scope of action exploration, while the neighborhood search determines locally effective actions. This improves the training speed; however, it is quite possible that the model cannot find the optimal solution and instead outputs a suboptimal one.
Our model requires time to be discretized, and the time interval is set to 10 min. When the GRU extracts temporal features in the actor module, the number of historical stages is set so that we extract the temporal information of the past hour (i.e., six stages). In the critic module, we must import the state and action vectors; when the two are not on the same scale, the accuracy of the state-action value function is compromised. Therefore, we normalize the input data by the maximum and minimum values and define the reverse operation. In addition, the relationship matrix of nodes may be transformed into an unweighted matrix by thresholding or into a weighted matrix through max-min scaling. Normalizing the data benefits the stability of the model and reduces the training difficulty.
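A small utility of the kind implied here (min-max normalization with a stored inverse) could look as follows; the function names and the epsilon guard are our own.

```python
import numpy as np

def minmax_scale(x, lo=None, hi=None):
    """Min-max normalize critic inputs and keep the bounds so the operation is reversible."""
    lo = np.min(x, axis=0) if lo is None else lo
    hi = np.max(x, axis=0) if hi is None else hi
    return (x - lo) / (hi - lo + 1e-9), lo, hi

def minmax_inverse(x_scaled, lo, hi):
    """Reverse operation: map normalized values back to the original scale."""
    return x_scaled * (hi - lo + 1e-9) + lo
```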
5.3. Model Component
Compared with the traditional DDPG model [28], our model introduces substantial changes in both the actor module and the critic module, and these changes improve the accuracy of the model. Therefore, this paper designs two contrast models: STRPA-A and STRPA-C. In STRPA-A, the structure of the actor module remains unchanged, while the state-action value function in the critic module is approximated by Equation (24), in which each node has its own state-action value function and the left-hand side is the state-action value function of the whole network.
In STRPA-C, the structural design of the critic module in STRPA is inherited, while in the actor module the incentive strategy is formulated node by node: each node is treated independently of its neighbors, and the current state of a node alone determines its current policy. This formulation also ignores the heterogeneity of nodes. In these two contrast models, each node is separated independently, which eliminates the impact of matrix A.
5.4. Adjacency Matrix
Both the actor module and the critic module in Section 4 depend on the relationship matrix A, and different matrices represent different relationships between nodes. In this paper, we set up two kinds of matrix relations. The nodes of the bicycle network carry geographical location information, so the distance between nodes can be obtained, and the spatial topology between nodes can be obtained by thresholding this distance. The matrix defined by the proximity relationship between nodes is called the neighborhood matrix (NM) in this paper. However, adjacent nodes are not necessarily closely related, so a second matrix is obtained from the strength of the relationship between nodes; we call it the pattern matrix (PM). According to whether thresholding is used, these two matrices can further be divided into the unweighted neighbor adjacency matrix (NAu), the weighted neighbor adjacency matrix (NAw), the unweighted pattern adjacency matrix (PAu), and the weighted pattern adjacency matrix (PAw). The four adjacency matrices are fed into the model to test its effect.
6. Visualization
In Section 5.2, we set up three baselines. Compared with DDPG, the critic-module structures of the other models are built on the hierarchical principle. Although DDPG can, in theory, complete the reinforcement learning task, the number of nodes during training is too large: as shown by the gray dashed line in Figure 6, the loss value of DDPG fails to decrease for a long time, and the model cannot reach a stable state. With the proposed actor and critic module structures, the loss values of the other models fluctuate and decrease as the number of training rounds increases and finally stabilize. To show the effect of the models more clearly, we report their performance in Table 1. The key performance indicators are Q loss, A loss, MAE, and RMSE. Q loss is the loss value of the critic module of each trained model, and A loss is the loss value of the actor module. MAE and RMSE are the mean absolute error and root mean square error between the reward predicted by the trained model and the reward replayed by the environment. We report the maximum, minimum, and average values of these indicators.
As seen in Figure 6, the loss value of the DDPG model merely fluctuates with the training rounds, whereas the loss values of HRA, HRP, STRPA, STRPA-A, and STRPA-C fluctuate and decrease as the number of training rounds increases, indicating that these models can be optimized and tend toward stability. From the ordinate values in Figure 6a,b, the loss value of the critic network is higher than the actor loss value, because the critic loss depends on the parameters of both the critic network and the actor network. The physical meaning of the actor loss is the average expected state-action value of the nodes; with a fixed number of nodes and an unchanged bicycle environment, a lower expected state-action value corresponds to a lower budget for rebalancing, so the rebalancing cost of the bikes is reduced. The physical meaning of the critic loss is the difference between the state-action value function of the nodes and the true return; the lower this difference, the better the critic network simulates the real return. In Figure 6, STRPA demonstrates outstanding capability in both the actor module and the critic module.
We can see that STRPA, STRPA-A, and STRPA-C become stable after training and can be used to formulate a pricing strategy (action) for node detours in the bicycle network. To compare the models more fairly, we use the same matrix and evaluate the models in different periods of the same bicycle environment. Sub-chains of three periods are extracted from the Markov chain: the morning peak, the afternoon peak, and the evening peak, defined as the morning stage (M), the noon stage (N), and the evening stage (E).
The bar graph in Figure 7 shows the mean MAE and RMSE of the three models in the three periods under the unweighted neighborhood adjacency matrix. In terms of MAE and RMSE, the STRPA model performs best, followed by STRPA-A. In terms of time stage, the best result is obtained in the evening stage. Because bicycle travel is strongly affected by commuting, the node detour strategy takes effect more easily in the evening, thus achieving good results.
The graph convolutional network aggregates information from neighborhood nodes through the adjacency matrix, so different adjacency matrices affect the models differently. As described in Section 5.4, there are four matrices: NAu, NAw, PAu, and PAw. The STRPA model is run with each of the four matrices to explore their influence. The three charts show that the model performs well with different matrices in different periods. In general, the unweighted matrices perform better than the weighted matrices, and the neighborhood adjacency matrices perform better than the pattern adjacency matrices. The unweighted matrix is easier to obtain than the weighted matrix and minimizes the difficulty of the model. The neighborhood matrix performs better because the node relationship that influences detour decisions is more likely the neighbor relationship given by the geospatial structure than the pattern relationship constructed from similarities between node characteristics. From this we conclude that the construction of the matrix should stay close to the physical meaning of the model.
Figure 8a,b and Figure 9 show the reward curves of the three models in the early, middle, and late periods. The STRPA model is the best of the three in terms of both the stable level of the reward and the fluctuation of the curve. The three models differ little in reward within the same period, whereas the same model differs more across periods; this is due to the degree of internal imbalance of the bicycle network, which is independent of the model itself.