1. Introduction
Optimally developing an oil and gas field requires predicting future production using a reservoir model whose key material properties are tuned in a process called history matching. This process of adjusting the key parameters is non-unique and computationally challenging. Typically, the reservoir model is divided into cells that match the geology of the field, and the key properties of these cells, such as porosity and permeability, are assigned initially using core sample data where available. For computational efficiency, the geological model is converted to a reservoir model using upscaling [1,2,3] to reduce the number of cells in the model.
Because of the difficulty of determining the key properties in each cell, history matching is used to adjust the values of these properties so that the model reflects historical production data [4,5,6]. History matching is typically done by taking the pressure and saturation data (oil, gas and water rates) computed by the simulation model and comparing them to the actual historical data. The difference between the actual data and the data generated by the reservoir model is quantified using an objective function that measures the mismatch between the two quantities. The problem of history matching can be expressed mathematically as a nonlinear least-squares optimization problem, where the optimization algorithm minimizes the objective function:
$$O(u) = \sigma \left[ \big(q - f(u)\big)^{T} W_q \big(q - f(u)\big) + \lambda \big(u - u_{pr}\big)^{T} W_u \big(u - u_{pr}\big) \right], \qquad (1)$$

where $\sigma$ is a constant scaling factor used to scale the quantity of the objective function, $q$ is a vector containing the actual historical pressure or saturation quantity, and $f(u)$ is a vector that represents the calculated quantity from the simulation model. $W_q$ is a weight matrix used to assign weights to the production data; it can be used if there is an order-of-magnitude difference in the production data. $u$ is a vector containing the uncertain parameters in the reservoir model, and the rates generated by the reservoir simulator can be expressed as a function of the reservoir simulator with respect to $u$, i.e., $f(u)$. The second half of the equation is a regularization term, where $\lambda$ refers to the regularization parameter, $u_{pr}$ refers to an a priori estimate of the uncertain parameters $u$, and $W_u$ is another weight matrix that can be used to assign weights to each uncertain parameter.
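To make the notation concrete, the following is a minimal sketch of how the mismatch and regularization terms of Equation (1) could be evaluated; the symbol names (sigma, W_q, lam, W_u, u_prior) mirror the reconstruction above and are illustrative, not the authors' implementation.

```python
import numpy as np

def objective(q_obs, q_sim, u, u_prior, W_q, W_u, sigma=1.0, lam=1.0):
    """Weighted least-squares history-matching objective with regularization,
    mirroring the structure of Equation (1). Symbol names are illustrative."""
    mismatch = q_obs - q_sim                  # observed minus simulated rates/pressures
    data_term = mismatch @ W_q @ mismatch     # weighted squared data mismatch
    deviation = u - u_prior                   # deviation from the a priori estimate
    reg_term = lam * (deviation @ W_u @ deviation)
    return sigma * (data_term + reg_term)     # sigma scales the objective down
```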
The history matching problem is an ill-posed problem where multiple solutions can be found to match the historical data [7,8,9]. Due to such ill-posedness, multiple solutions are used to better assess the uncertainty in the forecasts [10,11]. The regularization term in the objective function in Equation (1) is used to mitigate the effect of the ill-posedness of the history matching problem [12,13].
Reinforcement learning is a field of machine learning in which an artificial deep neural network agent learns in a manner similar to the way humans learn: by interacting with an environment through trial and error. The reinforcement learning agent, as shown in Figure 1, observes the state of the environment, takes an action and collects a reward for its action. The agent learns from these interactions how to search the parameter space in order to maximize its rewards. Reinforcement learning gained more popularity after van Hasselt et al. [14] showed that a reinforcement learning agent was capable of achieving superhuman performance just by observing the pixels on the screen. However, reinforcement learning is more generic and is suitable for all problems that can be formulated as a Markov Decision Process (MDP).
Section 2 of this paper reviews the literature on the history matching problem and discusses the current methods used to solve it. Section 3 describes the two reservoir models used as test cases and how their historical data are obtained. Section 4 explains in detail the methodology by which the suggested algorithm is developed and how it can be utilized to find multiple history-matched models. Section 5 presents the results of applying the algorithm to the two datasets and shows how the algorithm scales when increasing the number of computing resources. Finally, Section 6 offers a thorough discussion and analysis of the algorithm and results along with future research directions for improving the algorithm.
2. Literature Review
History matching can be done manually, where an experienced professional tunes the parameters of interest to find a matching model. However, manual history matching can be time consuming and prone to human bias and error [16,17]. History matching can also be done with the help of computers, which is known as assisted or automatic history matching. In automatic history matching, different methods have been used to search for parameter values that minimize the objective function. The methods used in the literature include gradient-based, stochastic and probabilistic algorithms.
Gradient-based algorithms, such as Levenberg–Marquardt, are exploitative by nature in that the algorithm follows the gradient direction in order to search for a minimum. Gradient-based algorithms can be powerful as they can converge quadratically to a local minimum under certain conditions [18]. However, these algorithms are not feasible for large problems with a large number of parameters [19].
Stochastic optimization algorithms, such as the Genetic Algorithm, which is inspired by the biological evolution of genes, can better explore the parameter space. By allowing the parameters to evolve independently and then combining the results to choose the model with the least error, the Genetic Algorithm can handle a large number of parameters [20]. However, such algorithms require a large number of function evaluations compared to gradient-based algorithms [21]. Probabilistic algorithms, such as the Ensemble Kalman Filter (EnKF), use an ensemble of realizations to represent the model uncertainty, where these realizations are updated using a variance-minimizing scheme [22]. EnKF requires fewer function evaluations but may not converge, and the parameters are assumed to have a Gaussian distribution [20,21].
In recent years, machine learning has been employed extensively to tackle research problems thanks to its capacity to map input data to a useful output by identifying certain features. In the literature, machine learning tasks usually fall into one of three categories: supervised learning, unsupervised learning, and reinforcement learning [23]. In supervised learning, a model is trained with labeled data, i.e., examples, in order to later predict new, unlabeled data. In unsupervised learning, the model is trained with unlabeled data points to identify certain features and subsequently classify them into clusters. In reinforcement learning, the model learns a policy that maximizes a certain reward in an environment of interest. The policy is often learned by directly interacting with the environment [24].
Machine learning has been applied to a wide variety of energy-related problems. Montgomery et al. [25] used supervised learning to build data-driven models to forecast shale gas production, and Zhang et al. [26] developed a multi-component method for reservoir characterization using unsupervised learning. Miftakhov et al. [27] used reinforcement learning to maximize the Net Present Value (NPV) of waterflooding by training a reinforcement learning agent to control the water injection rate. Bildirici and Ersin [28] combined deep neural networks with Markov-switching auto-regressive vector models to forecast economic growth and gasoline prices.
Li and Misra [21] showed that the history matching problem can alternatively be solved by formulating it as a Markov Decision Process and then utilizing reinforcement learning methods. While Li and Misra [21] presented a successful proof of concept, their work was limited to a simple model with a small number of parameters. The scalability and applicability of such an approach to a more realistic, complex model is yet to be established. This research paper introduces a novel algorithm in which a parallel stochastic reinforcement learning policy can efficiently find multiple solutions to the problem using a more complex 3D model and up to 27,000 uncertain parameters. Such parallelization allows for utilizing more computing resources to find multiple solutions to large models in a timely manner.
Solving the history matching problem in parallel using deterministic gradient descent algorithms can be complicated, as such algorithms require knowing the previous objective function value before taking a minimization step. Because this kind of problem requires a great number of function evaluations, much research work has been done to parallelize the problem or some aspects of it. In the literature, the Ensemble Kalman Filter was used to solve the history matching problem in parallel [29,30,31]. Tanaka et al. [32] developed an optimization workflow in which the field development optimization can be solved in parallel. Sarma et al. [33] showed that model-based optimization and uncertainty quantification can be massively parallelized across thousands of commercial cloud computing nodes.
However, to the best of the authors’ knowledge, at the time of publishing this paper, no previous work has shown that the history matching problem can be solved in parallel using reinforcement learning. In addition, no work has been done previously to show that employing a stochastic policy in reinforcement learning can lead to finding multiple and different solutions to the history matching problem.
4. Methodology
4.1. Markov Decision Process
An MDP represents the sequential decision-making paradigm in which the actions taken by the agent affect not only the immediate reward but also future long-term rewards and future states [15]. MDPs are often defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $P$ denotes the transition kernel, $R$ denotes the reward function, and $\gamma$ denotes the discount factor. For the transition kernel, we use $P(s' \mid s, a)$ to denote the probability of transitioning from state $s$ to state $s'$ if action $a$ is taken. We denote a stochastic policy by $\pi$, where we use $\pi(a \mid s)$ to denote the probability of choosing action $a$ when in state $s$ under policy $\pi$. Under each policy, the value function $V^{\pi}(s)$ represents the reward-to-go when starting from state $s$ and using policy $\pi$. Specifically, we have

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right],$$

where $r_t = R(s_t, a_t)$, $a_t \sim \pi(\cdot \mid s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and the expectations are naturally taken with respect to the randomness in the policy and the transition kernel. In reinforcement learning, the goal is often to estimate the optimal policy $\pi^{*}$ or the optimal value function $V^{*}$.
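As a small illustration of the value function above, the following sketch accumulates discounted rewards over a single trajectory; it is a generic Monte Carlo estimate of the reward-to-go, not code from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Averaging this quantity over many episodes that start from state s and follow
# policy pi gives a Monte Carlo estimate of V^pi(s).
print(discounted_return([1.0, 0.0, -2.0, 5.0], gamma=0.9))
```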
4.2. History Matching Using Reinforcement Learning
In this paper, we utilize reinforcement learning to solve the history matching problem by formulating it as a Markov Decision Process, as illustrated in Figure 8. The reservoir simulator is used as an environment in which the reinforcement learning agent can take an action $a_t$ and receive the new state of the environment $s_{t+1}$ along with a reward $r_t$ quantifying how good the action taken by the agent is. The current state returned by the environment is a vector containing all the uncertain parameters that need to be tuned in order to match the history and is equivalent to $u$ in Equation (1).
For both cases used in this research paper, the $k_x$, $k_y$, and $k_z$ permeability values of each cell are stacked into a vector as follows:

$$s_t = \big[\,k_{x,1}, \ldots, k_{x,n},\; k_{y,1}, \ldots, k_{y,n},\; k_{z,1}, \ldots, k_{z,n}\,\big]^{T},$$

where $n$ is the number of cells in the model.
This vector is considered the state of the environment $s_t$. The action $a_t$ taken by the agent is also a vector with the same dimensions as the state, allowing the artificial deep neural network agent to act on the uncertain parameters and adjust them to find a solution. At each reinforcement learning time-step $t$, the action taken by the agent is given a reward value $r_t$ to quantify the quality of the action, which allows the agent to learn whether the action it took is good or bad. In this work, the Open Porous Media Flow reservoir simulator [38] is used in the reinforcement learning environment.
Formulating the history matching problem in such a manner allows for the use of reinforcement learning, where the agent learns from the reservoir simulator how to take actions that lead to minimizing the objective function. The agent is allowed to select only actions that are within a certain range, i.e., the agent action space at each entry of the action vector is $[-\delta, +\delta]$. It is important to note that $\delta$ (see Appendix A) is a design parameter that can be chosen by the engineers running the experiment based on their knowledge of the reservoir model at hand. The choice of $\delta$ must be neither too small nor too large. Choosing a small $\delta$ value will change the objective function by a very small quantity, causing very slow convergence, while choosing a large value can encourage the agent to take actions that push the permeability values out of the allowed bounds.
If the agent takes an action that causes the current state to go out of bounds (outside of the possible state space $\mathcal{S}$), then the current state is clipped to satisfy physical restrictions. For example, if the agent takes an action that results in negative permeability values in some of the vector entries, then these values are set to zero in order to avoid feeding the reservoir simulator negative permeability values, which would cause it to crash. The parameter $\delta$ is similar to the step size parameter in gradient-descent optimization algorithms, where selecting a very small value will slow the convergence process and a large value may cause divergence. Another approach that can be used to avoid reaching states with negative permeability values is to change the actions of the agent so that they multiply the current permeability values instead of adding to or subtracting from them. By setting $\delta$ to a small value to avoid large updates, for example, 0.1, the agent can sample a multiplicative action between $1-\delta$ and $1+\delta$ in order to tune the permeability.
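The following is a minimal sketch of the two action schemes discussed above (additive actions in $[-\delta, +\delta]$ with clipping to non-negative permeabilities, and the multiplicative alternative); the helper names and the value of DELTA are illustrative, not the authors' implementation.

```python
import numpy as np

DELTA = 0.1  # the design parameter delta; problem-specific (see Appendix A)

def apply_additive_action(state, action):
    """Additive update: each action entry lies in [-DELTA, +DELTA]."""
    new_state = state + np.clip(action, -DELTA, DELTA)
    # Clip to physical bounds so the simulator never sees negative permeability.
    return np.maximum(new_state, 0.0)

def apply_multiplicative_action(state, action):
    """Multiplicative alternative: scale each entry by a factor in [1-DELTA, 1+DELTA],
    which keeps positive permeabilities positive by construction."""
    factor = np.clip(1.0 + action, 1.0 - DELTA, 1.0 + DELTA)
    return state * factor
```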
The reinforcement learning process consists of episodes, where an episode is defined as a series of reinforcement learning time-steps starting from the initial state and ending when a termination criterion is met. This termination criterion is usually governed by a reward function. Typically, in reinforcement learning, the reward function is designed so that the agent is given a positive reward for desired actions and a negative reward for undesired actions. However, choosing a reward function can be a challenging task, as sometimes there are good actions and even better actions, or bad actions and worse actions, which can be difficult to quantify. Fortunately, in the history matching problem, the quality of an action can be measured directly from the objective function, where the difference between the value of the objective function before the agent takes an action and its value afterwards quantifies the quality of the match.
In the context of applying reinforcement learning to solve the history matching problem, the reward function can be computed as:

$$r_t = O_{t-1} - O_{t}, \qquad (4)$$

where $O_{t-1}$ is computed using Equation (1) and refers to the value of the objective function at time-step $t-1$, before the agent takes an action, whereas $O_{t}$ refers to the value of the objective function after the agent takes an action. This reward function returns a positive value if the agent took a good action that led to reducing the objective function and a negative value if the agent took a bad action that caused the objective function to become larger. The advantage of using this reward function is that it assigns larger rewards to actions that lead to a larger reduction in the objective function, thus helping to guide the agent to take actions that speed up the convergence process. Assigning a higher reward to an action increases the probability of choosing that action again given the same state. The value of the objective function is scaled down using the scaling factor $\sigma$ in order to avoid feeding the deep neural networks very large quantities that might destabilize the training process.
The choice of the error quantification function can affect the reward function and subsequently the performance of the algorithm. Equation (1) utilizes a weighted squared error formula for the objective function. The weighted squared error is used since it is the common choice for history matching problems and it provides the agent with good feedback that quantifies the quality of its action. However, a weighted root-squared error might provide better error quantification for the algorithm, as it would yield a more uniform reward function, unlike the squared error, which penalizes larger errors more severely and can lead to a non-uniform reward function.
The reward function defined in Equation (4) is not enough on its own; it needs to be complemented by termination criteria for when the agent takes an action that reduces the objective function to the accepted tolerance level. To address this, in this research paper, the agent is assigned a large constant reward (i.e., 50,000) upon such a successful episode termination. In addition, in order to save computing power and incentivize the agent to take actions that speed up the convergence process, the environment sets a time limitation [39] on the maximum number of time-steps allowed per episode. Once the maximum number of time-steps is reached without meeting any termination criterion, the episode terminates with a constant negative reward (i.e., −20,000) if the agent does not find a solution. The maximum number of time-steps per episode is also a design parameter and should be chosen based on the problem at hand, but it must not be too small. A small number of time-steps per episode can hinder the agent from exploring the environment properly, because in reinforcement learning the agent sometimes picks actions that are not good at the current time-step but can lead to better rewards in future time-steps.
Furthermore, if the agent diverges away from the solution by making a sequence of bad decisions that cause the objective function to grow large, the episode is terminated with a large negative constant reward (i.e., −50,000) as soon as the objective function value becomes twice as large as the initial objective function. In addition to saving computing power, such a limitation on the value of the objective function punishes the agent so that it can learn from the mistake and avoid following such a trajectory in the future. The rewards for the termination criteria are also design parameters, and the engineers running the experiment should choose values appropriate to the problem at hand. The rewards should be chosen so that they incentivize the agent towards meeting the tolerance criteria as well as providing good feedback to the engineers so they can monitor the learning process.
This formulation of the history matching problem as a Markov Decision Process means that at each reinforcement learning time-step, a simulation run is required to compute the new objective function and assess the quality of the action taken by the agent.
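Putting the pieces of this subsection together, the following is a sketch of a Gym-style environment step implementing the reward of Equation (4) and the termination criteria described above; `run_simulator_and_objective` is a hypothetical helper that would write the permeability vector to the model, run OPM Flow, and evaluate Equation (1), and the reward constants simply echo the values quoted in the text.

```python
import numpy as np

SOLUTION_REWARD = 50_000      # bonus when the objective reaches the tolerance level
TIME_LIMIT_PENALTY = -20_000  # episode hit the maximum number of time-steps
DIVERGENCE_PENALTY = -50_000  # objective grew to twice its initial value

class HistoryMatchingEnv:
    """Sketch of the environment logic only; a full implementation would subclass
    gymnasium.Env and define observation_space/action_space for Stable Baselines 3."""

    def __init__(self, initial_state, tolerance, max_steps):
        self.initial_state = np.asarray(initial_state, dtype=float)
        self.tolerance = tolerance
        self.max_steps = max_steps

    def reset(self):
        self.state = self.initial_state.copy()
        self.prev_obj = run_simulator_and_objective(self.state)  # hypothetical helper
        self.initial_obj = self.prev_obj
        self.steps = 0
        return self.state

    def step(self, action):
        self.state = np.maximum(self.state + action, 0.0)   # apply action, clip to >= 0
        obj = run_simulator_and_objective(self.state)        # one forward simulation run
        reward = self.prev_obj - obj                          # Equation (4)
        self.prev_obj = obj
        self.steps += 1

        done = False
        if obj <= self.tolerance:                 # solution found
            reward += SOLUTION_REWARD
            done = True
        elif obj >= 2.0 * self.initial_obj:       # diverged too far from the start
            reward += DIVERGENCE_PENALTY
            done = True
        elif self.steps >= self.max_steps:        # time limit without a solution
            reward += TIME_LIMIT_PENALTY
            done = True
        return self.state, reward, done, {}
```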
4.3. Proximal Policy Optimization
In this work, we use the Proximal Policy Optimization (PPO) [40] algorithm to find multiple solutions to the history matching problem. The PPO algorithm is an actor–critic method in which the agent utilizes two deep neural networks, one for the actor and one for the critic, parameterized by $\theta$ and $\psi$, respectively, as shown in Figure 9. The actor network represents the policy and the critic network represents the value function.
One of the prevalent issues in policy gradient methods is destructively large policy updates. To overcome this issue, PPO suggests a simple surrogate objective that penalizes large changes in the policy update. Specifically, the objective function in PPO is as follows:

$$L_t(\theta, \psi) = \hat{\mathbb{E}}_t\left[ L^{CLIP}_t(\theta) - L^{VF}_t(\psi) \right],$$

where $\hat{\mathbb{E}}_t$ refers to the expectation operator, $L^{CLIP}_t(\theta)$ represents the clipped objective function for the policy network, and $L^{VF}_t(\psi)$ represents the error in estimating the value function. Specifically, for the policy loss, the clipped objective function is defined as

$$L^{CLIP}_t(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\Big( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],$$

where $r_t(\theta)$ is the probability ratio of the new policy to the old policy, computed as $r_t(\theta) = \pi_{\theta}(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t)$, $\epsilon$ is a clipping range variable set to 0.2 in the case study, and $\hat{A}_t$ represents the estimate of the advantage function, calculated as

$$\hat{A}_t = \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k} - V_{\psi}(s_t),$$

where $V_{\psi}(s_t)$ is the value function estimate given by the value network. Note that the advantage function at time $t$ is defined as the discounted sum of rewards (starting from $t$) minus a baseline; the value function at $s_t$ is often used as the baseline, as is done in PPO. The goal of the clipping operator is to prevent drastic policy updates based on the noisy estimate $\hat{A}_t$.
Further, $L^{VF}_t(\psi)$, which denotes the loss in estimating the value function, is defined as

$$L^{VF}_t(\psi) = \left( V_{\psi}(s_t) - \sum_{k=0}^{T-t} \gamma^{k}\, r_{t+k} \right)^{2},$$

where the $r_{t+k}$ are sample estimates collected from the environment and $T$ is the maximum number of steps. Recall that the value function $V_{\psi}(s_t)$ represents an estimate of the expected rewards for a given state $s_t$ at a given time $t$. This estimate is produced by the critic network, which, along with the policy network, is updated during training. Recall also that $\gamma$ is a discount factor which controls the trade-off between short-term and long-term rewards. Specifically, as $\gamma$ decreases, more emphasis is given to short-term rewards.
While seemingly complicated, the objective function described above is quite intuitive. Minimizing $L^{VF}_t(\psi)$ ensures that the critic's estimate of the value function is close to what the observed samples suggest. On the other hand, maximizing $L^{CLIP}_t(\theta)$, if we ignore the clipping, encourages policies with a better advantage, i.e., policies that choose relatively better actions. The clipping, as mentioned above, mitigates potentially destructive updates due to potentially huge values of $r_t(\theta)$.
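For intuition, the surrogate objective can be computed in a few lines for a batch of sampled probability ratios and advantage estimates; the sketch below mirrors the equations above and is not the Stable Baselines 3 implementation, and the value-loss weighting coefficient (vf_coef in Stable Baselines 3) is an added assumption.

```python
import numpy as np

def ppo_surrogate(ratio, advantage, value_pred, value_target, eps=0.2, vf_coef=0.5):
    """Clipped policy objective and value loss for one batch of sampled transitions."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    l_clip = np.minimum(unclipped, clipped).mean()       # clipped policy objective
    l_vf = np.mean((value_pred - value_target) ** 2)     # value-function loss
    return l_clip - vf_coef * l_vf                        # surrogate to be maximized
```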
Note that while PPO is used in this work, other actor–critic methods, such as A2C [42], which supports sampling from parallel environments, can also work with this problem formulation. However, PPO outperforms these alternatives in terms of speed and quality of the solutions found. In addition, PPO is sample-efficient, which is very important in the history matching problem because evaluating the objective function is costly.
In order to find multiple different solutions, it is important that we use a stochastic policy when we interact with the environment. Specifically, a stochastic policy will lead to a better exploration of the state-action space, and hence, it will find multiple solutions in each run.
The deep neural network design for the actor network $\pi_{\theta}$ is shown in Figure 10. It is composed of an input layer containing 27,000 neurons (equal to the number of uncertain parameters in SPE9); two hidden layers, each containing 4096 tanh-activated neurons; and an output layer that also contains 27,000 neurons, as it needs to take an action against 27,000 different uncertain values. The critic network $V_{\psi}$, as shown in Figure 11, has the same structure as the actor network except for the last layer, which has only one neuron, since it only estimates a scalar value.
The second dataset uses the same structure. However, since the problem has far fewer uncertain parameters (900 compared to 27,000), it uses 900 neurons in the input and output layers. In addition, the number of neurons in each hidden layer is 128 instead of the 4096 used for the first dataset.
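The layer sizes described above can be written down directly; the PyTorch sketch below is only illustrative of the architectures in Figures 10 and 11 (27,000-4096-4096-27,000 for the actor and 27,000-4096-4096-1 for the critic, with tanh activations), not the exact networks built internally by Stable Baselines 3.

```python
import torch.nn as nn

N_PARAMS = 27_000  # number of uncertain permeability values in SPE9

# Actor: maps the state vector (current permeabilities) to an action vector.
actor = nn.Sequential(
    nn.Linear(N_PARAMS, 4096), nn.Tanh(),
    nn.Linear(4096, 4096), nn.Tanh(),
    nn.Linear(4096, N_PARAMS),
)

# Critic: maps the state vector to a single scalar value estimate.
critic = nn.Sequential(
    nn.Linear(N_PARAMS, 4096), nn.Tanh(),
    nn.Linear(4096, 4096), nn.Tanh(),
    nn.Linear(4096, 1),
)
```

For the second dataset, N_PARAMS would be 900 and the hidden width 128.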
4.4. Parallel Reinforcement Learning History Matching
History matching, in general, is a serial problem: using optimization methods to find the minimum of the objective function via gradient descent makes it necessary to know the previous objective function value before taking the next step. In reinforcement learning, however, the agent learns how to find the minimum by interacting with the environment one step at a time, collecting these experiences, and then training the deep neural network agent. In essence, the deep neural network is trying to map states into actions, so the agent needs only the current state, the action taken, and the reward assigned to this action in order to tune its deep neural network weights and choose the decisions that will maximize the rewards. PPO allows the experiences to be collected into a batch every few time-steps and then sent to the agent for training, even if the episode is not completed.
The way in which actor–critic reinforcement learning trains and optimizes its agent gives us the chance to speed up the history matching process by training the agent in parallel and allowing it to collect experiences from running multiple scenarios of the history matching process at once. This can be done by allowing the agent to simultaneously interact with $N$ environments, observe $N$ different states, take $N$ different actions, and collect $N$ different rewards, as shown in Figure 12. For example, in the case of SPE9, the batch size is set to 192: the algorithm runs 192 experiments by interacting with the environment, collects these experiences into a batch, and then sends the batch to the agent to train its deep neural networks and update their weights. Instead of waiting for a single environment to collect 192 experiences, the algorithm launches eight environments in parallel, where each environment collects 24 experiences, which allows the agent to collect the 192 experiences in a much shorter time period.
The Stable Baselines 3 [43] implementation provides a wrapper that allows for the launch of $N$ CPU threads, where each thread works on an environment to collect experiences; these experiences are then sent to the GPU to train the agent's deep neural networks more quickly.
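A sketch of how the parallel collection could be wired up with Stable Baselines 3 [43] is shown below; `HistoryMatchingEnv` is the hypothetical environment sketched in Section 4.2 (its constructor argument here is illustrative), and the hyperparameter values (8 workers, 24 steps per worker for a 192-sample rollout, 20,000 total time-steps) follow the SPE9 setup described in the text.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

N_ENVS = 8  # parallel environments, each with its own working directory and input files

def make_env(rank):
    # Constructor arguments are illustrative; each worker needs its own copy of the
    # DATA/ECL/SMSPEC files so the episodes can run independently.
    return lambda: HistoryMatchingEnv(work_dir=f"env_{rank:02d}")

if __name__ == "__main__":
    vec_env = SubprocVecEnv([make_env(i) for i in range(N_ENVS)])
    model = PPO(
        "MlpPolicy",
        vec_env,
        n_steps=24,      # 24 steps per environment -> 24 x 8 = 192 samples per rollout
        batch_size=192,  # minibatch size used for the gradient updates
        verbose=1,
    )
    model.learn(total_timesteps=20_000)
```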
In addition to running multiple environments in parallel, within each thread running an environment to collect experiences, the environment itself can run the reservoir simulator in parallel to speed up the computation of the objective function, which is very common in reservoir simulation. For example, when running the experiment on SPE9, 4 MPI processes were used per environment.
However, it is very important to note that only the processes of collecting the experiences and running the reservoir simulator can be done in parallel; each environment runs each episode in a serial manner, and thus each environment must have its own disk directory with its own DATA, ECL, and SMSPEC files. For example, the state of environment 12 at time-step $t$ will depend on the state and the action of environment 12 at time-step $t-1$, and this dependency cannot be parallelized. The details of the algorithm are shown in the Algorithm 1 pseudocode.
Algorithm 1: Parallel Reinforcement Learning History Matching (pseudocode).
4.5. Reproducibility
One of the major challenges faced while dealing with deep neural networks is reproducibility [44,45], in which repeating the experiment may lead to a different result with a different run-time taken to find the solution. The reproducibility problem is caused by the random network weight initialization prior to training, in addition to the use of stochastic optimizers to optimize such weights. Another factor contributing to the reproducibility problem is the stochastic policy used by PPO, where a small random noise is added to the action suggested by the agent or, in some cases, some of the actions taken by the agent are sampled from a random noise matrix to ensure that the agent explores the environment efficiently to find multiple solutions to the history matching problem.
The reproducibility problem does not affect the outcome of the results on the first dataset, as all the solutions found meet the tolerance criteria. However, it becomes important when running the scalability test on the second dataset, because some runs might find a solution quickly while other runs may take a while to find one. For this reason, an approach similar to Alolayan et al. [46] was used to reduce the effects of the reproducibility problem, where each run-time measurement in the scalability test is repeated 10 times and the average of these 10 runs is considered the final result.
5. Results
Algorithm 1 was used to run 20,000 reinforcement learning time-steps in parallel on the first dataset to search the parameter space for solutions to the history matching problem. As shown in Figure 13, at the beginning of training, the algorithm behaved as expected of any reinforcement learning algorithm: it kept making wrong decisions and collecting negative rewards. As the agent spends more time interacting with the environment, it learns how to map states into actions that enable it to increase its rewards.
Per the definition of the reward function in Equation (4), as the agent tries to increase its rewards, it tunes the uncertain parameters and finds models that reduce the objective function. As the agent spends more time in the environment trying to maximize its reward, it reduces the objective function until it meets the tolerance criteria, and thus it finds a solution to the history matching problem. As shown in Figure 14, thanks to the stochastic policy the agent uses, it can find multiple different solutions as it takes different trajectories to better explore the environment.
The approach of using a stochastic policy may encourage the agent to take bad actions and waste computing power. However, it is necessary for the agent to do so in order to search for new, different solutions, as is always the case with the exploration–exploitation dilemma. Such behavior can be seen in the agent's learning progress in Figure 13 around time-step 13,000. Although the agent had found a few solutions and collected positive rewards by that time-step, it did not simply repeat its actions to maximize rewards. Instead, the agent explored the environment for new, different solutions, causing it to make decisions that resulted in smaller rewards than previously acquired.
In order to study the forecasting capabilities of the model realizations found by the artificial deep neural network agent, the quality of the models is then validated by running each model for an additional year that the agent did not train on. Figure 15 shows each realization's forecast compared to the truth model and the starting point. As shown in the figure, the realizations found by the agent exhibited good forecasting capabilities for the reservoir model and provided multiple production scenarios, where some models over-predicted and some models under-predicted. This can help engineers run uncertainty quantification analysis with a higher degree of confidence. When using this algorithm, it is recommended to reserve a portion of the historical data for validation in order to assess the quality of the solutions found by the agent. Additionally, the validation period can be used by the engineers to further filter out unwanted solutions based on their own criteria.
To examine and measure how Algorithm 1 scales with the number of available computing resources, a scalability test was conducted on the second dataset. The second dataset was used because it has fewer uncertain parameters and a smaller initial objective function, making the problem easier to solve and resulting in a shorter run-time. A reasonable run-time is necessary for the scalability test, as it requires the experiments to be repeated tens of times for reproducibility purposes, as discussed in the Methodology section. Figure 16 shows the capacity of Algorithm 1 to reduce the run-time when doubling the number of available computing resources. On average, every time the computing resources are doubled, a 41% reduction in run-time is achieved. When using 16 environments instead of one, Algorithm 1 reduced the total run-time from 18.6 min to 2.2 min, achieving an 88% reduction in the time needed to find one solution.
As shown in Figure 17, Algorithm 1 achieved a speed up of 8.45 when using 16 environments compared to one environment. The algorithm scored an average speed up of 1.57 when doubling the number of resources, compared to a speed up of 2 in the ideal (and unrealistic) scenario in which an algorithm achieves a speed up of 2 every time the resources are doubled. The algorithm achieved a maximum speed up of 1.88 when running on two environments instead of one, and a minimum speed up of 1.51 when running on eight environments instead of four.
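For reference, the headline end-to-end figures quoted above are mutually consistent:

$$\frac{18.6\ \text{min}}{2.2\ \text{min}} \approx 8.45, \qquad 1 - \frac{2.2}{18.6} \approx 0.88,$$

i.e., the 8.45-fold speed up corresponds to the reported 88% reduction in run-time.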
A major advantage of the parallelization procedure employed by Algorithm 1 is that the number of reinforcement learning time-steps (i.e., the number of forward simulation runs) did not increase significantly as the number of computing resources increased, as shown in Figure 18. This property indicates that the algorithm can scale efficiently when computing resources are added. As shown in the figure, it takes the algorithm around 2000 time-steps to find a minimum; instead of taking 18.6 min running on one environment, it only took 2.2 min when running on 16 environments, while the number of simulation runs increased only slightly from around 1990 to around 2133.
6. Discussion
By reformulating the history matching problem from a mathematical least-squares optimization problem into a Markov Decision Process, the suggested algorithm provides a way to solve the history matching problem in parallel, where adding more computing resources speeds up the convergence process. In addition, the suggested algorithm can be used to find multiple, different solutions to the history matching problem, allowing for better forecasting uncertainty analysis. By creating multiple environments and allowing the artificial neural network agent to sample from these environments simultaneously, the algorithm had the capacity to tune 27,000 uncertain parameters. Such capacity allowed the agent to find multiple, different history-matched models in a timely manner.
The multiple, different solutions shown in Figure 14 do not necessarily have to be 9 solutions, as more solutions can be found. In fact, there are infinitely many solutions to the problem [4], and if more solutions are desired to better assess the uncertainty, the engineer running the experiment can run Algorithm 1 for a longer period of time. Giving the agent more time to learn from the environment will allow it to find more solutions. Training the agent for a longer time to find more solutions does not require the training process to be repeated from scratch: the agent's knowledge can be re-used by saving the deep neural network models to disk and re-loading them later to resume training if the number of solutions found is not sufficient.
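With Stable Baselines 3, saving and later resuming the agent is straightforward; the file name and the additional time-step budget below are illustrative, and `model` and `vec_env` refer to the objects from the earlier sketch.

```python
from stable_baselines3 import PPO

# After an initial training run, persist both the actor and critic weights.
model.save("ppo_history_matching")

# Later: reload the saved agent and continue training to search for more solutions.
model = PPO.load("ppo_history_matching", env=vec_env)
model.learn(total_timesteps=10_000, reset_num_timesteps=False)
```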
The results also show that the agent learns how to interact with the environment in order to maximize its rewards: in the beginning, the algorithm is not able to find any solutions until it learns how to map the current states of the environment into actions that reduce the objective function, eventually finding a solution. Naturally, the more complex the problem (i.e., the more uncertain parameters need to be tuned), the longer the training needed. This is because the agent needs more samples to explore and learn from the environment. In addition, the deep neural networks used to map states into actions will be larger, as explained in the Methodology section, and thus need more data to be able to successfully find multiple solutions to the history matching problem.
It is also important to note that as the number of uncertain parameters increases, the training process becomes more difficult. Such difficulty is expected, as the higher the number of uncertain parameters, the larger and more complex the deep neural network needed to handle the problem. A larger network can cause instability in training and requires a smaller learning rate, which slows down the algorithm. We found that adding a regularization term to the objective function not only helps mitigate the effect of ill-posedness on the problem, but also helps stabilize the learning process and allows a larger learning rate to be used, which can speed up the convergence of the algorithm.
Not only can the number of parameters affect the speed of convergence, the starting point also plays an important role. Naturally, a model with a small initial objective function of 100,000 will converge faster than a model with an initial objective function of 500,000. This behavior also occurs when using optimization algorithms to solve the problem, where the closer the starting point is to a solution, the faster the convergence. Faster convergence can also be achieved if the tolerance criterion is loosened, as is the case with other algorithms as well [47]. Reducing the historical mismatch by more than 99% when tuning 27,000 uncertain parameters will always be difficult, as it requires most algorithms to draw more samples from the reservoir model parameters. As with other algorithms, a trade-off between accuracy and speed needs to be considered. Moreover, faster convergence may also be achieved if the maximum number of time-steps per episode is set to a smaller value. However, this might hinder the algorithm's ability to explore the parameter space efficiently, as the agent sometimes takes a few bad actions on purpose in order to search for new solutions.
The run-time might be reduced by reducing the number of uncertain parameters using sensitivity analysis, where the cells that have the least effect on the change of the objective function can be ignored and excluded from the uncertain parameters vector, just as is done with inactive cells. The number of uncertain parameters can also be reduced using permeability multipliers, where instead of tuning each single permeability value, a group of constant multipliers is tuned, with each multiplier used to multiply the permeability values in certain cells. However, for both datasets used in this research paper, no multipliers are used, in order to show the capacity of the algorithm to tune thousands of parameters. Another approach to reduce the run-time is the use of proxy models [6,48,49], where the reservoir simulator $f(u)$ is replaced by a fast approximate model $\hat{f}(u)$ that can map the input of the reservoir simulator to an output within some acceptable accuracy.
A major advantage of the suggested algorithm is its flexibility in terms of the parameter space. Unlike the Ensemble Kalman Filter (EnKF), it does not require the parameter space to have a Gaussian distribution. Another advantage is its versatility: while some optimization algorithms might diverge away from the solution if the initial objective function is large, this algorithm is less prone to divergence when starting from an initial guess that is reasonably far from any solution, provided it is given enough time. This is mainly because the algorithm restarts itself from the starting point upon divergence, and, thanks to the shaping of the reward function, the agent is punished severely for diverging. Punishing the agent for diverging away from the solution reduces the probability of it choosing such actions in the future, enabling the agent to learn from these mistakes and not repeat the actions that lead to divergence.
It is important to note that when using a very large number of computing resources across hundreds of environments, the algorithm scalability will be lower compared to the case when utilizing tens of environments. This is due to the fact that the batch size (1024, 512, etc.) may not be big enough to divide the work efficiently amongst the environments. When this occurs, the batch size would have to be fixed when increasing the number of environments causing PPO to be less sample efficient as resources increase. In that case, a good speed up is still expected to be achieved when adding more computing resources but Algorithm 1 might not scale very well compared to its scaling performance when using tens of environments. When dealing with a very large model that requires hundreds of environments, engineers will have to spend some time adjusting the hyperparameters and ensuring efficient workload for each environment in order to achieve high scalability.
One of the major challenges arising from using Algorithm 1 is the number of hyperparameters related to each experiment. In addition to the usual deep neural network hyperparameters (batch size, learning rate, number of neurons, etc.) and reinforcement learning hyperparameters ($\gamma$, maximum time-steps, etc.), the experiment itself has its own hyperparameters, such as $\delta$ and the maximum number of time-steps per episode, as these parameters are problem-specific and depend on the reservoir model at hand. The engineer using the suggested algorithm might have to spend some time finding parameter values that make the algorithm converge faster.
The vector containing all uncertain parameters $u$ does not have to contain only permeability values. Any uncertain parameter, for example, the porosity $\phi$, can also be included in $u$. However, the engineer running the experiment must then also adjust the action space to take into account the scale of change between different properties at each reinforcement learning time-step, where the maximum change allowed for a permeability value might differ from the maximum change allowed for the porosity $\phi$.
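One way to handle different scales is to give each block of the action vector its own bounds; the sketch below uses a Gymnasium Box space with per-entry limits, and the $\delta$ values shown are purely illustrative.

```python
import numpy as np
from gymnasium.spaces import Box

n_cells = 900
delta_perm = 10.0   # illustrative maximum per-step change for permeability (mD)
delta_poro = 0.01   # illustrative maximum per-step change for porosity (fraction)

low = np.concatenate([np.full(n_cells, -delta_perm),
                      np.full(n_cells, -delta_poro)]).astype(np.float32)
action_space = Box(low=low, high=-low)  # per-entry bounds, one block per property
```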
A future research opportunity lies in extending this framework using Multi-Agent Reinforcement Learning (MARL) [50,51,52] instead of using a single agent across multiple environments. As shown in Figure 19, multiple agents can work collaboratively towards reducing the objective function, where each agent is responsible for a section of the reservoir model. Each agent can observe the entire environment, or just a part of it, to take actions. Such an approach can allow researchers to tackle very large models, because each agent would have to map an easier function with a smaller number of actions. The small number of actions per agent would result in smaller deep neural networks, requiring less training time.
However, such an approach comes with its own difficulties, chief among them designing an appropriate reward function. The most difficult part would be figuring out a way to reward good actions and punish bad ones, since at each time-step it would be hard to identify which agents took good actions and which took bad ones from a single reward function such as the one used in Equation (4).