Comparing Reinforcement Learning Methods for Real-Time Optimization of a Chemical Process
Abstract
1. Introduction
1.1. Motivation
1.2. Literature Review
1.3. Contributions
- This work presents a first-of-its-kind comparison of PPO against other methods for real-time optimization. Specifically, PPO is compared to maximum-production operation (no optimization), optimization using a machine learning model (an artificial neural network) with particle swarm optimization (ANN-PSO), and optimization using a first principles model with gradient-based nonlinear programming (FP-NLP).
- Our results demonstrate that PPO increases profitability by 16% compared to no optimization. It also outperforms ANN-PSO by 0.6% and nearly matches the FP-NLP method, reaching 99.9% of FP-NLP profits.
- Although more time must be invested in training the system, PPO reduces online computational times, computing solutions 10 and 10,000 times faster than FP-NLP and ANN-PSO, respectively.
- ANN-PSO has higher training efficiency than PPO: ANN-PSO converges with ≈ training examples, whereas PPO requires ≈ training examples to converge to an optimal policy.
- Parity plots suggest ANN-PSO’s lower profits are due to PSO exploiting errors in the ANN, causing ANN-PSO to consistently overpredict profit and select suboptimal actions.
- Comparing PPO and ANN-PSO, PPO has better performance as shown by its higher profits and faster computational times. ANN-PSO has better applicability as shown by its higher training efficiency and capability to train on historical operational data.
2. Materials and Methods
2.1. Markov Decision Process
2.2. Artificial Neural Network with Particle Swarm Optimization (ANN-PSO)
2.3. Proximal Policy Optimization (PPO)
2.4. First Principles with Nonlinear Programming (FP-NLP)
3. Case Study
4. Evaluation
4.1. ANN-PSO Implementation
4.2. PPO Implementation
4.3. Benchmark Algorithms
4.4. Testing and Metrics
5. Results
5.1. PPO Achieves Higher Profits, but ANN-PSO Has Better Training Efficiency
5.2. Sensitivity Analysis
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Nomenclature
Acronyms | |
ANN | Artificial neural network |
ANN-PSO | Artificial neural network with particle swarm optimization |
CSTR | Continuously stirred tank reactor |
FP-NLP | First principles with nonlinear programming |
MDP | Markov decision process |
NLP | Nonlinear programming |
PPO | Proximal policy optimization |
PSO | Particle swarm optimization |
Temp SP | Temperature set point |
MDP symbols | |
$\gamma$ | Factor to discount future rewards |
$\mathbb{R}$ | Real numbers |
A | Set of possible actions in the environment |
$a_t$ | Action at time t |
m | Dimension of action space A |
n | Dimension of state space S |
$P$ | Probability distribution for state $s_{t+1}$ given $s_t$ and $a_t$ |
$R$ | Probability distribution of rewards given $s_t$ and $a_t$ |
$r_t$ | Reward received at time t |
S | Set of possible states in the environment |
$s_t$ | State at time t |
t | Time after environment episode starts |
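For orientation, these symbols follow the standard MDP formulation of Sutton and Barto; a compact summary in that standard notation (not an equation quoted from this paper) is

$$ s_t \in S \subseteq \mathbb{R}^{n}, \quad a_t \in A \subseteq \mathbb{R}^{m}, \quad s_{t+1} \sim P(\,\cdot \mid s_t, a_t), \quad r_t \sim R(\,\cdot \mid s_t, a_t), $$

with the discounted return given by $\sum_{t=0}^{T} \gamma^{t} r_t$.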
ANN-PSO symbols | |
$\mathbb{E}[x]$ | Expected value function of x |
 | Number of times state and action sets are discretized for training data generation |
$\pi$ | Function which maps states to actions |
$Q_{\pi}(s, a)$ | Value of a state-action pair given policy $\pi$ |
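Read together, these symbols summarize the ANN-PSO decomposition: the ANN approximates the state-action value, and PSO performs the maximization over candidate actions. In standard notation (a hedged paraphrase, not an equation quoted from the paper),

$$ \pi(s_t) = \arg\max_{a \in A} Q_{\pi}(s_t, a), \qquad Q_{\pi}(s_t, a) \approx \mathrm{ANN}(s_t, a). $$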
PPO symbols | |
$\epsilon$ | Coefficient to limit policy updates in $L^{CLIP}$ |
$r_t(\theta)$ | Trajectory probability ratio between new and old policies |
$\hat{A}_t$ | Generalized estimated advantage of parameters $\theta$ over $T$ time steps |
$\lambda$ | Factor for bias vs. variance trade-off |
$\tau$ | Trajectory starting from state $s_0$ |
$\theta$ | Vector of policy and value neural network parameters |
$c_1$ | Coefficient to weight value function loss in $L^{CLIP+VF+S}$ |
$c_2$ | Coefficient to weight entropy bonus in $L^{CLIP+VF+S}$ |
$L^{CLIP+VF+S}$ | Final objective function to maximize $L^{CLIP}$, minimize the $L^{VF}$ mean, and encourage exploration |
$L^{CLIP}$ | Clipped objective function to prevent excessively large updates |
$L^{VF}$ | Mean squared error between $V_{\theta}(s_t)$ and $\hat{R}_t$ |
$R(\tau)$ | Discounted reward of trajectory $\tau$ |
$S[\pi_\theta]$ | Entropy bonus function for exploration encouragement |
$T$ | Time length of trajectories |
$V_{\theta}(s_t)$ | Policy expected cumulative reward from time t to time $T$, starting from state $s_t$, given parameters $\theta$ |
$J(\theta)$ | Expected episode cumulative discounted reward given policy $\pi$ with parameters $\theta$ |
$\hat{R}_t$ | Observed cumulative reward from time t to the end of the episode |
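These symbols correspond to the objective of Schulman et al. (Proximal Policy Optimization Algorithms); its standard form, reproduced here for orientation rather than quoted from this paper, is

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right], $$

$$ L^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\!\left[ L^{CLIP}(\theta) - c_1\, L^{VF}(\theta) + c_2\, S[\pi_\theta](s_t) \right], \qquad L^{VF}(\theta) = \big( V_{\theta}(s_t) - \hat{R}_t \big)^2 . $$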
Case study symbols | |
$\Delta H_{rxn}$ | Heat of reaction |
$F$ | Total flow rate |
$F_i$ | Flow rate of species i |
$k_0$ | Arrhenius constant |
$\rho$ | Density of all species |
$\sigma$ | Relative standard deviation for stochastic parameters |
c | Valve constant |
$C_i$ | Concentration of species i |
$C_p$ | Specific heat capacity of all species |
$C_{i,f}$ | Feed concentration of species i |
 | Cost per time |
$E_a$ | Reaction activation energy |
h | Height of CSTR |
k | Reaction constant |
N | Number of times simulation is repeated for one set of prices and actions |
$P_i$ | Price of commodity i |
$P_{i,high}$ | High price of commodity i |
$P_{i,low}$ | Low price of commodity i |
q | Heat input rate |
R | Universal gas constant |
$r$ | Reaction rate |
T | Temperature of CSTR contents |
$T_f$ | Feed temperature |
$T_{UL}$ | Upper limit on CSTR temperature |
V | Volume of CSTR |
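Several of these symbols are related through the standard Arrhenius expression for the reaction constant; as a hedged reminder of that textbook relationship (the specific rate law and reaction order used in the case study are not restated here),

$$ k = k_0 \exp\!\left( -\frac{E_a}{R\,T} \right). $$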
Appendix A. Hyperparameter Tables
Table A1. ANN-PSO hyperparameters (a hedged configuration sketch using these settings follows the table).
Hyperparameter | Value |
---|---|
Adam learning rate | |
Adam exponential decay rate for the first-moment estimates | 0.9 |
Adam exponential decay rate for the second-moment estimates | 0.999 |
Adam small number to prevent any division by zero | |
ANN layers | 1 |
ANN nodes | 64 |
ANN activation function | ReLU |
ANN L1 norm regularization coefficient | |
ANN Loss function | Mean squared error |
PSO cognitive parameter | |
PSO social parameter | |
PSO inertia parameter | 0.9 |
PSO swarm size | 100 |
PSO lower velocity bound | |
PSO upper velocity bound | 1 |
PSO iterations | 100 |
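To make the settings in Table A1 concrete, the following is a minimal sketch of how they could be wired together with the Keras and pyswarms libraries cited in the references. The input dimensions, bounds, Adam learning rate, L1 coefficient, PSO cognitive/social parameters, and lower velocity bound are placeholders, since those values are missing from the extracted table; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from tensorflow import keras
import pyswarms as ps

def build_ann(n_inputs):
    """ANN profit model per Table A1: one hidden layer, 64 ReLU nodes, L1 regularization, MSE loss."""
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(n_inputs,),
                           kernel_regularizer=keras.regularizers.l1(1e-4)),  # placeholder L1 coefficient
        keras.layers.Dense(1),
    ])
    model.compile(loss="mse",
                  optimizer=keras.optimizers.Adam(learning_rate=1e-3,        # placeholder learning rate
                                                  beta_1=0.9, beta_2=0.999))
    return model

def pso_best_action(model, state, lower, upper):
    """Search the action space with PSO (Table A1: 100 particles, 100 iterations, inertia 0.9)."""
    def negative_profit(actions):
        # actions has shape (n_particles, action_dim); predict profit for each candidate action
        states = np.tile(state, (actions.shape[0], 1))
        return -model.predict(np.hstack([states, actions]), verbose=0).ravel()

    optimizer = ps.single.GlobalBestPSO(
        n_particles=100,
        dimensions=lower.size,
        options={"c1": 0.5, "c2": 0.3, "w": 0.9},  # c1/c2 placeholders; w = 0.9 from Table A1
        bounds=(lower, upper),
        velocity_clamp=(-1.0, 1.0),                # upper bound 1 from Table A1; lower bound assumed
    )
    _, best_action = optimizer.optimize(negative_profit, iters=100)
    return best_action
```

At each optimization interval, the current state would be passed to pso_best_action to select the action that maximizes the ANN-predicted profit.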
Table A2. PPO hyperparameters (a hedged Stable Baselines configuration sketch using these settings follows the table).
Hyperparameter | Value |
---|---|
Policy Value Shared Network layers | 2 Fully connected layers + 1 LSTM layer |
Policy Value Shared Network structure | [64,64,128] |
gamma ($\gamma$) | 0 |
n_steps | 100 |
ent_coef ($c_2$) | 0.01 |
learning_rate | |
vf_coef ($c_1$) | 0.5 |
max_grad_norm | 0.5 |
lam ($\lambda$) | 0.95 |
nminibatches | 4 |
noptepochs | 4 |
cliprange ($\epsilon$) | 0.2 |
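Similarly, the following is a hedged sketch of how the Table A2 settings map onto the Stable Baselines PPO2 interface cited in the references. The environment id "CstrEnv-v0", the learning rate, the total training timesteps, and the policy_kwargs mapping of the [64, 64, 128] shared structure are assumptions made for illustration; the paper's actual environment and network construction may differ.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

# "CstrEnv-v0" is a placeholder id for the CSTR environment of Section 3.
# Recurrent policies in Stable Baselines require the number of parallel
# environments to be a multiple of nminibatches, hence four copies here.
env = DummyVecEnv([lambda: gym.make("CstrEnv-v0") for _ in range(4)])

model = PPO2(
    MlpLstmPolicy,
    env,
    gamma=0.0,               # discount factor; Table A2 lists a value of 0
    n_steps=100,
    ent_coef=0.01,           # c2, entropy bonus weight
    learning_rate=2.5e-4,    # placeholder; value missing from the extracted table
    vf_coef=0.5,             # c1, value function loss weight
    max_grad_norm=0.5,
    lam=0.95,                # lambda, bias vs. variance trade-off
    nminibatches=4,
    noptepochs=4,
    cliprange=0.2,           # epsilon, clipping coefficient
    policy_kwargs={"net_arch": [64, 64, "lstm"], "n_lstm": 128},  # assumed mapping of the shared network
)
model.learn(total_timesteps=100_000)  # placeholder training budget
```

Training would run offline in this way, with the learned policy then queried online for set point decisions.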
References
- Dotoli, M.; Fay, A.; Miskowicz, M.; Seatzu, C. A survey on advanced control approaches in factory automation. IFAC-PapersOnLine 2015, 28, 394–399.
- Tuttle, J.F.; Vesel, R.; Alagarsamy, S.; Blackburn, L.D.; Powell, K. Sustainable NOx emission reduction at a coal-fired power station through the use of online neural network modeling and particle swarm optimization. Control Eng. Pract. 2019, 93, 104167.
- Petsagkourakis, P.; Sandoval, I.O.; Bradford, E.; Zhang, D.; del Rio-Chanona, E.A. Reinforcement learning for batch bioprocess optimization. Comput. Chem. Eng. 2020, 133, 106649.
- Sheha, M.; Powell, K. Using Real-Time Electricity Prices to Leverage Electrical Energy Storage and Flexible Loads in a Smart Grid Environment Utilizing Machine Learning Techniques. Processes 2019, 7, 870.
- Sheha, M.; Mohammadi, K.; Powell, K. Solving the Duck Curve in a Smart Grid Environment Using a Non-Cooperative Game Theory and Dynamic Pricing Profiles. Energy Convers. Manag. 2020, 220, 113102.
- Sheha, M.; Mohammadi, K.; Powell, K. Techno-economic analysis of the impact of dynamic electricity prices on solar penetration in a smart grid environment with distributed energy storage. Appl. Energy 2021, 282, 116168.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144.
- Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
- Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with large scale deep reinforcement learning. arXiv 2019, arXiv:1912.06680.
- Argyros, I.K. Undergraduate Research at Cameron University on Iterative Procedures in Banach and Other Spaces; Nova Science Publishers: Hauppauge, NY, USA, 2019.
- Akkaya, I.; Andrychowicz, M.; Chociej, M.; Litwin, M.; McGrew, B.; Petron, A.; Paino, A.; Plappert, M.; Powell, G.; Ribas, R.; et al. Solving Rubik’s Cube with a Robot Hand. Unpublished. Available online: http://xxx.lanl.gov/abs/1910.07113 (accessed on 20 March 2020).
- Buyukada, M. Co-combustion of peanut hull and coal blends: Artificial neural networks modeling, particle swarm optimization and Monte Carlo simulation. Bioresour. Technol. 2016, 216, 280–286.
- Blackburn, L.D.; Tuttle, J.F.; Powell, K.M. Real-time optimization of multi-cell industrial evaporative cooling towers using machine learning and particle swarm optimization. J. Clean. Prod. 2020, 271, 122175.
- Naserbegi, A.; Aghaie, M. Multi-objective optimization of hybrid nuclear power plant coupled with multiple effect distillation using gravitational search algorithm based on artificial neural network. Therm. Sci. Eng. Prog. 2020, 19, 100645.
- Head, J.D.; Lee, K.Y. Using artificial neural networks to implement real-time optimized multi-objective power plant control in a multi-agent system. IFAC Proc. Vol. 2012, 8, 126–131.
- Bhattacharya, S.; Dineshkumar, R.; Dhanarajan, G.; Sen, R.; Mishra, S. Improvement of ϵ-polylysine production by marine bacterium Bacillus licheniformis using artificial neural network modeling and particle swarm optimization technique. Biochem. Eng. J. 2017, 126, 8–15.
- Khajeh, M.; Dastafkan, K. Removal of molybdenum using silver nanoparticles from water samples: Particle swarm optimization-artificial neural network. J. Ind. Eng. Chem. 2014, 20, 3014–3018.
- Dhanarajan, G.; Mandal, M.; Sen, R. A combined artificial neural network modeling-particle swarm optimization strategy for improved production of marine bacterial lipopeptide from food waste. Biochem. Eng. J. 2014, 84, 59–65.
- Ghaedi, M.; Ghaedi, A.M.; Ansari, A.; Mohammadi, F.; Vafaei, A. Artificial neural network and particle swarm optimization for removal of methyl orange by gold nanoparticles loaded on activated carbon and Tamarisk. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2014, 132, 639–654.
- Khajeh, M.; Kaykhaii, M.; Hashemi, S.H.; Shakeri, M. Particle swarm optimization-artificial neural network modeling and optimization of leachable zinc from flour samples by miniaturized homogenous liquid-liquid microextraction. J. Food Compos. Anal. 2014, 33, 32–38.
- Nezhadali, A.; Shadmehri, R.; Rajabzadeh, F.; Sadeghzadeh, S. Selective determination of closantel by artificial neural network-genetic algorithm optimized molecularly imprinted polypyrrole using UV–visible spectrophotometry. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2020, 243, 118779.
- Abdullah, S.; Chandra Pradhan, R.; Pradhan, D.; Mishra, S. Modeling and optimization of pectinase-assisted low-temperature extraction of cashew apple juice using artificial neural network coupled with genetic algorithm. Food Chem. 2020, 127862.
- Bagheri-Esfeh, H.; Safikhani, H.; Motahar, S. Multi-objective optimization of cooling and heating loads in residential buildings integrated with phase change materials using the artificial neural network and genetic algorithm. J. Energy Storage 2020, 32, 101772.
- Ilbeigi, M.; Ghomeishi, M.; Dehghanbanadaki, A. Prediction and optimization of energy consumption in an office building using artificial neural network and a genetic algorithm. Sustain. Cities Soc. 2020, 61, 102325.
- Solís-Pérez, J.E.; Gómez-Aguilar, J.F.; Hernández, J.A.; Escobar-Jiménez, R.F.; Viera-Martin, E.; Conde-Gutiérrez, R.A.; Cruz-Jacobo, U. Global optimization algorithms applied to solve a multi-variable inverse artificial neural network to improve the performance of an absorption heat transformer with energy recycling. Appl. Soft Comput. J. 2019, 85, 105801.
- Filipe, J.; Bessa, R.J.; Reis, M.; Alves, R.; Póvoa, P. Data-driven predictive energy optimization in a wastewater pumping station. Appl. Energy 2019, 252, 113423.
- Zhou, S.; Hu, Z.; Gu, W.; Jiang, M.; Chen, M.; Hong, Q.; Booth, C. Combined heat and power system intelligent economic dispatch: A deep reinforcement learning approach. Int. J. Electr. Power Energy Syst. 2020, 120, 106016.
- Zhang, B.; Hu, W.; Cao, D.; Huang, Q.; Chen, Z.; Blaabjerg, F. Deep reinforcement learning-based approach for optimizing energy conversion in integrated electrical and heating system with renewable energy. Energy Convers. Manag. 2019, 202, 112199.
- Rummukainen, H.; Nurminen, J.K. Practical reinforcement learning—Experiences in lot scheduling application. IFAC-PapersOnLine 2019, 52, 1415–1420.
- Hofstetter, J.; Bauer, H.; Li, W.; Wachtmeister, G. Energy and Emission Management of Hybrid Electric Vehicles using Reinforcement Learning. IFAC-PapersOnLine 2019, 52, 19–24.
- Philipsen, M.P.; Moeslund, T.B. Intelligent injection curing of bacon. Procedia Manuf. 2019, 38, 148–155.
- Xiong, W.; Lu, Z.; Li, B.; Wu, Z.; Hang, B.; Wu, J.; Xuan, X. A self-adaptive approach to service deployment under mobile edge computing for autonomous driving. Eng. Appl. Artif. Intell. 2019, 81, 397–407.
- Pi, C.H.; Hu, K.C.; Cheng, S.; Wu, I.C. Low-level autonomous control and tracking of quadrotor using reinforcement learning. Control Eng. Pract. 2020, 95, 104222.
- Machalek, D.; Quah, T.; Powell, K.M. Dynamic Economic Optimization of a Continuously Stirred Tank Reactor Using Reinforcement Learning. In Proceedings of the 2020 American Control Conference (ACC), Denver, CO, USA, 1–3 July 2020; pp. 2955–2960.
- Hubbs, C.D.; Li, C.; Sahinidis, N.V.; Grossmann, I.E.; Wassick, J.M. A deep reinforcement learning approach for chemical production scheduling. Comput. Chem. Eng. 2020, 141, 106982.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M.A. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602.
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95—International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
- Grossmann, I.E.; Apap, R.M.; Calfa, B.A.; García-Herreros, P.; Zhang, Q. Recent advances in mathematical programming techniques for the optimization of process systems under uncertainty. Comput. Chem. Eng. 2016, 91, 3–14.
- Biegler, L.T.; Yang, X.; Fischer, G.A. Advances in sensitivity-based nonlinear model predictive control and dynamic real-time optimization. J. Process Control 2015, 30, 104–116.
- Diehl, M.; Bock, H.G.; Schlöder, J.P.; Findeisen, R.; Nagy, Z.; Allgöwer, F. Real-time optimization and nonlinear model predictive control of processes governed by differential-algebraic equations. J. Process Control 2002, 12, 577–585.
- Chollet, F.; Rahman, F.; Lee, T.; Marmiesse, G.; Zabluda, O.; Santana, E.; McColgan, T.; Snelgrove, X.; Branchaud-Charron, F.; Oliver, M.; et al. Keras. 2015. Available online: https://keras.io (accessed on 26 March 2020).
- Miranda, L. Pyswarms. 2017. Available online: https://github.com/ljvmiranda921/pyswarms (accessed on 13 April 2020).
- Hill, A.; Raffin, A.; Ernestus, M.; Gleave, A.; Kanervisto, A.; Traore, R.; Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; et al. Stable Baselines. 2018. Available online: https://github.com/hill-a/stable-baselines (accessed on 26 March 2020).
- Beal, L.; Hill, D.; Martin, R.; Hedengren, J. GEKKO Optimization Suite. Processes 2018, 6, 106.
- Fujimoto, S.; Van Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. arXiv 2018, arXiv:1802.09477.
- Powell, K.M.; Machalek, D.; Quah, T. Real-Time Optimization using Reinforcement Learning. Comput. Chem. Eng. 2020, 143, 107077.
Parameter | Value | Unit |
---|---|---|
c | 1.2 | |
h | 8 | |
0.23 | ||
5000 | ||
R | 8.314 | |
780 | ||
3.25 | ||
15 | ||
15 | ||
500 | ||
800 | ||
V | 100.53 | |
156 | ||
2 | ||
20 | ||
5 | ||
50 | ||
2 | ||
20 | ||
0 | ||
0.1 | ||
N | 120 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).