Reinforcement Learning-Based Hybrid Multi-Objective Optimization Algorithm Design
Abstract
1. Introduction
- complicated when there is a non-trivial number (usually more than two) of known cause–effect relations;
- complex when there are at least some relevant cause–effect relations that are not subject to existing experience, or that are not accessible in closed analytical form,
2. RLhybMOO Process Structure and Algorithm
- The Action Space A provides the set of actions available to the RLhybMOO control policy, i.e., its base of selectable action alternatives. The elements of the Action Space A are predefined by the user. Each element represents an elementary MOO sampling strategy applicable to the problem at hand. Each individual sampling algorithm (also called an infill algorithm) yields a dedicated number of sampling points within the design space X (domain), proposed for the next round of (computationally expensive) black-box evaluations that yield result vectors within the target space Y (co-domain). Usually, the logic of a sampling strategy is based on the previous function evaluations provided by the MOO Problem Environment.
Algorithm 1 RLhybMOO policy training
 1: A ← {a_1, …, a_m}                         ▹ Define Action set A
 2: S(·)                                      ▹ Define State function
 3: r(·)                                      ▹ Define Reward function
 4: π ← π_0                                   ▹ Assign Policy Algorithm base
 5: while j < N_episodes do                   ▹ Repeat for episodes
 6:     X_0 ← LHCS(N_init)                    ▹ Set initial DoE
 7:     Y_0 ← f(X_0)                          ▹ Simulate at X_0 (expensive)
 8:     s_0 ← S(X_0, Y_0)                     ▹ Initial Environment State
 9:     while n < N_total do                  ▹ Until abortion criterion
10:         a_t ← π(s_t)                      ▹ Select Policy-based action
11:         X_a ← a_t(X_t, Y_t)               ▹ a_t-based sample definition (k points)
12:         Y_a ← f(X_a)                      ▹ Simulate at X_a (expensive)
13:         X_{t+1} ← X_t ∪ X_a               ▹ Extend sample set in domain
14:         Y_{t+1} ← Y_t ∪ Y_a               ▹ … and in co-domain
15:         s_{t+1} ← S(X_{t+1}, Y_{t+1})     ▹ Calculate new state
16:         r_{t+1} ← r(s_t, a_t, s_{t+1})    ▹ Calculate reward
17:         t ← t + 1                         ▹ Increase time step
18:         n ← n + k                         ▹ Update sampling counter
19:         e_t ← (s_t, a_t, r_{t+1}, s_{t+1}) ▹ Define experience
20:         D ← D ∪ {e_t}                     ▹ Update experience buffer
21:     end while
22:     j ← j + 1                             ▹ Update episode counter
23:     π ← update(π, D)                      ▹ Update Policy by experience buffer
24: end while
25: return π                                  ▹ Define RLhybMOO Policy
(A Python sketch of this training loop is given after the component descriptions below.)

- MOO Problem Environment: Black-box function evaluation tasks are answered by the (simulation-capable) model of the MOO Problem Environment. In addition, it provides descriptive state features at each time step t, as required for learning the hybrid MOO algorithm control policy. The variables employed to characterize an environment state, referred to as state features, must be observable in a quantifiable way and representative of the overall achieved quality of solving the MOO problem. To evaluate the value-adding contribution of the k function evaluations triggered by an individual action a_t applied at a state s_t at time t, a suitable reward function is provided. Overall, the MOO Problem Environment provides the k function evaluations requested by action a_t at state s_t, analyzes their value-add by the reward r_{t+1}, and calculates the consecutive state s_{t+1} of the progressing optimization process. Starting the training or evaluation process requires the definition of several optimization- and process-related parameters, specifically the total number of available samples N_total, the number of initial samples N_init, and the number of episodes N_episodes to train the policy. In the first step (episode) of each optimization process, an initial Latin Hypercube Sampling (LHCS) with N_init initial training points is executed to obtain a first information set (X_0, Y_0) and to calculate the initial state s_0. Subsequent states of the MOO Problem Environment are handed over to the RLhybMOO policy. At each time step t, the number of evaluated samples n is monitored. An individual training phase, and thereby an episode, ends at the particular time step at which the number of evaluated sample points n equals or exceeds the number of available sample points N_total.
- Control Policy: Finally, the desired RLhybMOO control policy π, resulting from the function block of the same name, represents a probability density over the actions of the action space A as a function of the internal state s_t, including a decision scheme for selecting one of the available actions. The sampling strategy chosen in this way triggers a search at k design points.
- RL Algorithm: To start the search for the RLhybMOO control policy, a problem-adequate RL algorithm family must be defined. It phrases the mathematical nature of the policy π and its base initialization π_0. During the training phase (indicated by dashed lines in Figure 1), the selected RL algorithm is fed with individual learning “experiences” e_t = (s_t, a_t, r_{t+1}, s_{t+1}), as gathered during an iterative episodic training process. The set of these experiences forms the experience memory D, the learning base for the control policy π, which is repeatedly updated after the completion of every episodic training iteration until the final number of allowed training episodes is reached.
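A minimal, self-contained Python sketch of the episodic training loop of Algorithm 1 follows. The toy bi-objective function, the two placeholder infill strategies, the two-feature state, and the preference-based softmax policy are illustrative assumptions standing in for the expensive simulation model, the full action space of Section 3.3, the state description of Section 3.1, and the actual RL algorithm; only the loop structure mirrors Algorithm 1.

```python
# Sketch of the RLhybMOO training loop (Algorithm 1) with illustrative placeholders.
# All names below are assumptions for illustration, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)

def blackbox(X):
    """Toy bi-objective 'expensive' evaluation (stand-in for the real simulator)."""
    f1 = X[:, 0]
    f2 = 1.0 - np.sqrt(np.clip(X[:, 0], 0, 1)) + X[:, 1:].sum(axis=1)
    return np.column_stack([f1, f2])

def lhcs(n, dim):
    """Simple Latin Hypercube Sampling for the initial DoE."""
    cut = (np.arange(n) + rng.random(n)) / n
    return np.column_stack([rng.permutation(cut) for _ in range(dim)])

# Elementary infill strategies forming the action space A (toy versions).
def uniform_random(X, Y, k):
    return rng.random((k, X.shape[1]))

def perturb_best(X, Y, k):
    best = X[np.argmin(Y.sum(axis=1))]              # crude aggregate-quality proxy
    return np.clip(best + 0.05 * rng.standard_normal((k, X.shape[1])), 0, 1)

ACTIONS = [uniform_random, perturb_best]

def state_features(Y, n, n_total):
    """Descriptive state: current best aggregate quality and remaining budget."""
    return np.array([Y.sum(axis=1).min(), 1.0 - n / n_total])

def reward(s_old, s_new):
    """Reward: improvement of the aggregate quality feature."""
    return s_old[0] - s_new[0]

# Process parameters (N_total, N_init, N_episodes in the paper's notation).
N_TOTAL, N_INIT, N_EPISODES, K, DIM = 40, 10, 3, 2, 2
experience_buffer = []                               # experience memory D
policy_prefs = np.zeros(len(ACTIONS))                # trivial preference "policy"

for episode in range(N_EPISODES):
    X = lhcs(N_INIT, DIM)
    Y = blackbox(X)                                  # expensive initial evaluations
    n = N_INIT
    s = state_features(Y, n, N_TOTAL)
    while n < N_TOTAL:
        # Softmax selection over action preferences (placeholder policy).
        p = np.exp(policy_prefs - policy_prefs.max())
        a = rng.choice(len(ACTIONS), p=p / p.sum())
        X_new = ACTIONS[a](X, Y, K)                  # action-based sample definition
        Y_new = blackbox(X_new)                      # expensive evaluations
        X, Y = np.vstack([X, X_new]), np.vstack([Y, Y_new])
        n += K
        s_next = state_features(Y, n, N_TOTAL)
        r = reward(s, s_next)
        experience_buffer.append((s, a, r, s_next))  # experience e_t
        s = s_next
    # Naive policy update from the experience buffer (stand-in for the RL update).
    for s_i, a_i, r_i, _ in experience_buffer:
        policy_prefs[a_i] += 0.1 * r_i

print("learned action preferences:", policy_prefs)
```

In the full method, the preference update on the last lines would be replaced by the RL algorithm of Sections 3.4 and 3.5, trained from the experience memory D.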
3. Experimental Setup
3.1. State Description
- A solution quality increase with respect to initial sampling,
- A solution quality increase by the last taken action,
- A stagnation counter,
- The number of identified Pareto-optimal solutions,
- The percentage of remaining vs. totally available samples (see the sketch after this list).
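A minimal sketch of assembling such a five-dimensional state vector, assuming that solution quality is measured by the dominated hypervolume of the current non-dominated front (the paper's exact quality measure and feature scaling are not restated in this excerpt); pareto_mask, hypervolume_2d, and state_vector are illustrative helper names.

```python
# Sketch of the five state features listed above, for a bi-objective minimization problem.
import numpy as np

def pareto_mask(Y):
    """Boolean mask of non-dominated points (minimization of all objectives)."""
    mask = np.empty(len(Y), dtype=bool)
    for i in range(len(Y)):
        dominated = np.all(Y <= Y[i], axis=1) & np.any(Y < Y[i], axis=1)
        mask[i] = not dominated.any()
    return mask

def hypervolume_2d(Y, ref):
    """Dominated hypervolume of a 2-D front w.r.t. a reference point."""
    front = np.unique(Y[pareto_mask(Y)], axis=0)
    front = front[np.argsort(front[:, 0])]          # ascending f1, descending f2
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        hv += max(ref[0] - f1, 0.0) * max(prev_f2 - f2, 0.0)
        prev_f2 = min(prev_f2, f2)
    return hv

def state_vector(Y, hv_init, hv_prev, stagnation, n_used, n_total, ref):
    hv = hypervolume_2d(Y, ref)
    return np.array([
        hv - hv_init,                 # quality increase w.r.t. initial sampling
        hv - hv_prev,                 # quality increase by the last taken action
        stagnation,                   # steps without improvement
        pareto_mask(Y).sum(),         # number of identified Pareto-optimal solutions
        1.0 - n_used / n_total,       # fraction of remaining samples
    ])

# Example: Y holds the objective values evaluated so far (minimization).
Y = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.9], [0.8, 0.2]])
print(state_vector(Y, hv_init=0.2, hv_prev=0.3, stagnation=1,
                   n_used=14, n_total=40, ref=np.array([1.0, 1.0])))
```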
3.2. Reward Function
3.3. Action Space
- a_1: Target Value Strategy (ParetoFill),
- a_2: Pareto Points Random Perturbation,
- a_3: Minimum Point Sampling (Single Optimization),
- a_4: Uniform Random Points over the Design Space,
- a_5: Surrogate MOO Genetic Algorithm (NSGA-II) (see the sketch after this list).
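An illustrative Python dispatch of this discrete action space. The strategy bodies are toy stand-ins (assumptions) chosen only to keep the sketch self-contained; in particular, a_5 would in practice fit surrogate models and run NSGA-II on them (e.g., via pymoo) rather than fall back to uniform sampling as it does here.

```python
# Illustrative registry of the five elementary infill strategies a_1-a_5.
import numpy as np

rng = np.random.default_rng(1)

def _pareto_mask(Y):
    """Non-dominated points under minimization."""
    return np.array([not (np.all(Y <= y, axis=1) & np.any(Y < y, axis=1)).any() for y in Y])

def target_value_fill(X, Y, k):
    """a_1: fill gaps on the current front by interpolating between Pareto designs."""
    P = X[_pareto_mask(Y)]
    i, j = rng.integers(len(P), size=(2, k))
    w = rng.random((k, 1))
    return w * P[i] + (1 - w) * P[j]

def pareto_perturbation(X, Y, k):
    """a_2: random perturbation of Pareto-optimal design points."""
    P = X[_pareto_mask(Y)]
    picks = P[rng.integers(len(P), size=k)]
    return np.clip(picks + 0.05 * rng.standard_normal(picks.shape), 0.0, 1.0)

def minimum_point_sampling(X, Y, k):
    """a_3: sample around the minimizer of a single objective."""
    obj = rng.integers(Y.shape[1])
    best = X[np.argmin(Y[:, obj])]
    return np.clip(best + 0.05 * rng.standard_normal((k, X.shape[1])), 0.0, 1.0)

def uniform_random(X, Y, k):
    """a_4: uniform random points over the (normalized) design space."""
    return rng.random((k, X.shape[1]))

def surrogate_nsga2(X, Y, k):
    """a_5: placeholder; a real implementation fits surrogates and runs NSGA-II on them."""
    return rng.random((k, X.shape[1]))

ACTION_SPACE = [target_value_fill, pareto_perturbation,
                minimum_point_sampling, uniform_random, surrogate_nsga2]

# Example dispatch: the policy returns an index; the chosen infill strategy proposes
# k new design points from the evaluated samples (X, Y).
X = rng.random((12, 2)); Y = np.column_stack([X[:, 0], 1 - X[:, 0] + X[:, 1]])
new_points = ACTION_SPACE[0](X, Y, k=3)
print(new_points.shape)   # (3, 2)
```

Every strategy shares the signature (X, Y, k) → k new design points, so the control policy only has to output an index into ACTION_SPACE.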
3.4. Policy Adaptation (RL Algorithm)
3.5. Parameter Settings
4. Results
5. Discussion
6. Conclusions and Outlook
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Snowden, D.J.; Boone, M.E. A leader’s framework for decision making. Harv. Bus. Rev. 2007, 85, 68.
- Renn, O.; Klinke, A.; Van Asselt, M. Coping with complexity, uncertainty and ambiguity in risk governance: A synthesis. Ambio 2011, 40, 231–246.
- Hwang, C.L.; Masud, A.S.M. Multiple Objective Decision Making—Methods and Applications: A State-of-the-Art Survey; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 164.
- Roijers, D.M.; Whiteson, S. Multi-objective decision making. Synth. Lect. Artif. Intell. Mach. Learn. 2017, 11, 1–129.
- Aruldoss, M.; Lakshmi, T.M.; Venkatesan, V.P. A survey on multi criteria decision making methods and its applications. Am. J. Inf. Syst. 2013, 1, 31–43.
- Deb, K. Multi-objective optimization. In Search Methodologies; Springer: Berlin/Heidelberg, Germany, 2014; pp. 403–449.
- Deb, K. Multi-Objective Evolutionary Algorithms. In Springer Handbook of Computational Intelligence; Kacprzyk, J., Pedrycz, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2015; pp. 995–1015.
- Sheng, L.; Ibrahim, Z.; Buyamin, S.; Ahmad, A.; Mohd Tumari, M.Z.; Mat Jusof, M.F.; Aziz, N. Multi-Objective Particle Swarm Optimization Algorithms—A Leader Selection Overview. Int. J. Simul. Syst. Sci. Technol. 2014, 15, 6–19.
- Alaya, I.; Solnon, C.; Ghedira, K. Ant Colony Optimization for Multi-Objective Optimization Problems. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2007), Patras, Greece, 29–31 October 2007; Volume 1, pp. 450–457.
- Murata, T.; Ishibuchi, H. MOGA: Multi-objective genetic algorithms. In Proceedings of the IEEE International Conference on Evolutionary Computation, Perth, WA, Australia, 29 November–1 December 1995; Volume 1, pp. 289–294.
- Hale, J.Q.; Zhou, E. A model-based approach to multi-objective optimization. In Proceedings of the 2015 Winter Simulation Conference (WSC), Huntington Beach, CA, USA, 6–9 December 2015; pp. 3599–3609.
- Jiang, P.; Zhou, Q.; Shao, X. Surrogate Model-Based Engineering Design and Optimization; Springer: Berlin/Heidelberg, Germany, 2020.
- Betrò, B.; Schoen, F. A stochastic technique for global optimization. Comput. Math. Appl. 1991, 21, 127–133.
- Emmerich, M.; Yang, K.; Deutz, A.; Wang, H.; Fonseca, C. Multicriteria Generalization of Bayesian Global Optimization. In Advances in Stochastic and Deterministic Global Optimization; Pardalos, P.M., Zhigljavsky, A., Žilinskas, J., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; Chapter A.
- Emmerich, M.; Klinkenberg, J.W. The computation of the expected improvement in dominated hypervolume of Pareto front approximations. Rapp. Tech. Leiden Univ. 2008, 34, 3–7.
- Sindhya, K.; Miettinen, K.; Deb, K. A hybrid framework for evolutionary multi-objective optimization. IEEE Trans. Evol. Comput. 2012, 17, 495–511.
- Müller, J. SOCEMO: Surrogate Optimization of Computationally Expensive Multiobjective Problems. INFORMS J. Comput. 2017, 29, 581–596.
- Gosavi, A. Reinforcement learning: A tutorial survey and recent advances. INFORMS J. Comput. 2009, 21, 178–192.
- Sievers, S.; Katz, M.; Sohrabi, S.; Samulowitz, H.; Ferber, P. Deep Learning for Cost-Optimal Planning: Task-Dependent Planner Selection. Proc. AAAI Conf. Artif. Intell. 2019, 33, 7715–7723.
- Röger, G.; Helmert, M. The More, the Merrier: Combining Heuristic Estimators for Satisficing Planning. Proc. Int. Conf. Autom. Plan. Sched. 2010, 20, 246–249.
- Speck, D.; Biedenkapp, A.; Hutter, F.; Mattmüller, R.; Lindauer, M. Learning Heuristic Selection with Dynamic Algorithm Configuration. In Proceedings of the International Conference on Automated Planning and Scheduling, Nancy, France, 14–19 June 2020.
- Biedenkapp, A.; Bozkurt, H.F.; Eimer, T.; Hutter, F.; Lindauer, M.T. Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework. In Proceedings of the ECAI, Santiago de Compostela, Spain, 29 August–8 September 2020; IOS Press: Amsterdam, The Netherlands, 2020.
- Adriaensen, S.; Biedenkapp, A.; Shala, G.; Awad, N.H.; Eimer, T.; Lindauer, M.T.; Hutter, F. Automated Dynamic Algorithm Configuration. arXiv 2022, arXiv:2205.13881.
- Zitzler, E.; Deb, K.; Thiele, L. Comparison of Multiobjective Evolutionary Algorithms: Empirical Results. Evol. Comput. 2000, 8, 173–195.
- Bader, J.; Zitzler, E. A Hypervolume-Based Optimizer for High-Dimensional Objective Spaces. In New Developments in Multiple Objective and Goal Programming; Springer: Berlin/Heidelberg, Germany, 2010; Volume 638, pp. 35–54.
- Palm, N.; Landerer, M.; Palm, H. Gaussian Process Regression Based Multi-Objective Bayesian Optimization for Power System Design. Sustainability 2022, 14, 12777.
- Karafotias, G.; Eiben, A.E.; Hoogendoorn, M. Generic Parameter Control with Reinforcement Learning. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada, 12–16 July 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 1319–1326.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Christodoulou, P. Soft Actor-Critic for Discrete Action Settings. arXiv 2019, arXiv:1910.07207.
- Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
- McKay, M.; Beckman, R.; Conover, W. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. Technometrics 1979, 21, 239–245.
Test Function | SOCEMO | RLhybMOO | Target Value | Perturbation | Single | Uniform | NSGA-II | LHC |
---|---|---|---|---|---|---|---|---|
ZDT1 | 94.53% | 96.49% | 86.40% | 80.2% | 88.2% | 82.0% | 86.89% | 82.16% |
ZDT2 | 96.23% | 96.30% | 87.5% | 71.92% | 85.86% | 74.5% | 79.18% | 73.03% |
ZDT3 | 85.67% | 89.23% | 87.44% | 72.5% | 85.84% | 75.11% | 75.89% | 76.26% |
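The test-function rows above refer to the ZDT benchmark suite (Zitzler et al.). If the reported percentages are read as dominated hypervolume relative to a reference value (an assumption; the paper's exact metric definition and reference points are not restated in this excerpt), such a score could be computed along the following lines, here using pymoo (assuming the ≥ 0.6 import paths) for the ZDT problems and the hypervolume indicator.

```python
# Sketch of a relative hypervolume score on a ZDT test function.
# Assumption: the table's percentages are hypervolume-based; the reference point
# and the 100% baseline used below are illustrative choices, not the paper's.
import numpy as np
from pymoo.problems import get_problem
from pymoo.indicators.hv import HV

problem = get_problem("zdt1")
ref_point = np.array([1.1, 1.1])        # illustrative reference point
hv = HV(ref_point=ref_point)

pf = problem.pareto_front()             # analytic Pareto front of ZDT1
hv_true = hv(pf)                        # hypervolume of the true front (100% baseline)

# In practice, `approx` would be the non-dominated set found by SOCEMO, RLhybMOO,
# or one of the elementary strategies; a coarse subsample of the true front
# stands in for such a result here.
approx = pf[::10]
print(f"relative hypervolume: {100 * hv(approx) / hv_true:.2f}%")
```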