Soft Actor-Critic Approach to Self-Adaptive Particle Swarm Optimisation
Abstract
1. Introduction
2. Background
2.1. Particle Swarm Optimisation
2.2. Control Parameter Configurations
2.3. Convergence Condition
2.4. Velocity Clamping
2.5. Self-Adaptive Particle Swarm Optimisation
2.5.1. Adaptive Particle Swarm Optimisation Based on Velocity Information
2.5.2. Adventurous Unified Adaptive Particle Swarm Optimisation
2.6. Reinforcement Learning
Soft Actor-Critic
2.7. Reinforcement Learning as a Hyper-Heuristic
2.7.1. Frequency Improvement Reinforcement Learning Selection
2.7.2. Objective Function Proportional Reinforcement Learning Selection
3. Design of Soft Actor-Critic Self-Adaptive Mechanism
3.1. Motivation for Reinforcement Learning for Self-Adaptation
- UAPSO-A does not consider particle characteristics such as velocity or position to inform CP updates.
- Orthogonal automata are used for each CP, which precludes the decision-making mechanism from considering the other CP values when updating the current CP.
- The UAPSO-A mechanism relies on simplistic probability-based decision-making, which assumes that previously successful CP values will remain successful in the future. This assumption does not necessarily hold for deceptive optimisation problems whose fitness landscape characteristics change over time.
- UAPSO-A can only select CP values from a discrete set, disallowing fine-grained adjustment.
- Labelling vs. Interacting: In order to train an NN with supervised learning, a labelled dataset is required. Since the NN has to predict the CPs, supervised learning requires knowledge of the “correct” CP values, which are, of course, not known in advance. RL bypasses this issue: the agent interacts with the environment autonomously and learns from experience. In this case, the agent is the SAC, and the environment is the PSO search process, which provides information such as particle velocities and solution quality to the agent.
- Exploration vs. Exploitation: Just as the particle swarm should maintain a suitable balance between exploration and exploitation, this balance should also be present in the CP space. Supervised NNs are not capable of exploring the CP space; they merely learn from labels. Conversely, the exploration-exploitation trade-off lies at the heart of the SAC formulation, as discussed in Section 2.6. The SAC agent explores the CP space by taking random actions, and exploits it by taking actions that have proven successful in the past. A sketch of this agent-environment interaction is given after this list.
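As a minimal, illustrative sketch of this agent-environment split (the class, attribute, and method names below are hypothetical, not the authors' implementation; boundary handling and velocity clamping are omitted for brevity), the PSO search process can be wrapped as an environment that exposes swarm information to the SAC agent, which in turn supplies CP values for the next iteration:

```python
import numpy as np

class PSOEnvironment:
    """Hypothetical wrapper: the PSO search process acts as the RL environment."""

    def __init__(self, objective, n_particles=30, n_dims=30, bounds=(-100.0, 100.0)):
        self.objective = objective
        self.bounds = bounds
        self.positions = np.random.uniform(*bounds, size=(n_particles, n_dims))
        self.velocities = np.zeros((n_particles, n_dims))
        self.pbest = self.positions.copy()
        self.pbest_f = np.apply_along_axis(objective, 1, self.positions)
        self.gbest = self.pbest[self.pbest_f.argmin()].copy()

    def observe(self):
        # Swarm information exposed to the agent, e.g. mean absolute velocity,
        # spread around the swarm centroid, and the current global best value.
        return np.array([
            np.mean(np.abs(self.velocities)),
            np.mean(np.linalg.norm(self.positions - self.positions.mean(axis=0), axis=1)),
            self.pbest_f.min(),
        ])

    def step(self, cps):
        """Perform one standard PSO iteration using the CPs (w, c1, c2) chosen by the agent."""
        w, c1, c2 = cps
        r1 = np.random.rand(*self.positions.shape)
        r2 = np.random.rand(*self.positions.shape)
        self.velocities = (w * self.velocities
                           + c1 * r1 * (self.pbest - self.positions)
                           + c2 * r2 * (self.gbest - self.positions))
        self.positions = self.positions + self.velocities
        f = np.apply_along_axis(self.objective, 1, self.positions)
        improved = f < self.pbest_f
        self.pbest[improved] = self.positions[improved]
        self.pbest_f[improved] = f[improved]
        self.gbest = self.pbest[self.pbest_f.argmin()].copy()
        return self.observe()

# Interaction pattern: state = env.observe(); cps = agent.act(state); state = env.step(cps)
```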
3.2. Reinforcement Learning Algorithm Selection
- Model-free, not model-based: Since there is no pre-existing model of how the CP configuration should be adapted (if there were, the problem would already be solved), the agent has to learn in a model-free way.
- Off-policy, not on-policy: Exploration of the CP space is aided if the agent is permitted to take actions off-policy. If the agent only takes actions on-policy, the only possible actions are those which have been successful in the past, increasing the likelihood of becoming stuck in a local optimum.
- Continuous action space, not discrete: The CP values that can be selected form a continuous range, so the policy should have a continuous action space. Contrast this with the UAPSO-A algorithm in Section 2.5.2, which selects from a set of discrete values. Discretisation adds complexity, because the agent must choose between n different values on every CP update, yet it offers less flexibility than a continuous action space, because it remains limited to those n specific values.
- Continuous state space, not discrete: All variables in the PSO formulation (i.e., particle positions, velocities, etc.) are continuous floating-point values. Since these are the variables that the agent observes, the agent should have a continuous state (observation) space. A sketch of how such spaces can be declared is given after this list.
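A minimal sketch of how such continuous action and state spaces might be declared, assuming Gymnasium-style Box spaces; the CP bounds and the choice of three swarm descriptors below are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
from gymnasium import spaces

# Continuous action space: one value per CP (w, c1, c2); the bounds are illustrative.
action_space = spaces.Box(
    low=np.array([-1.0, 0.0, 0.0], dtype=np.float32),
    high=np.array([1.0, 4.0, 4.0], dtype=np.float32),
)

# Continuous state (observation) space: real-valued swarm descriptors such as mean
# speed, swarm diversity, and the global best objective value.
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)
```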
3.3. SAC-SAPSO Architecture
Algorithm 1: SAC-SAPSO Algorithm
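Continuing the hypothetical PSOEnvironment sketch from Section 3.1, a rough sketch of the loop that Algorithm 1 describes could look as follows. The per-iteration CP update and the improvement-based reward are assumptions for illustration, and all names are hypothetical rather than the authors' exact implementation:

```python
def run_sac_sapso(env, agent, n_iterations=5000):
    """Sketch of the SAC-SAPSO interaction loop (hypothetical names and reward)."""
    state = env.observe()
    prev_gbest_f = env.pbest_f.min()
    for _ in range(n_iterations):
        cps = agent.act(state)              # SAC actor samples CP values (w, c1, c2)
        next_state = env.step(cps)          # one PSO iteration using those CPs
        gbest_f = env.pbest_f.min()
        reward = prev_gbest_f - gbest_f     # assumed reward: improvement of the global best
        agent.store(state, cps, reward, next_state)
        agent.update()                      # SAC gradient step(s) from the replay buffer
        state, prev_gbest_f = next_state, gbest_f
    return env.gbest
```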
3.3.1. PSO as the Environment
3.3.2. SAC as the Agent
3.3.3. SAC-SAPSO Reward Signal
4. Experimental Procedure
4.1. Evaluation Metrics
- Normalised global best solution quality. The objective function value at the global best position signifies how well an algorithm solves a given optimisation problem. Because functions differ in their ranges, the normalised global best solution quality is employed: once all experiments have concluded, the highest and lowest global best solutions discovered across all experiments and runs are used to normalise the global best solutions of each experiment to [0, 1]. The normalisation scale is arbitrary, but unit scaling yields easily interpretable results, where lower values represent better solutions.
- Average swarm diversity offers insight into the balance between exploration and exploitation. Diversity is computed as in [50].
- Percentage of particles in infeasible space. A particle is considered to be in infeasible space if it breaches the boundaries of the feasible search space in at least one dimension. Infeasible particles should not be taken into account when updating the best positions found, so as not to steer the search away from feasible space [9].
- Percentage of particles that are stable. Poli’s stability condition (given in Equation (4) in Section 2.3) determines whether a particle’s CP configuration is convergent, in which case the particle is considered stable [51].
- Average particle velocity denotes the average step size. For the search process to converge, step sizes have to decrease; however, to avoid becoming trapped in a local minimum, step sizes should not shrink too quickly. The average particle velocity is calculated as in [7]. A sketch of these metric computations is given after this list.
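A minimal sketch of these metric computations is given below. It assumes that the diversity measure of [50] is the mean Euclidean distance of particles to the swarm centroid, and it uses the commonly quoted form of Poli's stability condition, c1 + c2 < 24(1 - w^2) / (7 - 5w) with |w| < 1; function names and details are illustrative:

```python
import numpy as np

def normalised_gbest(gbest_f, f_min, f_max):
    # Min-max scaling to [0, 1] over all experiments and runs; lower is better.
    return (gbest_f - f_min) / (f_max - f_min)

def swarm_diversity(positions):
    # Assumed form of the diversity measure: mean Euclidean distance to the centroid.
    centroid = positions.mean(axis=0)
    return np.mean(np.linalg.norm(positions - centroid, axis=1))

def fraction_infeasible(positions, lower, upper):
    # A particle is infeasible if it violates the search space bounds in any dimension.
    violated = np.any((positions < lower) | (positions > upper), axis=1)
    return violated.mean()

def fraction_stable(w, c1, c2):
    # Poli's stability condition in its commonly quoted form; w, c1, c2 are per-particle arrays.
    return np.mean((np.abs(w) < 1) & (c1 + c2 < 24 * (1 - w**2) / (7 - 5 * w)))

def average_velocity(velocities):
    # Average step size: mean velocity magnitude across the swarm.
    return np.mean(np.linalg.norm(velocities, axis=1))
```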
4.2. Implementation
4.3. SAC Hyperparameters
5. Results
5.1. Performance Baselines
5.2. Soft Actor-Critic Control Parameter Adaptation
5.3. Soft Actor-Critic Control Parameter Adaptation with Velocity Clamping
5.4. Overview of Results
6. Conclusions
7. Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
CP | Control Parameter |
PSO | Particle Swarm Optimisation |
RL | Reinforcement Learning |
SAC | Soft Actor-Critic |
SAPSO | Self-Adaptive Particle Swarm Optimisation |
NN | Neural Network |
Appendix A. Benchmark Functions
The 52 benchmark functions used in this study are defined in Equations (A1)–(A52), together with their search domains. Their characteristics are summarised in the table below.
The columns denote continuity (C = continuous, NC = non-continuous), differentiability (D = differentiable, ND = non-differentiable), separability (S = separable, NS = non-separable), and modality (UM = unimodal, MM = multimodal).

Function | Equation | Cont. | Diff. | Sep. | Mod. |
---|---|---|---|---|---|
Ackley 1 | (A1) | C | D | NS | MM |
Alpine 1 | (A2) | C | ND | S | MM |
Bohachevsky 1 | (A3) | C | D | S | MM |
Bonyadi-Michalewicz | (A4) | C | D | NS | MM |
Brown | (A5) | C | D | NS | UM |
Cosine Mixture | (A6) | C | D | S | MM |
Deflected Corrugated Spring | (A8) | C | D | S | MM |
Discuss | (A9) | C | D | S | UM |
Drop Wave | (A10) | C | D | NS | MM |
Egg Crate | (A11) | C | D | S | MM |
Egg Holder | (A12) | C | ND | NS | MM |
Elliptic | (A13) | C | D | S | UM |
Exponential | (A14) | C | D | NS | UM |
Giunta | (A15) | C | D | S | MM |
Holder Table 1 | (A16) | C | ND | NS | MM |
Levy 3 | (A18) | C | D | NS | MM |
Levy–Montalvo 2 | (A19) | C | D | NS | MM |
Mishra 1 | (A21) | C | D | NS | MM |
Mishra 4 | (A22) | C | ND | NS | MM |
Needle Eye | (A23) | C | ND | S | MM |
Norwegian | (A24) | C | D | NS | MM |
Pathological | (A25) | C | D | NS | MM |
Penalty 1 | (A26) | C | D | NS | MM |
Penalty 2 | (A27) | C | D | NS | MM |
Periodic | (A28) | C | D | S | MM |
Pinter 2 | (A29) | C | D | NS | MM |
Price 2 | (A30) | C | D | NS | MM |
Qings | (A31) | C | D | S | MM |
Quadric | (A32) | C | D | NS | UM |
Quintic | (A33) | C | ND | S | MM |
Rana | (A34) | C | ND | NS | MM |
Rastrigin | (A35) | C | D | S | MM |
Ripple 25 | (A36) | C | D | S | MM |
Rosenbrock | (A37) | C | D | NS | MM |
Salomon | (A38) | C | D | NS | MM |
Schubert 4 | (A40) | C | D | S | MM |
Schwefel 1 | (A41) | C | D | S | UM |
Sinusoidal | (A43) | C | D | NS | MM |
Step Function 3 | (A44) | NC | ND | S | MM |
Trid | (A46) | C | D | NS | MM |
Trigonometric | (A47) | C | D | NS | MM |
Vincent | (A50) | C | D | S | MM |
Weierstrass | (A49) | C | D | S | MM |
Xin-She Yang 1 | (A51) | NC | ND | S | MM |
Xin-She Yang 2 | (A52) | C | ND | NS | MM |
Cross Leg Table | (A7) | NC | ND | NS | MM |
Lanczos 3 | (A17) | C | D | NS | MM |
Michalewicz | (A20) | C | D | S | MM |
Schaffer 4 | (A39) | C | D | NS | MM |
Sine Envelope | (A42) | C | D | NS | MM |
Stretched V Sine Wave | (A45) | C | D | NS | MM |
Wavy | (A48) | C | D | S | UM |
References
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948.
- Beielstein, T.; Parsopoulos, K.E.; Vrahatis, M.N. Tuning PSO Parameters Through Sensitivity Analysis; Technical Report Interner Bericht des Sonderforschungsbereichs (SFB) 531 Computational Intelligence No.CI-124/02; Universitätsbibliothek Dortmund: Dortmund, Germany, 2002.
- Van den Bergh, F.; Engelbrecht, A.P. A study of particle swarm optimization particle trajectories. Inf. Sci. 2006, 176, 937–971.
- Bonyadi, M.R.; Michalewicz, Z. Impacts of coefficients on movement patterns in the particle swarm optimization algorithm. IEEE Trans. Evol. Comput. 2016, 21, 378–390.
- Bratton, D.; Kennedy, J. Defining a standard for particle swarm optimization. In Proceedings of the IEEE Swarm Intelligence Symposium, Honolulu, HI, USA, 1–5 April 2007; pp. 120–127.
- Jiang, M.; Luo, Y.; Yang, S. Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm. Inf. Process. Lett. 2007, 102, 8–16.
- Harrison, K.; Engelbrecht, A.P.; Ombuki-Berman, B. Self-adaptive particle swarm optimization: A review and analysis of convergence. Swarm Intell. 2018, 12, 187–226.
- Harrison, K.; Engelbrecht, A.; Ombuki-Berman, B. Inertia Control Strategies for Particle Swarm Optimization: Too Much Momentum, Not Enough Analysis. Swarm Intell. 2016, 10, 267–305.
- Engelbrecht, A.P. Roaming Behavior of Unconstrained Particles. In Proceedings of the BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence, Ipojuca, Brazil, 8–11 September 2013; pp. 104–111.
- Harrison, K.R.; Engelbrecht, A.P.; Ombuki-Berman, B.M. Optimal parameter regions and the time-dependence of control parameter values for the particle swarm optimization algorithm. Swarm Evol. Comput. 2018, 41, 20–35.
- Ratnaweera, A.; Halgamuge, S.; Watson, H. Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Trans. Evol. Comput. 2004, 8, 240–255.
- Leonard, B.J.; Engelbrecht, A.P. On the optimality of particle swarm parameters in dynamic environments. In Proceedings of the IEEE Congress on Evolutionary Computation, Cancun, Mexico, 20–23 June 2013; pp. 1564–1569.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Barsce, J.C.; Palombarini, J.A.; Martínez, E.C. Automatic tuning of hyper-parameters of reinforcement learning algorithms using Bayesian optimization with behavioral cloning. arXiv 2021, arXiv:2112.08094.
- Talaat, F.M.; Gamel, S.A. RL based hyper-parameters optimization algorithm (ROA) for convolutional neural network. J. Ambient Intell. Humaniz. Comput. 2022, 13, 3389–3402.
- Liu, X.; Wu, J.; Chen, S. Efficient hyperparameters optimization through model-based reinforcement learning with experience exploiting and meta-learning. Soft Comput. 2023, 27, 7051–7066.
- Wauters, T.; Verbeeck, K.; De Causmaecker, P.; Vanden Berghe, G. Boosting Metaheuristic Search Using Reinforcement Learning. In Hybrid Metaheuristics; Talbi, E.G., Ed.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 433–452.
- Shi, Y.; Eberhart, R. A modified particle swarm optimizer. In Proceedings of the IEEE International Conference on Evolutionary Computation, Anchorage, AK, USA, 4–9 May 1998; Volume 6, pp. 69–73.
- Clerc, M.; Kennedy, J. The Particle Swarm-Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Trans. Evol. Comput. 2002, 6, 58–73.
- Sermpinis, G.; Theofilatos, K.; Karathanasopoulos, A.; Georgopoulos, E.F.; Dunis, C. Forecasting foreign exchange rates with adaptive neural networks using radial-basis functions and Particle Swarm Optimization. Eur. J. Oper. Res. 2013, 225, 528–540.
- Poli, R. Mean and Variance of the Sampling Distribution of Particle Swarm Optimizers During Stagnation. IEEE Trans. Evol. Comput. 2009, 13, 712–721.
- Poli, R.; Broomhead, D. Exact Analysis of the Sampling Distribution for the Canonical Particle Swarm Optimiser and Its Convergence During Stagnation. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, London, UK, 7–11 July 2007; Association for Computing Machinery: New York, NY, USA, 2007; pp. 134–141.
- von Eschwege, D.; Engelbrecht, A. A Cautionary Note on Poli’s Stability Condition for Particle Swarm Optimization. In Proceedings of the IEEE Swarm Intelligence Symposium, Mexico City, Mexico, 5–8 December 2023.
- Oldewage, E.T.; Engelbrecht, A.P.; Cleghorn, C.W. The merits of velocity clamping particle swarm optimisation in high dimensional spaces. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8.
- Li, X.; Fu, H.; Zhang, C. A Self-Adaptive Particle Swarm Optimization Algorithm. In Proceedings of the International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 5, pp. 186–189.
- Dong, C.; Wang, G.; Chen, Z.; Yu, Z. A Method of Self-Adaptive Inertia Weight for PSO. In Proceedings of the International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 1, pp. 1195–1198.
- Xu, G. An Adaptive Parameter Tuning of Particle Swarm Optimization Algorithm. Appl. Math. Comput. 2013, 219, 4560–4569.
- Hashemi, A.; Meybodi, M. A note on the learning automata based algorithms for adaptive parameter selection in PSO. Appl. Soft Comput. 2011, 11, 689–705.
- Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement Learning with Deep Energy-Based Policies. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1352–1361.
- Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
- Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, Washington, DC, USA, 2008; Volume 3, pp. 1433–1438.
- Maei, H.R.; Szepesvári, C.; Bhatnagar, S.; Precup, D.; Silver, D.; Sutton, R.S. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Proceedings of the 22nd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2009; pp. 1204–1212.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Van der Stockt, S.A.; Engelbrecht, A.P. Analysis of selection hyper-heuristics for population-based meta-heuristics in real-valued dynamic optimization. Swarm Evol. Comput. 2018, 43, 127–146.
- Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V.S.S. Alternative hyper-heuristic strategies for multi-method global optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
- Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V. Multi-method algorithms: Investigating the entity-to-algorithm allocation problem. In Proceedings of the IEEE Congress on Evolutionary Computation, Cancun, Mexico, 20–23 June 2013; pp. 570–577.
- Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V. Heuristic space diversity control for improved meta-hyper-heuristic performance. Inf. Sci. 2015, 300, 49–62.
- Nareyek, A. Choosing Search Heuristics by Non-Stationary Reinforcement Learning. In Metaheuristics: Computer Decision-Making; Springer: Boston, MA, USA, 2004; pp. 523–544.
- Burke, E.K.; Kendall, G.; Soubeiga, E. A Tabu-Search Hyperheuristic for Timetabling and Rostering. J. Heuristics 2003, 9, 451–470.
- Wirth, C.; Fürnkranz, J. EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning. In Proceedings of the 5th Asian Conference on Machine Learning, Canberra, Australia, 13–15 November 2013; Volume 29, pp. 483–497.
- Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report CUED/F-INFENG/TR 166; Department of Engineering, University of Cambridge: Cambridge, UK, 1994.
- Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 1928–1937.
- Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 31st International Conference on Machine Learning, Lille, France, 11 July 2015.
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016.
- Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1582–1591.
- Olorunda, O.; Engelbrecht, A.P. Measuring exploration/exploitation in particle swarms using swarm diversity. In Proceedings of the IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 1128–1134.
- Cleghorn, C.W.; Engelbrecht, A. Particle swarm optimizer: The impact of unstable particles on performance. In Proceedings of the IEEE Swarm Intelligence Symposium, Athens, Greece, 6–9 December 2016; pp. 1–7.
- Goodfellow, I.J.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
- Engelbrecht, A. Stability-Guided Particle Swarm Optimization. In Swarm Intelligence (ANTS); Springer: Cham, Switzerland, 2022.
- Harrison, K.R.; Engelbrecht, A.P.; Ombuki-Berman, B.M. An adaptive particle swarm optimization algorithm based on optimal parameter regions. In Proceedings of the IEEE Swarm Intelligence Symposium, Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8.
Algorithm | Definition | Action | State | Model | Policy |
---|---|---|---|---|---|
MC [41] | Every-Visit Monte Carlo | Discr. | Discr. | Model-free | Either |
SARSA [42] | State-action-reward-state-action | Discr. | Discr. | Model-free | On-policy |
Q-learning [43] | State-action-reward-state | Discr. | Discr. | Model-free | Off-policy |
DQN [44] | Deep Q-Network | Discr. | Cont. | Model-free | Off-policy |
A3C [45] | Asynchronous Advantage Actor-Critic | Cont. | Cont. | Model-free | On-policy |
TRPO [46] | Trust Region Policy Optimisation | Cont. | Cont. | Model-free | On-policy |
PPO [47] | Proximal Policy Optimisation | Cont. | Cont. | Model-free | On-policy |
DDPG [48] | Deep Deterministic Policy Gradient | Cont. | Cont. | Model-free | Off-policy |
TD3 [49] | Twin Delayed Deep Deterministic Policy Gradient | Cont. | Cont. | Model-free | Off-policy |
SAC [14] | Soft Actor-Critic | Cont. | Cont. | Model-free | Off-policy |
Hyperparameter | Value | Hyperparameter | Value |
---|---|---|---|
Discount factor | 1 | Target smoothing coefficient | 0.005 |
Reward scale factor | 1 | Learning rate | 0.0001 |
Actor layer size | 256 | Replay buffer size | |
Critic layer size | 256 | Training steps | |
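As a hedged illustration of how such a configuration might be instantiated, assuming an off-the-shelf SAC implementation such as Stable-Baselines3 (not necessarily what was used in the paper); the environment wrapper module, the replay buffer size, and the number of training steps below are hypothetical placeholders:

```python
from stable_baselines3 import SAC

from pso_env import PSOGymEnv  # hypothetical Gym-compatible wrapper of the PSO environment

env = PSOGymEnv()
n_training_steps = 100_000     # hypothetical placeholder; the value is not shown in the table above

model = SAC(
    "MlpPolicy",
    env,
    learning_rate=1e-4,                        # learning rate
    tau=0.005,                                 # target smoothing coefficient
    gamma=1.0,                                 # discount factor
    policy_kwargs=dict(net_arch=[256, 256]),   # actor and critic hidden layer sizes
)
# A reward scale factor of 1 means environment rewards are used unscaled.
model.learn(total_timesteps=n_training_steps)
```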
Algorithm | Training | | Testing | |
---|---|---|---|---|
baseline_tvac | 0.0883 | 0.1894 | 0.2926 | 0.2037 |
baseline_timevariant | 0.0888 | 0.1991 | 0.3287 | 0.3052 |
baseline_constant | 0.0893 | 0.1856 | 0.3802 | 0.2822 |
baseline_random | 0.1044 | 0.2032 | 0.3812 | 0.3125 |
Algorithm | Training | | Testing | |
---|---|---|---|---|
sac_10 | 0.0589 | 0.1338 | 0.2585 | 0.2602
sac_125 | 0.0628 | 0.1299 | 0.1811 | 0.1920
sac_25 | 0.0687 | 0.1475 | 0.2415 | 0.1712
sac_100 | 0.0693 | 0.1599 | 0.2363 | 0.1905
sac_auto | 0.0693 | 0.1529 | 0.2373 | 0.1675
sac_50 | 0.0737 | 0.1609 | 0.2342 | 0.1998
Algorithm | Training | | Testing | |
---|---|---|---|---|
vc_sac_125 | 0.0428 | 0.1115 | 0.0919 | 0.0892
vc_sac_50 | 0.0431 | 0.1030 | 0.0689 | 0.0575
vc_sac_100 | 0.0473 | 0.1143 | 0.1089 | 0.0828
vc_sac_auto | 0.0490 | 0.1090 | 0.2280 | 0.1628
vc_sac_10 | 0.0554 | 0.1219 | 0.1931 | 0.1902
vc_sac_25 | 0.0559 | 0.1239 | 0.1443 | 0.1373
Algorithm | Training | | Testing | |
---|---|---|---|---|
vc_sac_125 | 0.0428 | 0.1115 | 0.0919 | 0.0892
vc_sac_50 | 0.0431 | 0.1030 | 0.0689 | 0.0575
vc_sac_100 | 0.0473 | 0.1143 | 0.1089 | 0.0828
vc_sac_auto | 0.0490 | 0.1090 | 0.2280 | 0.1628
vc_sac_10 | 0.0554 | 0.1219 | 0.1931 | 0.1902
vc_sac_25 | 0.0559 | 0.1239 | 0.1443 | 0.1373
sac_10 | 0.0589 | 0.1338 | 0.2585 | 0.2602
sac_125 | 0.0628 | 0.1299 | 0.1811 | 0.1920
sac_25 | 0.0687 | 0.1475 | 0.2415 | 0.1712
sac_100 | 0.0693 | 0.1599 | 0.2363 | 0.1905
sac_auto | 0.0693 | 0.1529 | 0.2373 | 0.1675
sac_50 | 0.0737 | 0.1609 | 0.2342 | 0.1998
baseline_tvac | 0.0883 | 0.1894 | 0.2926 | 0.2037
baseline_timevariant | 0.0888 | 0.1991 | 0.3287 | 0.3052
baseline_constant | 0.0893 | 0.1856 | 0.3802 | 0.2822
baseline_random | 0.1044 | 0.2032 | 0.3812 | 0.3125