Article

Soft Actor-Critic Approach to Self-Adaptive Particle Swarm Optimisation

by
Daniel von Eschwege
1,† and
Andries Engelbrecht
1,2,3,*,†
1
Department of Industrial Engineering, Stellenbosch University, Stellenbosch 7600, South Africa
2
Computer Science Division, Stellenbosch University, Stellenbosch 7600, South Africa
3
GUST Engineering & Applied Innovation Research Centre, Gulf University for Science and Technology, Mubarak Al-Abdullah 32093, Kuwait
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Mathematics 2024, 12(22), 3481; https://doi.org/10.3390/math12223481
Submission received: 24 September 2024 / Revised: 22 October 2024 / Accepted: 30 October 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Heuristic Optimization and Machine Learning)

Abstract:
Particle swarm optimisation (PSO) is a swarm intelligence algorithm that finds candidate solutions by iteratively updating the positions of particles in a swarm. The decentralised optimisation methodology of PSO is ideally suited to problems with multiple local minima and deceptive fitness landscapes, where traditional gradient-based algorithms fail. PSO performance depends on the use of a suitable control parameter (CP) configuration, which governs the trade-off between exploration and exploitation in the swarm. CPs that ensure good performance are problem-dependent. Unfortunately, CP tuning is computationally expensive and inefficient. Self-adaptive particle swarm optimisation (SAPSO) algorithms aim to adaptively adjust CPs during the optimisation process to improve performance, ideally while reducing the number of performance-sensitive parameters. This paper proposes a reinforcement learning (RL) approach to SAPSO by utilising a velocity-clamped soft actor-critic (SAC) that autonomously adapts the PSO CPs. The proposed SAC-SAPSO obtains a 50% to 80% improvement in solution quality compared to various baselines, has either one or zero runtime parameters, is time-invariant, and does not result in divergent particles.

1. Introduction

Particle swarm optimisation (PSO), introduced by Kennedy and Eberhart [1], is a swarm intelligence algorithm modelled after the swarming behaviour of birds in a flock. PSO finds candidate solutions by iteratively updating the positions and velocities of particles in a swarm. Position and velocity updates for each particle are calculated using the particle’s momentum, the best position the particle has visited so far, and the best position visited by any particle in that particle’s neighbourhood. PSO is ideally suited to problems with multiple local minima and deceptive fitness landscapes, where traditional gradient-based algorithms fail. PSO performance depends on the use of a suitable control parameter (CP) configuration, which governs the trade-off between exploration and exploitation in the swarm [2,3,4,5]. The CP configuration also determines whether the swarm reaches an equilibrium state or diverges. CPs are problem-dependent, and CP configurations that perform well over a large set of problems may not be suitable for the specific problem at hand [3,6], because the exploration-exploitation trade-off is fitness-landscape dependent. However, tuning CPs for a specific problem is computationally expensive.
Self-adaptive particle swarm optimisation (SAPSO) algorithms aim to adaptively adjust CPs during the optimisation process to improve performance, ideally while reducing the number of performance-sensitive CPs [7]. Recent studies have, however, shown that most SAPSO approaches introduce more parameters to which PSO performance is sensitive than they manage to adapt [7,8]. Most SAPSO approaches also result in divergent behaviour, infeasible solutions, or excessively small particle step sizes, and are generally ineffective at attaining better solutions [7,8,9]. Given complex, deceptive optimisation problems, often with rapidly changing landscape characteristics, there is in many cases no basis to assume that what worked up to a certain point will continue to work in the future [10]. If self-adaptation is, therefore, based on behaviour observed up to a certain point during the search, it cannot automatically be inferred that this behaviour is indicative of what will work beyond that point. Many attempts also adapt the CP configuration in accordance with the current time-step, making for a time-variant SAPSO algorithm [11]. Time-variant SAPSO is more limited, as it only makes use of the computational budget to inform CP adaptation [12]. Ideally, a time-invariant self-adaptive algorithm should automatically adapt CPs by being fitness-landscape-aware to some extent, rather than depending only on the number of time-steps used. Furthermore, particles should neither diverge and leave the feasible search space, nor converge prematurely to a local minimum.
Reinforcement learning (RL) is the process through which an agent learns an optimal behavioural policy by trying out actions in an environment, guided by a reward signal [13]. The agent may also receive information which conveys the state the environment is in. An RL agent can, therefore, learn to adjust the CPs of a PSO (action), based on the particle velocities or positions (observation state), guided by the best solution found at the current time-step (reward). The soft actor-critic (SAC) [14] is a specific RL formulation which employs continuous action and state spaces, compatible with the continuous domain of PSO variables, such as particle positions and velocities, and is, therefore, suitable for the dynamic adaptation of PSO CPs during the search process.
The contribution of this study is, therefore, the design and extensive empirical analysis of a truly self-adaptive, RL-based PSO algorithm. The proposed SAC-SAPSO obtains a 50% to 80% improvement in solution quality over relevant performance baselines. The SAC-SAPSO also reduces the number of runtime parameters to either one or zero, is time-invariant, and does not exhibit divergent search behaviour. Finally, whereas RL has been implemented to pre-tune hyperparameters for other heuristics [15,16,17,18], no existing research makes use of RL to dynamically adapt CPs during runtime.
The rest of this paper is structured as follows: Section 2 explains the fundamentals of PSO, RL, SAC, and some existing SAPSO algorithms. Section 3 elaborates on design decisions pertaining to the SAC-SAPSO, and Section 4 relates the experimental procedure followed as well as relevant evaluation metrics. Section 5 demonstrates characteristic search behaviour of SAC-SAPSO and discusses the results obtained, Section 6 summarizes the conclusions drawn, and Section 7 outlines future work.

2. Background

This section explains the PSO algorithm, the importance of CPs, methods such as velocity clamping, SAPSO algorithms proposed in the literature, RL principles, and the SAC algorithm.

2.1. Particle Swarm Optimisation

PSO performs swarm-based search to solve optimisation problems by updating particle velocities and then positions on each time-step, based on the best position a specific particle has found (cognitive component), the best position found by the swarm as a whole (social component), and the previous velocity, or inertia, of the particle [1,19]. The velocity update equation per dimension is given as
$$v_{ij}(t+1) = \omega v_{ij}(t) + c_1 r_{1ij}(t)\left[y_{ij}(t) - x_{ij}(t)\right] + c_2 r_{2ij}(t)\left[\hat{y}_{ij}(t) - x_{ij}(t)\right] \qquad (1)$$
where at time $t$, particle $i$ in dimension $j$ has a velocity of $v_{ij}(t)$, a position of $x_{ij}(t)$, a personal best of $y_{ij}(t)$, and a neighbourhood best of $\hat{y}_{ij}(t)$. The CPs are $\omega$, $c_1$, and $c_2$ for the inertia weight, cognitive, and social acceleration components, respectively. Sampled uniformly from $(0,1)$, $r_{1ij}(t)$ and $r_{2ij}(t)$ introduce stochasticity into the search. Position updates are subsequently calculated per particle $i$ using
$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1) \qquad (2)$$
where $\mathbf{v}_i$ is the velocity vector and $\mathbf{x}_i$ is the position vector.
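For concreteness, a minimal NumPy sketch of the synchronous update in Equations (1) and (2) follows; the function and array names are illustrative and not taken from the authors' implementation, and `rng` is assumed to be a `numpy.random.Generator` (e.g., `np.random.default_rng()`).

```python
import numpy as np

def pso_step(positions, velocities, pbest, nbest, omega, c1, c2, rng):
    """One synchronous PSO update, Equations (1) and (2)."""
    n_s, n_x = positions.shape                      # swarm size, dimensionality
    r1 = rng.random((n_s, n_x))                     # r_1ij ~ U(0, 1)
    r2 = rng.random((n_s, n_x))                     # r_2ij ~ U(0, 1)
    velocities = (omega * velocities
                  + c1 * r1 * (pbest - positions)   # cognitive component
                  + c2 * r2 * (nbest - positions))  # social component
    positions = positions + velocities              # Equation (2)
    return positions, velocities
```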

2.2. Control Parameter Configurations

Constant CP values are often selected as $\omega = 0.729844$, $c_1 = 1.496180$, and $c_2 = 1.496180$ [7,20]. However, successful adjustment of the CPs in order to manage the trade-off between exploration and exploitation improves the success of PSO, because unexplored regions are searched sufficiently while computational effort is not wasted on evidently non-promising areas of the fitness landscape. Usually, more exploration (high $\omega$, high $c_1$, low $c_2$) is initially desirable, shifting to exploitative behaviour (low $\omega$, low $c_1$, high $c_2$) towards the end of the search. CPs can, therefore, be adjusted in a time-variant fashion using
$$\omega(t) = 0.4\,\frac{(t - n_t)^2}{n_t^2} + 0.4, \qquad c_1(t) = -3\,\frac{t}{n_t} + 3.5, \qquad c_2(t) = 3\,\frac{t}{n_t} + 0.5 \qquad (3)$$
where $t$ is the current time-step and $n_t$ is the maximum number of time-steps (adapted from [21] so that $c_1 + c_2 \leq 4$ [19]). Similarly to Equation (3), the time-variant acceleration coefficients PSO (PSO-TVAC) [11] also adjusts $c_1$ and $c_2$, but keeps $\omega$ at a fixed value. Time-variant SAPSO algorithms thus adjust CPs as a function of the current time-step and the total computational budget. Time-invariant SAPSO algorithms, however, adjust CPs as a function of the particle swarm's properties, for example, the average particle velocity, or the rate of improvement in solution quality.
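As an illustration, the schedule in Equation (3) reduces to the following sketch, assuming `t` is the current time-step and `n_t` the maximum number of time-steps:

```python
def time_variant_cps(t, n_t):
    """Time-variant CP schedule of Equation (3)."""
    omega = 0.4 * ((t - n_t) / n_t) ** 2 + 0.4   # decays from 0.8 to 0.4
    c1 = -3.0 * t / n_t + 3.5                    # cognitive: 3.5 -> 0.5
    c2 = 3.0 * t / n_t + 0.5                     # social:    0.5 -> 3.5
    return omega, c1, c2
```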

2.3. Convergence Condition

Particles with CPs which adhere to the stability condition [22,23] given by
$$c_1 + c_2 < \frac{24\left(1 - \omega^2\right)}{7 - 5\omega}, \qquad \omega \in [-1, 1] \qquad (4)$$
are considered stable, since Equation (4) ensures that the particle swarm will, in the limit, reach an equilibrium state of zero velocity. This equilibrium state may, however, take very long to reach, because the stability condition makes no claim regarding the number of iterations required to reach this state [24]. Similarly, velocities may also assume very large values during the search process before the equilibrium state is reached.
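A direct check of Equation (4) can be sketched as follows; the function name is illustrative.

```python
def is_stable(omega, c1, c2):
    """Poli's stability condition, Equation (4)."""
    if not -1.0 <= omega <= 1.0:
        return False
    return c1 + c2 < 24.0 * (1.0 - omega ** 2) / (7.0 - 5.0 * omega)
```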

2.4. Velocity Clamping

Velocity explosion was first observed by Kennedy and Eberhart [19], and is the phenomenon whereby particle velocities grow unboundedly and sometimes leave the search space without ever returning [9]. Velocity explosion is especially common in high-dimensional search spaces, and can be countered using velocity clamping, which imposes limits on the maximum allowable velocity values by introducing the bound $v_{\max}$ [25]. Given a search space with boundaries $[\ell, u]$, velocities can be clamped per dimension by constraining the velocity component in each search dimension $j$ to a fixed value, according to
$$v_{ij}(t+1) = \begin{cases} v_{ij}(t+1) & \text{if } -v_{\max,j} \leq v_{ij}(t+1) \leq v_{\max,j} \\ v_{\max,j} & \text{if } v_{ij}(t+1) > v_{\max,j} \\ -v_{\max,j} & \text{if } v_{ij}(t+1) < -v_{\max,j} \end{cases} \qquad (5)$$
for
$$v_{\max,j} = \delta\left(u_j - \ell_j\right), \qquad \delta \in (0,1) \qquad (6)$$
Alternatively, velocity magnitudes can be clamped by setting an upper bound on the magnitude of the velocity vector, i.e.,
$$\mathbf{v}_i(t+1) = \begin{cases} \mathbf{v}_i(t+1) & \text{if } \|\mathbf{v}_i(t+1)\| \leq v_{\max} \\ \dfrac{v_{\max}}{\|\mathbf{v}_i(t+1)\|}\,\mathbf{v}_i(t+1) & \text{if } \|\mathbf{v}_i(t+1)\| > v_{\max} \end{cases} \qquad (7)$$
for
$$v_{\max} = \delta \sqrt{\sum_{j=1}^{n_x} \left(u_j - \ell_j\right)^2} = \delta\,\|u - \ell\| \qquad (8)$$
where $u = (u_1, \ldots, u_{n_x})^T$ and $\ell = (\ell_1, \ldots, \ell_{n_x})^T$ are $n_x$-dimensional vectors.
Whereas clamping by magnitude purports to have the advantage of maintaining directionality, since all velocity components are scaled by the same factor, it has the disadvantage that large velocity components dominate, while small components might become infinitesimal. Conversely, velocity clamping per dimension modifies the particle's trajectory, but smaller velocity components maintain significance. However, since the random components $r_1$ and $r_2$ modify the particle's trajectory in all dimensions anyway, velocity clamping per dimension is preferred here.
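Both clamping schemes can be sketched as follows, assuming `lower` and `upper` are the per-dimension bounds of the search space and `delta` is the fraction from Equations (6) and (8); this is an illustrative sketch rather than the authors' implementation.

```python
import numpy as np

def clamp_per_dimension(velocities, lower, upper, delta):
    """Velocity clamping per dimension, Equations (5) and (6)."""
    v_max = delta * (upper - lower)                      # per-dimension bound
    return np.clip(velocities, -v_max, v_max)

def clamp_by_magnitude(velocities, lower, upper, delta):
    """Velocity clamping by magnitude, Equations (7) and (8)."""
    v_max = delta * np.linalg.norm(upper - lower)
    norms = np.linalg.norm(velocities, axis=1, keepdims=True)
    scale = np.minimum(1.0, v_max / np.maximum(norms, 1e-12))  # only shrink, never grow
    return velocities * scale
```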

2.5. Self-Adaptive Particle Swarm Optimisation

Numerous SAPSO algorithms exist in the literature, but many of them lead either to particles that diverge into infeasible search space, or to premature convergence due to step sizes which quickly approach zero [8]. Additionally, most of these algorithms add more CPs than are originally present. This observation is supported by another study [7], which found that many of the 18 examined SAPSO algorithms displayed either divergence or premature convergence. Efforts to suitably adjust the CPs $\omega$, $c_1$, and $c_2$ therefore generally fail to do so in a way that reduces the number of parameters which impact performance. Exceptions are SAPSO by [26], improved PSO by [27], and PSO with random acceleration coefficients by [7].
This section discusses SAPSO algorithms relevant for the purposes of this study.

2.5.1. Adaptive Particle Swarm Optimisation Based on Velocity Information

Adaptive particle swarm optimisation based on velocity information (APSO-VI) [28] seeks to adjust particle velocities towards an optimal, decreasing velocity by manipulating the inertia weight ω to reflect the shift from exploration to exploitation. The APSO-VI algorithm functions as follows: the average velocity of all particles in the swarm is calculated using
$$v_{\mathrm{avg}}(t) = \frac{1}{n_x n_s} \sum_{i=1}^{n_s} \sum_{j=1}^{n_x} \left| v_{ij}(t) \right| \qquad (9)$$
where $n_x$ is the number of dimensions and $n_s$ is the size of the swarm. Thereafter, the optimal velocity is defined as
$$v_{\mathrm{ideal}}(t) = v_s\,\frac{1 + \cos\left(\pi t / T_{0.95}\right)}{2} \qquad (10)$$
with the initial optimal velocity $v_s$ as
$$v_s = \frac{x_{\max} - x_{\min}}{2} \qquad (11)$$
and $T_{0.95}$ is the point where 95% of the search is finalised; $x_{\max}$ and $x_{\min}$ are the respective boundaries of the feasible search space. The CP $\omega$ is then updated using
$$\omega(t+1) = \begin{cases} \max\left\{\omega(t) - \Delta\omega,\ \omega_{\min}\right\} & \text{if } v_{\mathrm{avg}}(t) \geq v_{\mathrm{ideal}}(t+1) \\ \min\left\{\omega(t) + \Delta\omega,\ \omega_{\max}\right\} & \text{if } v_{\mathrm{avg}}(t) < v_{\mathrm{ideal}}(t+1) \end{cases} \qquad (12)$$
where $\omega_{\min} = 0.3$ and $\omega_{\max} = 0.9$ are the minimum and maximum inertia weights, and $\Delta\omega = 0.1$ is the inertia weight step size. As the search progresses, the decrease in $\omega$ causes the CP configuration to become convergent according to Poli's convergence criterion in Equation (4). Consequently, while initially almost the whole swarm resides in infeasible search space, upon completion of the search, only 6% of particle positions are infeasible.
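The APSO-VI update then amounts to the following sketch, using the constants quoted above; `T_95` denotes $T_{0.95}$ and is assumed to be 95% of the iteration budget.

```python
import numpy as np

def apso_vi_omega(omega, velocities, t, T_95, x_max, x_min,
                  d_omega=0.1, omega_min=0.3, omega_max=0.9):
    """Inertia-weight update of APSO-VI, Equations (9)-(12)."""
    v_avg = np.mean(np.abs(velocities))                            # Equation (9)
    v_s = (x_max - x_min) / 2.0                                    # Equation (11)
    v_ideal = v_s * (1.0 + np.cos(np.pi * (t + 1) / T_95)) / 2.0   # Equation (10)
    if v_avg >= v_ideal:
        return max(omega - d_omega, omega_min)   # swarm too fast: reduce inertia
    return min(omega + d_omega, omega_max)       # swarm too slow: increase inertia
```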
APSO-VI is of interest, because it reflects the intuitive expectation that a successful SAPSO algorithm should exhibit a transition from exploration towards exploitation as the search progresses. It might be informative to observe whether an RL-based SAPSO algorithm automatically learns this characteristic.

2.5.2. Adventurous Unified Adaptive Particle Swarm Optimisation

Adventurous unified adaptive particle swarm optimisation (UAPSO-A) [29] makes use of learning automata, which are defined as adaptive decision-making devices. As the learning automata interact with their stochastic environment, they begin to make better decisions, in this case choosing new CP values for the next time-step.
The UAPSO-A algorithm functions as follows: for each CP, a separate learning automaton selects new CP values from a discrete, predefined range of values on each time-step. The inertia weight range consists of $n_\omega$ equidistant values in $[\omega_{\min}, \omega_{\max}] = [0, 1]$. The cognitive and social components' ranges consist of $n_c$ equidistant values in $[c_{\min}, c_{\max}] = [0, 4]$. A feedback signal instructs the automaton. If selection of the CP value at index $i$ improved solution quality, selection probabilities are updated using
$$p_j(t+1) = \begin{cases} p_j(t) + a\left(1 - p_j(t)\right) & \text{if } i = j \\ p_j(t)\,(1 - a) & \text{otherwise} \end{cases} \qquad (13)$$
where $p_j$ is the selection probability of the CP value at index $j$. Furthermore, $a$ is the reward step size, which increases the likelihood of reselection for the successful CP value and decreases the likelihood for all other CP values. If selection of the CP value at index $i$ did not improve the solution quality, selection probabilities are updated using
$$p_j(t+1) = \begin{cases} p_j(t)\,(1 - b) & \text{if } i = j \\ \dfrac{b}{|A| - 1} + p_j(t)\,(1 - b) & \text{otherwise} \end{cases} \qquad (14)$$
where $b$ is the penalty step size, and $|A|$ is the number of possible CP values to choose from. The likelihood of reselection of the unsuccessful CP value is, therefore, decreased.
UAPSO-A demonstrates convergent behaviour despite not enforcing Poli's convergence criterion in Equation (4). Between 60% and 80% of particles have convergent CP values at any given time-step. As a result, many particles initially violate the boundaries of the feasible search space, but the bound violations decrease smoothly to less than 40% towards the end of the search process as the parameters are adapted [7]. Finally, UAPSO-A adapts $\omega$, $c_1$, and $c_2$, but introduces nine constants, namely, $n_\omega$, $\omega_{\min}$, $\omega_{\max}$, $n_c$, $c_{\min}$, $c_{\max}$, $a$, $b$, and $\tau$. UAPSO-A is of interest because the learning automata can be considered as a rudimentary form of RL. The RL similarity lies in that the automata take actions by making decisions, informed by a reward signal which is based on the objective function value of the particle.
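The automaton updates of Equations (13) and (14) reduce to the following sketch; the step-size defaults for `a` and `b` are illustrative assumptions, not values prescribed by UAPSO-A.

```python
import numpy as np

def update_probs(probs, i, improved, a=0.1, b=0.1):
    """Learning-automaton probability update, Equations (13) and (14).

    probs: NumPy array of selection probabilities over the discrete CP values.
    i:     index of the CP value that was just tried.
    """
    n = len(probs)
    if improved:                                # reward the selected CP value
        new = probs * (1.0 - a)
        new[i] = probs[i] + a * (1.0 - probs[i])
    else:                                       # penalise the selected CP value
        new = b / (n - 1) + probs * (1.0 - b)
        new[i] = probs[i] * (1.0 - b)
    return new
```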

2.6. Reinforcement Learning

RL is the process by which an agent learns an optimal action to take in a given situation, guided by a reward signal [13]. In many cases, the agent observes the environment, meaning that it receives information that conveys the current state of the environment. The agent learns by interacting with its environment and trying out different actions, and receives rewards correspondingly. It should also sufficiently explore the possible action space, while still eventually converging to an optimal path of action. Furthermore, actions taken at a certain point in time usually affect the environment and future reward. The paradigms of trial-and-error, exploration-exploitation, and delayed gratification are, therefore, central to RL. Learning from experience in this fashion allows RL to engage with interactive and unfamiliar problems where examples of optimal behaviour might not be available, and which are, therefore, not readily addressable by supervised learning techniques.
This section elucidates the concepts of RL and the SAC algorithm. It introduces the principles of RL, outlines the SAC algorithm, and discusses the mathematical underpinnings of SAC, including policy iteration, entropy maximisation, and the modified Bellman backup operator. The section culminates by emphasising the scalability and stability of SAC, its ability to address high sample complexity and convergence brittleness, and its effectiveness in balancing exploration and exploitation.

Soft Actor-Critic

Introduced by Haarnoja et al. [14], the SAC is an RL algorithm that addresses the challenges typically faced by model-free RL, such as high sample complexity and convergence brittleness. High sample complexity implies a large computational cost, and convergence brittleness necessitates tedious hyperparameter tuning to ensure stable learning and convergence. In contrast to most RL algorithms, SAC aims to maximise not only the expected reward, R, but also the action-selection entropy. This is achieved by taking maximally random actions off-policy, while still optimising the policy for the given task. These random, high-entropy actions encourage diverse behaviour [30,31] and promote exploration of the action space. High-entropy actions provide resilience against estimation errors [32], as the policy assigns similar probabilities to actions with similar values to prevent the policy from continually reselecting the same action. Moreover, SAC can handle continuous, high-dimensional state and action spaces, which often pose problems for convergence stability [33].
The key constituents of SAC include an actor for the policy function, and a critic for the value function. Furthermore, off-policy learning increases efficiency by incorporating previously collected data, and entropy maximisation enhances exploration and reduces brittleness. In adherence to the actor-critic formulation, SAC performs policy iteration by cycling between policy evaluation, which calculates the value function $v^{\pi}$ of a policy $\pi$, and policy improvement, where the value function is used to improve the policy [13]. Policy iteration for SAC has a convergence proof [14]. Entropy maximisation ensures that random actions are chosen using
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[ r(s_t, a_t) + \alpha\,\mathcal{H}\left(\pi(\cdot \mid s_t)\right) \right] \qquad (15)$$
where $s_t$ represents the current state in state space $\mathcal{S}$, and $a_t$ is the current action in action space $\mathcal{A}$. The expectation is taken over $\rho_\pi$, the state-action distribution induced by the policy. The reward function is denoted by $r$. This formulation yields a Markov decision process (MDP) defined as $(\mathcal{S}, \mathcal{A}, p, r)$ [13]. The temperature parameter $\alpha$ balances the weighting between entropy and reward, where a lower $\alpha$ results in less randomness in the optimal policy and, consequently, less exploration.
The Q-value represents the expected reward for a certain state-action pair, and is obtained by iteratively calculating the modified Bellman backup operator using
$$\mathcal{T}^{\pi} Q(s_t, a_t) \triangleq r(s_t, a_t) + \gamma\,\mathbb{E}_{s_{t+1} \sim p}\left[ V(s_{t+1}) \right] \qquad (16)$$
from any starting function $Q : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$. The discount factor, $\gamma$, determines how myopic the agent is. For a small $\gamma$, the agent optimises for immediate reward at the expense of potentially more reward later. Conversely, a higher $\gamma$ incentivises 'delayed gratification', where the agent might sacrifice reward at the current time-step if such action ensures increased reward at a later stage. After calculation of Equation (16), the value of the policy is calculated as
$$V(s_t) = \mathbb{E}_{a_t \sim \pi}\left[ Q(s_t, a_t) - \log \pi(a_t \mid s_t) \right] \qquad (17)$$
During policy improvement, $\pi$ is updated using the Kullback–Leibler divergence [34] projection to constrain the policy to the set of tractable policies, $\Pi$, according to
$$\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi} D_{\mathrm{KL}}\left( \pi'(\cdot \mid s_t)\ \Big\|\ \frac{\exp\left(Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\right)}{Z^{\pi_{\mathrm{old}}}(s_t)} \right) \qquad (18)$$
where $Z^{\pi_{\mathrm{old}}}(s_t)$ is the partition function which normalises the distribution.
In order to extend Equations (16)–(18) to large, continuous domains, the policy $\pi$ must be approximated by a tractable policy $\pi_\phi$, the Q-function $Q$ by a "soft" Q-function $Q_\theta$, and the value function $V$ by an approximate state value function $V_\psi$. Parameters $\phi$, $\theta$, and $\psi$, respectively, govern these approximations. Also, since the complete, iterative evaluation of Equation (16) quickly becomes intractable, the approximations are updated alternatingly, rather than executing both evaluation and improvement to completion. The soft value function is, therefore, trained to minimise the squared residual error,
$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\left[ \tfrac{1}{2}\left( V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}\left[ Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t) \right] \right)^2 \right] \qquad (19)$$
where the error is the difference between the approximated value $V_\psi$ and the expectation of $Q_\theta$ in conjunction with the entropy of $\pi_\phi$. The expectation is taken over $\mathcal{D}$, the distribution of all previous states and actions.
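In code, the squared residual error of Equation (19) amounts to the following PyTorch-style sketch over a mini-batch; the tensor names are illustrative.

```python
import torch

def soft_value_loss(v_psi, q_theta, log_pi):
    """Squared residual error of the soft value function, Equation (19).

    v_psi:   V_psi(s_t), shape (B,)
    q_theta: Q_theta(s_t, a_t) with a_t ~ pi_phi(.|s_t), shape (B,)
    log_pi:  log pi_phi(a_t | s_t), shape (B,)
    """
    target = (q_theta - log_pi).detach()          # soft state-value target
    return 0.5 * ((v_psi - target) ** 2).mean()
```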
The SAC algorithm is a powerful tool for solving RL problems, demonstrating high efficiency and robustness. The actor-critic architecture allows for more efficient and effective learning by combining both value-based and policy-based methods. Entropy maximisation encourages diverse behaviour and exploration, reducing the brittleness commonly associated with model-free RL methods. The use of off-policy learning enhances sample efficiency, making better use of previously collected data. Overall, SAC offers a scalable and stable approach to RL, particularly in high-dimensional and continuous state and action spaces, and strikes a balance between exploration and exploitation.

2.7. Reinforcement Learning as a Hyper-Heuristic

Selection hyper-heuristics are control mechanisms that identify the most suitable optimisation technique over time, without complete knowledge of the problem domain [35]. Selection hyper-heuristics can function as operators which choose from a collection of population-based optimisation methods like PSO, which run concurrently. This approach is known as the heterogeneous meta-hyper-heuristic (HMHH) framework [36,37,38]. Although the focus of this paper is not the design of selection hyper-heuristics, HMHH warrants mention as it has seen the application of RL to population-based optimisation methods. Note, however, that in this context, RL does not refer to an agent being trained, but rather to the fact that a feedback signal is used to change the selection of the optimisation method.
This section examines the application of RL in selection hyper-heuristics within the HMHH framework. Two RL-inspired approaches, frequency improvement reinforcement learning selection (HH_ReinfFr) and objective function proportional reinforcement learning selection (HH_ReinfPr), are discussed.

2.7.1. Frequency Improvement Reinforcement Learning Selection

HH_ReinfFr [35] is a hyper-heuristic approach inspired by the works of [39,40]. In this method, each heuristic $h_m$ has a rank score $r_m$, initially set to zero. The heuristics are rewarded or punished based on the frequency with which they improve the quality of assigned candidate solutions. The change in rank for a heuristic $h_m$ is calculated as $\Delta r_m(t) = \chi_m(t)$, where $\chi_m(t)$ represents the frequency score according to
$$\chi_m(t) = \sum_{\hat{t}=1}^{n_t} \sum_{i=1}^{n_m} \begin{cases} +1 & \text{if } f_i \text{ improved} \\ -1 & \text{if } f_i \text{ remained the same} \\ -1 & \text{if } f_i \text{ worsened} \end{cases} \qquad (20)$$
where $f_i$ is the quality of candidate solution $i$, $n_m$ is the number of candidate solutions assigned to heuristic $h_m$, and $n_t$ is the number of iterations over which the calculation is performed. The range of $\chi_m(t)$ is $[-n_s \times n_t,\ n_s \times n_t]$. Rank updates happen according to $r_m(t) = r_m(t - n_t) + \Delta r_m(t)$. Candidate solutions are assigned to the highest-ranked heuristic, following the approach by [40], in effect blacklisting all other heuristics. The values in $\Theta(t)$ are assigned as $P_m = 1$ for $m = r$, the index of the highest-ranked heuristic $h_r \in H$, while $P_m = 0$ for $m \neq r$. Rank ties are resolved randomly, and the same $\Theta(t)$ is employed for assigning all candidate solutions.

2.7.2. Objective Function Proportional Reinforcement Learning Selection

HH_ReinfPr [35] is another hyper-heuristic approach. While similar to HH_ReinfFr, the difference lies in the adjustment of ranks, which is based on the mean change in quality of candidate solutions. The mean change in solution quality is calculated using
$$\phi_m(t) = \frac{\sum_{i=1}^{n_m} \left[ f_i(t) - f_i(t - n_t) \right]}{n_m} \qquad (21)$$
The change in rank $\Delta r_m(t)$ for each heuristic $h_m$ is calculated as
$$\Delta r_m(t) = \begin{cases} +1 & \text{if } \phi_m(t) > 0 \\ -1 & \text{if } \phi_m(t) = 0 \\ -1 & \text{if } \phi_m(t) < 0 \end{cases}$$
Ranks are updated using the formula $r_m(t) = r_m(t - n_t) + \Delta r_m(t)$. Initial ranks for all heuristics are $r_m = r_{\min} = 0$. As for HH_ReinfFr, candidate solutions are assigned to the highest-ranked heuristic in accordance with the approach by [40]. The $\Theta(t)$ probabilities are set as $P_m = 1$ for $m = r$ in the case of the highest-ranked heuristic $h_r \in H$, while $P_m = 0$ for $m \neq r$. Rank ties are broken at random, and the same $\Theta(t)$ is used for assigning all candidate solutions. The difference between HH_ReinfFr and HH_ReinfPr is that the former rewards heuristics based on the frequency with which they improve solution quality, while the latter rewards heuristics based on the mean change in quality of the candidate solutions assigned to them.
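A compact sketch of the two rank-update rules follows; the inputs (`improved` per candidate solution, and the mean change `mean_change`) are assumed to be computed per heuristic over the last $n_t$ iterations, and the sign conventions mirror the reconstruction of Equations (20) and (21) above.

```python
import numpy as np

def reinf_fr_delta(improved):
    """HH_ReinfFr: +1 per improved candidate solution, -1 otherwise (Equation (20))."""
    return int(np.sum(np.where(np.asarray(improved), 1, -1)))

def reinf_pr_delta(mean_change):
    """HH_ReinfPr: reward only a positive mean change in quality (Equation (21))."""
    return 1 if mean_change > 0 else -1

def update_rank(rank, delta):
    """r_m(t) = r_m(t - n_t) + delta_r_m(t)."""
    return rank + delta
```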

3. Design of Soft Actor-Critic Self-Adaptive Mechanism

This section details the design of the SAC-SAPSO algorithm, which uses the SAC to dynamically adjust the PSO CPs. Initially, the section motivates why RL is ideally suited, and a suitable RL algorithm is chosen. Thereafter, the SAC-SAPSO architecture is described, including design choices pertaining to the RL environment, agent, and reward function.

3.1. Motivation for Reinforcement Learning for Self-Adaptation

Development of a self-adaptive PSO algorithm essentially amounts to the adjustment of CP values, based on the behaviour of the particles, so as to improve the solution quality. This can be framed as an RL problem, where an agent finds a policy to govern its actions (choosing new CP values), based on observations (behaviour of the particles), in order to maximise a reward signal (solution quality). Using RL for SAPSO is somewhat related to UAPSO-A [29], because the UAPSO-A learning automata can be considered as simple RL agents which receive reward if the selected CP values improved solution quality. However, UAPSO-A is subject to several limitations:
  • UAPSO-A does not consider particle characteristics such as velocity or position to inform CP updates.
  • Orthogonal automata are used for each CP, which precludes the decision-making mechanism from considering the other CP values when updating the current CP.
  • The UAPSO-A mechanism makes use of simplistic probability-based decision-making, which assumes previously successful CP values to be successful in the future. This assumption is not necessarily true, given deceptive optimisation problems with changing function landscape characteristics.
  • UAPSO-A can only select CP values from a discrete set, disallowing fine-grained adjustment.
Moreover, RL has been used to statically pre-select hyperparameters or to tune hyperparameters [15,16,17,18]. However, no examples were found where RL is used to dynamically adjust hyperparameters during execution of the algorithm. Algorithms such as HH_ReinfFr and HH_ReinfPr (discussed in Section 2.7.1 and Section 2.7.2) also use RL, but they select amongst different heuristics, which also differs greatly from dynamic adjustment of PSO CPs during the search process.
A significant advantage of using RL for self-adaptation is that an RL agent can automatically discover the best policy. Most SAPSO approaches are based on inductive priors, i.e., the authors believe that adapting the CPs in a certain way is more likely to lead to a good outcome. For example, PSO-TVAC (see Section 2.2) is based on the assumption that decreasing particle inertia will lead to improved performance. While this may well be the case, it is difficult to reason a priori about the behaviour of the PSO swarm. For example, there is complex interplay between the fitness landscape, the CP configuration, and the particle positions and solutions. Correlations between these factors cannot easily be captured by a simple heuristic, but can be learned by an RL agent, which can then adjust the CP configuration accordingly. Also, if performed manually, only a few approaches toward SAPSO can be tried, whereas an RL agent explores millions of policies.
Using supervised neural networks (NNs) to adapt PSO CPs has also been considered. For example, the input to the NN could be information about the particles, such as fitness landscape characteristics at current positions, and the output could be the CP configuration predicted for the next time-step. This approach is inferior to RL for the following reasons:
  • Labelling vs. Interacting: In order to train the NN, a labelled dataset is required. Since the NN has to predict the CPs, supervised learning requires knowledge of the “correct” CPs in order for the NN to train, which is, of course, not known in advance. RL bypasses this issue, as the agent interacts with the environment autonomously, and learns from experience. In this case, the agent is the SAC, and the environment is the PSO search process, which provides information such as particle velocities and solutions to the agent.
  • Exploration vs. Exploitation: Just as the particle swarm should ideally maintain a suitable balance between exploration and exploitation, this balance should also be present in the CP space. NNs are not capable of exploring the CP space, but merely learn from labels. Conversely, the exploration-exploitation paradigm lies at the heart of the SAC formulation, as discussed in Section 2.6. The SAC agent explores the CP space by taking random actions, and exploits the CP space by taking actions which have been shown to be successful in the past.
To conclude, RL offers a promising avenue for self-adaptation in PSO algorithms by dynamically adjusting CP values during the execution of the algorithm. RL overcomes several limitations of existing methods such as UAPSO-A and supervised neural networks by enabling a more fine-grained, informed, and adaptive decision-making process. By interacting with the environment and learning from experience, an RL agent can discover optimal policies and maintain a balance between exploration and exploitation in the CP space, leading to improved performance and solution quality in complex optimisation problems.

3.2. Reinforcement Learning Algorithm Selection

A plethora of RL algorithms exist, some more suitable than others towards SAPSO. The following criteria are considered:
  • Model-free, not model-based: Since there is no preexisting model of how the CP configuration should be adapted (if there was, the problem would be solved already), the agent has to learn in a model-free way.
  • Off-policy, not on-policy: Exploration of the CP space is aided if the agent is permitted to take actions off-policy. If the agent only takes actions on-policy, the only possible actions are those which have been successful in the past, increasing the likelihood of becoming stuck in a local optimum.
  • Continuous action space, not discrete: The possible CP values which can be selected form a continuous range, and the policy should, therefore, have a continuous action space. Contrast this with the UAPSO-A algorithm in Section 2.5.2, which outputs a range of discrete values. Discretisation adds complexity, because the agent must choose between n different values on every CP update, yet offers less flexibility than a continuous action space, since the agent remains limited to those n specific values.
  • Continuous state space, not discrete: All variables in the PSO formulation (i.e., particle positions, velocities, etc.) are continuous floating-point values. Since these are the variables which the agent observes, the agent should have a continuous state (observation) space.
A comparison of RL algorithms is given in Table 1, from which deep deterministic policy gradient (DDPG), twin delayed DDPG (TD3), and SAC are eligible according to the aforementioned criteria. Since TD3 and DDPG are more prone to hyperparameter instability [14], they are eliminated for the purposes of this study.
Furthermore, in cases where the RL environment is not computationally expensive to simulate or execute, it is often preferable to use an algorithm with lower sample efficiency and to run it for more iterations. However, since evaluation of the environment (i.e., the PSO search) amounts to computing velocities and positions for $n_s$ particles in $n_x$ dimensions, this makes for a rather expensive environment. For this reason, an algorithm with higher sample efficiency allows the environment to be evaluated less often. SAC fits the aforementioned criteria, because it has higher-than-normal sample efficiency. It is also robust to convergence brittleness, as discussed in Section 2.6, and is, therefore, the algorithm of choice for the development of the RL-based SAPSO.

3.3. SAC-SAPSO Architecture

The architecture of SAC-SAPSO consists of the PSO and the SAC agent, which are implemented as described in Section 2.1 and Section 2.6, respectively. Interaction between PSO and SAC lies in that the SAC agent observes the PSO search process, which is the environment. The SAC agent then takes actions which influence the PSO environment.
This section presents the SAC-SAPSO architecture, consisting of the PSO and a SAC agent. The interaction between these components is explained, as well as the PSO environment details, the SAC agent’s actions, and the formulation of the SAC-SAPSO reward signal. Detailed discussion covers topics like PSO metrics, normalisation of velocities, SAC agent’s policy, and benefits and drawbacks of absolute and relative reward signals.
The SAC-SAPSO is summarised in Algorithm 1.  
Algorithm 1: SAC-SAPSO Algorithm
[Algorithm 1 is presented as an image in the original publication.]

3.3.1. PSO as the Environment

The PSO search process serves as the environment for the SAC agent, which observes particle velocities, the percentage of stable particles, the percentage of infeasible particles, and the percentage completion of the search at any given time-step. See Section 4.1 for a discussion of these metrics, except for percentage completion, which is defined as the current number of time-steps divided by the maximum number of time-steps. The observation space thus has a dimensionality of $n_s + 3$, where $n_s$ is the number of particles in the swarm. This is because particle velocities are observed for each particle, while the other metrics are calculated for the entire swarm. To ensure a unitary observation space for the SAC agent, particle velocities are normalised and scaled according to the transformation,
$$\mathbf{v}_{\mathrm{norm}} = \tanh\left( \frac{2}{u - \ell} \cdot \left( \mathbf{v} - \frac{\ell + u}{2} \right) \right) \qquad (22)$$
where $\mathbf{v}$ is the velocity vector of the particle, $\ell$ is the lower bound of the feasible search space, and $u$ is the upper bound of the feasible search space. Equation (22) first normalises particle velocities by the boundaries of the feasible search space. This ensures that a velocity of 1 corresponds to traversal of the entire search space in one time-step. However, since velocity explosion [9,25] is a common problem in PSO, particle velocities can assume extremely large values. To still retain a unitary observation space, the normalised velocity vector is then squashed using the hyperbolic tangent function to the range $[-1, 1]$. Intuitively, a particle with a velocity which has exploded to $10^5$ has just as little chance of finding a good solution in feasible space as does a particle with a velocity of $10^{10}$. The hyperbolic tangent function accounts for this, with all such large velocities being squashed to approximately 1.
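The transformation of Equation (22), as reconstructed here, amounts to the following element-wise sketch:

```python
import numpy as np

def normalise_velocity(v, lower, upper):
    """Observation transform of Equation (22), applied element-wise."""
    scaled = 2.0 / (upper - lower) * (v - (lower + upper) / 2.0)
    return np.tanh(scaled)   # exploded velocities saturate near +/- 1
```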

3.3.2. SAC as the Agent

The SAC agent takes actions according to a policy, which governs the adjustment of the PSO CPs during the search process. Observation of the PSO environment is performed every $n_t$ time-steps. Upon observing the environment, the agent outputs as actions the continuous CP values of $\omega$, $c_1$, and $c_2$ at that time-step. Since all particles are given the same CP values, the action space has a dimensionality of 3. These values are then used to update the PSO CPs according to the velocity update rule in Equation (1).
Updating the CPs on every time-step (setting $n_t = 1$) is inefficient, because the search characteristics do not change sufficiently during one time-step to necessitate a new CP configuration. A larger value for $n_t$ can be used, but introduces an additional hyperparameter. Alternatively, $n_t$ can form part of the action space, thus allowing the agent to autonomously learn an optimal value for $n_t$. The latter option has the advantage that the agent can learn to adjust the observation-action frequency according to the current state of the search process, potentially improving the search performance. The three options are, therefore, $n_t = 1$ to adapt on every time-step, $n_t = k$ to adapt every $k$ time-steps, and $n_t \in \mathcal{A}$ (i.e., $n_t$ forms part of the action space).
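For illustration, the observation and action spaces described above could be declared as follows, here using the Gymnasium API; the CP bounds are assumptions chosen to cover common ranges and are not taken from the paper.

```python
import numpy as np
from gymnasium import spaces

n_s = 30  # swarm size

# observation: n_s tanh-normalised velocities, plus percentage stable,
# percentage infeasible, and percentage completion (all within [-1, 1])
observation_space = spaces.Box(low=-1.0, high=1.0, shape=(n_s + 3,), dtype=np.float32)

# action: the three CPs (omega, c1, c2)
action_space = spaces.Box(low=np.array([0.0, 0.0, 0.0], dtype=np.float32),
                          high=np.array([1.0, 4.0, 4.0], dtype=np.float32))
```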

3.3.3. SAC-SAPSO Reward Signal

The reward signal informs an RL agent whether its actions achieve the desired outcome. The reward acts as a feedback mechanism and guides the agent towards learning the optimal policy, which maximizes the expected cumulative reward over time [13]. Proper reward shaping is crucial to align the goal of the SAC agent with that of the PSO, namely, to ensure that CP adaptation improves search performance. Since the minima and maxima of the training function set are known, it is straightforward to relate the global best solution to an absolute normalised reward signal according to
$$r_{\mathrm{abs}}(t) = \frac{y_f(t-1) - y_f(t)}{y_{\max} - y_{\min}} \qquad (23)$$
where $y_f(t)$ is the global best solution at time-step $t$. The term 'absolute' indicates that the normalisation is performed using the absolute minimum and maximum of each function in the training function set. Note that the generic formula for absolute change is usually defined as
$$\delta_{\mathrm{abs}} = \frac{y_{\mathrm{new}} - y_{\mathrm{old}}}{y_{\max} - y_{\min}} \qquad (24)$$
However, in Equation (23), the numerator is inverted, because a decreasing global best solution implies an increasing reward. Normalisation is necessary, because the functions lie on vastly different magnitude ranges, and ensures a direct relationship between reward maximisation and improvement of the global best solution found. Equation (23) has the reward landscape shown in Figure 1a.
The reason for computing the numerator of $r_{\mathrm{abs}}(t)$ with respect to $y_f(t-1)$ instead of $y_{\max}$ is to ensure that the maximum possible reward for a given function, accumulated over the PSO search of that function, equals 1. The agent thus receives a cumulative reward of 1 only if the PSO manages to find $y_{\min}$ during the search process. Similarly, a cumulative reward of 0.5 indicates that the PSO only found a global best solution halfway between $y_{\max}$ and $y_{\min}$. If the reward were calculated with respect to $y_{\max}$, this would provide less incentive towards finding an optimal solution. For example, if a global best solution at a given time-step produced a reward of 0.5, the agent would continue to receive a reward of at least 0.5 for all subsequent time-steps, even if the global best solution stagnated there.
The reward formulation of Equation (23) is also preferable to that of UAPSO-A (Section 2.5.2), where a constant amount of reward is generated if the objective function value improved, irrespective of the degree of improvement. The UAPSO-A reward formulation could incentivise the agent to optimise for small, incremental steps, instead of solving the function optimally.
Equation (23) evidently has the problem that it can only be used if the minimum and maximum of the objective function are known. If the minimum is known, the objective function has, of course, already been solved. To circumvent this, Equation (23) is modified to obtain a relative reward signal, defined as
$$r_{\mathrm{rel}}(t) = \begin{cases} 0 & \text{if } y_f(t) = y_f(t-1) \\[4pt] 2 \cdot \dfrac{\left(y_f(t-1) + \beta\right) - \left(y_f(t) + \beta\right)}{y_f(t-1) + \beta} & \text{if } y_f(t) > 0,\ y_f(t-1) > 0 \\[4pt] 2 \cdot \dfrac{\left(y_f(t-1) + 2\beta\right) - \left(y_f(t) + 2\beta\right)}{y_f(t-1) + 2\beta} & \text{if } y_f(t) < 0,\ y_f(t-1) < 0 \\[4pt] 1 & \text{otherwise} \end{cases} \qquad (25)$$
where $\beta = |y_f(t-1)| + |y_f(t)|$. The reward, $r_{\mathrm{rel}}(t)$, is, therefore, always calculated only as the improvement in solution quality with respect to the previous time-step, and never makes use of the absolute maximum and minimum of a function. However, functions can take on any range; that is, $y_{\min}$ and $y_{\max}$ can both be positive, both be negative, or negative and positive, respectively, and can be arbitrarily large or small. This precludes simply using the general formula for relative change (with an inverted numerator),
$$\delta_{\mathrm{rel}} = \frac{y_{\mathrm{old}} - y_{\mathrm{new}}}{y_{\mathrm{old}}} \qquad (26)$$
which only works if all values have the same sign, and does not provide unitary normalisation. If they do not, Equation (26) results in a jump discontinuity as soon as $y_{\mathrm{new}}$ changes from positive to negative during the PSO search. Also, when both $y_{\mathrm{old}}$ and $y_{\mathrm{new}}$ are negative, the reward is distorted and inverted. These problems are clear in Figure 1c.
Equation (25) addresses these problems by shifting $y_{\mathrm{old}}$ and $y_{\mathrm{new}}$ by $2\beta$ if both are negative, by $\beta$ if both are positive, and scaling by 2 in both cases. The shift is calculated in this way to ensure that, after shifting, both $y_{\mathrm{old}}$ and $y_{\mathrm{new}}$ are positive and non-zero. Equation (25) also places a fixed limit of $r_{\mathrm{rel}}^{\max} = 1$ for the rare time-steps where the global best solution transitions from positive to negative. The scaling factor of 2 ensures continuity between the four possible cases, with the resulting reward landscape shown in Figure 1b. Note that the objective function value $y$ in Figure 1b,c has not been normalised, with $[-3, 3]$ selected arbitrarily, because the range of $y$ can take any value. Conversely, $y$ is always normalised before calculating Equation (23) to obtain the landscape in Figure 1a, which is why the range is $[0, 1]$. Furthermore, while the area in Figure 1b where $r_{\mathrm{rel}} = 1$ appears to be very large, in reality it only occurs at time-steps where $y(t-1) > 0$ and $y(t) < 0$. These time-steps are rare, because they occur only once or not at all during the PSO search.
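The relative reward of Equation (25) translates into the following sketch, with `y_prev` and `y_new` the global best objective values at time-steps $t-1$ and $t$:

```python
def relative_reward(y_prev, y_new):
    """Relative reward signal, Equation (25)."""
    if y_new == y_prev:
        return 0.0
    beta = abs(y_prev) + abs(y_new)
    if y_prev > 0 and y_new > 0:
        return 2.0 * ((y_prev + beta) - (y_new + beta)) / (y_prev + beta)
    if y_prev < 0 and y_new < 0:
        return 2.0 * ((y_prev + 2 * beta) - (y_new + 2 * beta)) / (y_prev + 2 * beta)
    return 1.0   # global best crossed from positive to negative
```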

4. Experimental Procedure

This section elaborates on the experimental procedure, focusing on evaluation metrics and specifics regarding the PSO implementation. Following this, the details of the algorithm’s implementation are presented. This includes the nature of RL, the role of training and testing sets, and the specifics of conducting an experiment. Lastly, the section delves into the hyperparameters used during the training phase of the SAC algorithm, underscoring their importance in the overall effectiveness and functionality of the SAC-SAPSO algorithm. As various aspects of the algorithm are assessed, a ‘run’ is defined as a single SAC-SAPSO execution on a single function, while an ‘experiment’ denotes r = 30 independent runs for each function in the entire function set.

4.1. Evaluation Metrics

Multiple metrics are utilised to examine the performance and characteristics of the SAC-SAPSO:
  • Normalised global best solution quality. The objective function value, or solution quality at the global best position, signifies how well an algorithm solves a given optimisation problem. Because functions differ in their ranges, the normalised global best solution quality is employed. Once all experiments have concluded, the highest and lowest global best solutions discovered throughout all experiments and runs are used to normalise global best solutions to [0, 1] for each experiment. The normalisation scale is arbitrary, but unitary scaling leads to easily comprehensible results, where lower values represent better solutions.
  • Average swarm diversity offers insights into the balance of exploration and exploitation. Diversity is computed using [50]
    $$D = \frac{1}{n_s} \sum_{i=1}^{n_s} \sqrt{ \sum_{j=1}^{n_x} \left( x_{ij} - \bar{x}_j \right)^2 } \qquad (27)$$
    with the swarm centre located at
    $$\bar{x}_j = \frac{\sum_{i=1}^{n_s} x_{ij}}{n_s} \qquad (28)$$
    where $n_s$ represents the number of particles in the swarm, and $n_x$ refers to the number of dimensions.
  • Percentage of particles in infeasible space. A particle is considered to be in infeasible space if it breaches the boundaries of the feasible search space in at least one dimension. Infeasible particles should not be taken into account when updating the best positions found, so as not to steer the search away from feasible space [9].
  • Percentage of particles that are stable. Poli’s stability condition (given in Section 2.3 in Equation (4)) determines whether a particle has convergent CP configurations, and is considered stable [51].
  • Average particle velocity denotes the average step size. In order for the search process to converge, step sizes have to decrease. However, in order not to become trapped in a local minimum, step sizes should not become too small too quickly. The average particle velocity is calculated using [7] (a code sketch of this metric and of the swarm diversity follows this list)
    $$\Delta(t+1) = \frac{1}{n_s} \sum_{i=1}^{n_s} \left\| \mathbf{x}_i(t+1) - \mathbf{x}_i(t) \right\| \qquad (29)$$
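As referenced in the list above, a sketch of the diversity and average-step-size computations in Equations (27)-(29), assuming `positions` and `prev_positions` are $(n_s, n_x)$ arrays:

```python
import numpy as np

def swarm_diversity(positions):
    """Average distance to the swarm centre, Equations (27) and (28)."""
    centre = positions.mean(axis=0)
    return float(np.mean(np.linalg.norm(positions - centre, axis=1)))

def average_step_size(positions, prev_positions):
    """Average particle step size, Equation (29)."""
    return float(np.mean(np.linalg.norm(positions - prev_positions, axis=1)))
```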

4.2. Implementation

Unlike supervised learning methods, which require separate training and testing sets to learn and to evaluate the performance of a model, RL algorithms are trained on the environment for which they are used. The reason for this difference lies in the nature of the RL learning process. In supervised learning, the model learns from a fixed set of labelled examples, and the testing set is used to evaluate the generalisation ability of the model to unseen data [52]. Conversely, RL agents take actions based on observations to solve environments [13]. Therefore, for each experiment (i.e., each algorithmic aspect under investigation), an agent is trained and $r = 30$ independent evaluation runs are performed. This is done using the set of 45 30-dimensional minimisation functions provided in Appendix A.
However, since the SAC-SAPSO algorithm must generalise, performance on the training function set cannot represent algorithmic performance without bias. Thus, a test set of seven additional functions is used, also detailed in Appendix A. This results in an 85% training, 15% testing split for an honest assessment of performance. The test functions have specifically been chosen to be more complex than the training set in order to demonstrate performance differences between algorithms more clearly.
The inertia weight PSO [19] is employed (described by Equations (1) and (2)), and the neighbourhood for each particle encompasses the entire swarm. The objective function value is only computed for particles situated within feasible space. Particles that fall outside the feasible space are assigned infinite objective values, which, under the assumption of minimisation, disqualifies those particles from global best position updates. The swarm size is $n_s = 30$, each function has a dimensionality of $n_x = 30$, and each run has a maximum of $t_{\max} = 5000$ iterations.
The average and standard deviation for all plots and scores are calculated across all functions and runs within a specific experiment, resulting in $(30 \times 45) + (30 \times 7) = 1560$ runs for each experiment. Given the various algorithmic components examined, a total of 15 experiments are conducted, which translates to 23,400 runs.

4.3. SAC Hyperparameters

The SAC training hyperparameters are shown in Table 2. Note that the SAC training hyperparameters differ from PSO CPs because they are only used during the training of the SAC policy, whereas the CPs are dynamically adjusted during the PSO search to improve performance. Moreover, since the impact of these hyperparameters on the SAC agent is well understood and documented [31], it is possible to reason a priori about the values that should be used. Default values are also provided in the SAC documentation (https://spinningup.openai.com/en/latest/, accessed on 1 November 2023). Finally, these are all training hyperparameters: once a policy is learned which successfully adapts the PSO CPs on the training set, these hyperparameters are discarded and no longer have any bearing on the functioning of the SAPSO algorithm. The SAC hyperparameters, therefore, do not run counter to the aim of SAPSO.
The discount factor γ governs the trade-off between immediate and future rewards, where a low γ will result in the agent being myopic, and a high γ will result in the agent being far-sighted. For infinite-horizon RL problems, γ usually needs to be adjusted carefully. However, for episodic RL problems such as the PSO search, γ = 1 is recommended [13].
The reward scale factor $f_r$ can be used to normalise the reward signal. However, the reward formulation defined in Equation (25) is already normalised to the range $[0, 1]$. Therefore, the reward is not scaled here, i.e., $f_r$ is not used. For completeness, this is equivalent to setting $f_r = 1$.
Actor and critic layer sizes are set to 256 each, the default values recommended by the SAC documentation. Barring the case of too small layer sizes, which could render the policy too simple to learn complex PSO dynamics, these hyperparameters are not expected to have a significant impact on the performance of the learned policy.
Two value (critic) networks and one policy (actor) network are used by the SAC [14]. The target smoothing coefficient τ controls the rate at which these networks are updated using
$$w_{\mathrm{target}} \leftarrow \tau \cdot w_{\mathrm{critic}} + (1 - \tau) \cdot w_{\mathrm{target}} \qquad (30)$$
where $w_{\mathrm{target}}$ represents the weights of the target network, and $w_{\mathrm{critic}}$ denotes the weights of the critic network; $\tau$ lies between 0 and 1, where a smaller $\tau$ corresponds to slower updates, resulting in more stable value estimates but potentially slower learning. A larger $\tau$ leads to faster updates, which may cause higher variance in value estimates and less stable learning. The default, small value of $\tau = 0.005$ is used here, as recommended by the SAC documentation, because slower learning during training has no bearing on the actual usage of the SAC-SAPSO.
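Equation (30) is a Polyak (soft) target update; a minimal PyTorch-style sketch:

```python
import torch

@torch.no_grad()
def soft_update(target_net, critic_net, tau=0.005):
    """Polyak averaging of target-network weights, Equation (30)."""
    for w_t, w_c in zip(target_net.parameters(), critic_net.parameters()):
        w_t.mul_(1.0 - tau).add_(tau * w_c)
```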
The learning rate is set to a small, default value of $\alpha = 0.0003$. Similar reasoning as for $\tau$ applies here.
Replay buffers are used to store past experiences, which are sampled from during training. The main concern regarding the replay buffer size is potential memory constraints during training. A large, default value of $10^6$ is chosen.
The number of training steps is set to $2 \times 10^5$. Because training is stopped upon stagnation in the performance of the learned policy, this value has no bearing on performance and is simply chosen as an arbitrarily large value.
All the hyperparameters in Table 2 are training parameters. This decoupling of training and execution ensures that there are no runtime CPs, except potentially $n_t$. However, $n_t$ can be set to form part of the SAC action space $\mathcal{A}$, which completely automates this parameter as well, as discussed in Section 3.3.2. Such a training-execution decoupling may well be the only way to adapt all PSO CPs, corroborated by the fact that all other approaches to SAPSO end up introducing more runtime CPs in their attempts to adapt $\omega$, $c_1$, and $c_2$, as shown by [7,8].

5. Results

This section details the algorithmic aspects of different experiments, together with their respective results. Four performance baselines are presented together with their comparative outcomes. The final part explores the functionality of the SAC-SAPSO algorithm and its performance implications, highlighting how it outperforms the established baselines. Some plots contain gaps, such as Figure 11e, which occur when particles drift far enough out of the search space to cause numerical overflow in the swarm diversity calculation from Equation (27). Plot lines represent performance averages, and shaded areas the standard deviation, calculated over all functions and runs for the experiment at hand, as discussed in Section 4.2.

5.1. Performance Baselines

Four baselines are defined in this section, and their performance is presented in Table 3 and illustrated in Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6. The first baseline (baseline_constant) employs the constant CP setup outlined in Section 2.2, and the second one (baseline_timevariant) applies the time-variant configuration described by Equation (3). PSO-TVAC (see Section 2.2) is used as the third baseline (baseline_tvac), and the final baseline (baseline_random) randomly samples CPs which adhere to Poli’s convergence criterion in Equation (4) at each time step, for every particle in the swarm [53,54].

5.2. Soft Actor-Critic Control Parameter Adaptation

The SAC-SAPSO algorithm is evaluated for various values of $n_t$ (sac_$n_t$=$k$), in addition to an experiment where $n_t$ is set by the SAC algorithm itself (sac_$n_t$=auto), with results given in Table 4. Figure 7 shows how the CP values are adapted by the SAC-SAPSO algorithm throughout the search process, and Figure 8 shows the number of stable particles in the swarm. On average, about half of the particles in the swarm have stable (convergent) CP configurations, with generally less than 40% of particles residing in infeasible space, as seen in Figure 9. Nonetheless, Figure 10 clearly shows that some particles in the swarm have exploding velocities, which results in the exploded swarm diversity in Figure 11. In some cases, the swarm diversity becomes large enough to cause numerical overflow, indicated by gaps, such as in Figure 11e.
For sac_$n_t$=auto, when considering Figure 7f and Figure 8f jointly, it is interesting to note how the agent which adapts $n_t$ autonomously ultimately decides to keep exploration ($c_1$, $\omega$) high (with an entirely stable CP swarm!) for approximately the first 1000 time-steps. Thereafter, exploration and exploitation are promptly switched around by lowering $c_1$ and $\omega$, and increasing $c_2$. This is especially interesting as the agent has learned to initially maintain stable CP values by itself, without the need to introduce such an inductive prior artificially. Such behaviour suggests more exploration during the first fifth of the search, while keeping the particles mostly within the feasible search space (Figure 9f). The effect is also reflected by the particle velocities in Figure 10f and the swarm diversity (Figure 11f), both of which stay low initially.
According to Table 4, the best-performing sac_$n_t$=10 improves upon the best-performing baseline (baseline_tvac) by 33% for the training set. The test-set performance improvement is less significant at 13%. Self-adaptation using a SAC agent is, therefore, promising, because it improves on the baseline despite the velocity explosion.
Figure 11. Swarm diversity (log) for SAC.

5.3. Soft Actor-Critic Control Parameter Adaptation with Velocity Clamping

To counter the velocity explosion observed in Section 5.2, velocity clamping limits the maximum velocity of each particle, as discussed in Section 2.4. The CP configurations produced by the velocity-clamped SAC-SAPSO are shown in Figure 12. Once again, multiple experiments are conducted for various values of n_t (vc_sac_n_t = k), as well as for the auto-tuned value of n_t (vc_sac_n_t = auto), with results given in Table 5 (refer to Figure 13, Figure 14, Figure 15 and Figure 16). Figure 15 and Figure 16 confirm that velocity clamping successfully constrains particle velocities, and therefore also swarm diversity, as no explosion is present. Table 5 shows that velocity clamping has a substantial effect on performance, which is to be expected, because CP adjustments have a more fine-grained effect on particles which reside within the boundaries of the search space than on particles which overshoot the search space by many orders of magnitude. The best-performing vc_sac_n_t = 125 improves on the best-performing baseline (baseline_tvac) by 52% on the training set; on the test set, the improvement is 69%.
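For reference, velocity clamping itself is a simple component-wise clip of each velocity to $[-V_{max}, V_{max}]$, with $V_{max}$ commonly taken as a fraction of the search-space range per dimension. The sketch below illustrates this; the fraction `delta` is an illustrative assumption, not the value used in the experiments.

```python
import numpy as np

def clamp_velocity(velocity: np.ndarray, x_min: np.ndarray, x_max: np.ndarray,
                   delta: float = 0.5) -> np.ndarray:
    """Component-wise velocity clamping.

    V_max is taken as a fraction `delta` of the search-space range in each
    dimension (an illustrative choice); each velocity component is clipped
    to [-V_max, V_max].
    """
    v_max = delta * (x_max - x_min)
    return np.clip(velocity, -v_max, v_max)
```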

5.4. Overview of Results

Table 6 provides an overview of performance, with the normalised global best solution quality ranked from best to worst. The improvement of the SAC-SAPSO over the baseline PSO is clear, as is the further improvement which results from velocity clamping.
Without velocity clamping, the best-performing sac_n_t = 10 demonstrates a 34% improvement over the constant baseline and a 33% improvement over the time-variant baseline on the training set. On the test set, sac_n_t = 10 outperforms the constant baseline by 32% and the time-variant baseline by 21%. The best-performing vc_sac_n_t = 125 improves on both the constant and time-variant baselines by 52% on the training set; on the test set, the improvements are 76% and 72%, respectively. Interestingly, vc_sac_n_t = 50 actually performs better than vc_sac_n_t = 125 on the test set, outperforming the constant baseline by 82% and the time-variant baseline by 79%.

6. Conclusions

This paper presented a novel approach to the automatic adaptation of control parameters (CPs) in particle swarm optimisation (PSO). The approach is based on the soft actor-critic (SAC) algorithm, which learns a policy for the dynamic adaptation of CPs during the search process, while velocity clamping prevents particle velocities from growing unboundedly. The velocity-clamped soft actor-critic self-adaptive particle swarm optimisation (SAC-SAPSO) algorithm successfully adjusted the CPs, achieving a performance improvement of 52% relative to the constant CP baseline on the training set (76% on the test set), and 52% relative to the time-variant baseline on the training set (72% on the test set). The value of n_t can, of course, still be considered a CP, implying a reduction in the total number of runtime parameters from three to one, in addition to the improvement in performance. If a search process without CPs is desired, the velocity-clamped SAC-SAPSO which automatically adapts n_t (vc_sac_n_t = auto) can be used. Velocity clamping proved useful, firstly to improve performance, but also to prevent the search process from becoming unstable due to unbounded particle velocities. All methods displayed a degree of invariance with respect to the choice of n_t, because all SAC methods outperformed all baselines, and all velocity-clamped SAC methods outperformed all SAC methods.
Finally, the environment and agent were configured in an algorithm-agnostic fashion; many swarm intelligence or evolutionary algorithms with tunable parameters can, therefore, be "plugged into" the framework in order to learn an adaptive policy for their parameters.

7. Future Work

The environment currently exposes only particle velocities, the percentage of stable particles, the percentage of feasible particles, and the percentage completion of the search for the agent to observe. The agent could, in principle, reconstruct a much more nuanced representation of the fitness landscape if it also received particle positions or objective function values, leading to a more landscape-aware PSO. However, to counteract the resulting growth in computational complexity, dimensionality reduction techniques such as principal component analysis (PCA) or linear discriminant analysis (LDA) may have to be used to obtain a lower-dimensional representation of the particle positions.
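As an illustration of the kind of preprocessing envisaged here, the sketch below projects the particle-position matrix onto its leading principal components; scikit-learn's PCA is used purely as an example of how a lower-dimensional observation could be constructed, and the number of components is an arbitrary assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def compress_positions(positions: np.ndarray, n_components: int = 3) -> np.ndarray:
    """Reduce an (n_particles, n_dimensions) position matrix to a compact
    observation by projecting onto the leading principal components and
    flattening the result. Illustrative only."""
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(positions)   # shape: (n_particles, n_components)
    return reduced.flatten()
```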
Furthermore, the action space A can be extended so that the agent may adjust the CP configuration per particle.
The algorithm was evaluated for only 5000 time-steps. An alternative is to not set a hard limit on the maximum number of time-steps, thereby allowing the agent to use as few or as many time-steps as it deems necessary for the problem at hand. This may require a negative reward if the agent exceeds the computational budget. Similarly, the search could be terminated if the swarm converges or if improvement has stalled. These mechanisms would introduce adaptive behaviour to the search length as well.
Similarly, a negative reward can be used to penalise the agent proportionally to the number of particles which leave the feasible search space. While this is likely to lead to more stable CP configurations and fewer infeasible particles, care must be taken when introducing such an inductive bias, as it might also degrade performance, since the agent would no longer optimise purely with respect to solution quality. However, artificially encouraging convergent search behaviour might very well lead the agent to find better solutions, as it helps to prevent the SAC from choosing a CP configuration which lies outside of the convergent CP region.
The performance of the SAC-SAPSO on dynamic optimisation problems will also be studied.

Author Contributions

Conceptualisation, D.v.E. and A.E.; methodology, D.v.E. and A.E.; software, D.v.E.; validation, D.v.E.; formal analysis, D.v.E.; investigation, D.v.E. and A.E.; resources, D.v.E.; data curation, D.v.E.; writing—original draft preparation, D.v.E.; writing—review and editing, A.E.; visualisation, D.v.E.; supervision, A.E.; project administration, D.v.E. and A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets generated during and/or analysed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CP    Control Parameter
PSO   Particle Swarm Optimisation

Appendix A. Benchmark Functions

In the process of designing and validating algorithms for swarm intelligence, it is crucial to employ a diverse and challenging set of optimisation functions. This appendix details the optimisation function set used in the empirical evaluation of the proposed RL-based SAPSO algorithm. These functions are characteristically diverse, encompassing continuous, non-continuous, differentiable, non-differentiable, separable, non-separable, unimodal, and multimodal functions. This wide-ranging collection of functions ensures a robust evaluation of the algorithm under a spectrum of optimisation scenarios, providing a comprehensive understanding of its performance and adaptability.
The optimisation function equations are given below. Their characteristics follow in Table A1, with the test set functions at the end in bold (https://infinity77.net/global_optimization/, accessed on 1 November 2023).
  • Ackley 1 for $x_j \in [-32, 32]$:
    $f(\mathbf{x}) = -20\,e^{-0.2\sqrt{\frac{1}{n_x}\sum_{j=1}^{n_x} x_j^2}} - e^{\frac{1}{n_x}\sum_{j=1}^{n_x}\cos(2\pi x_j)} + 20 + e$  (A1)
  • Alpine 1 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x} \left| x_j \sin(x_j) + 0.1 x_j \right|$  (A2)
  • Bohachevsky 1 for $x_j \in [-15, 15]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1} \left( x_j^2 + 2x_{j+1}^2 - 0.3\cos(3\pi x_j) - 0.4\cos(4\pi x_{j+1}) + 0.7 \right)$  (A3)
  • Bonyadi-Michalewicz for $x_j \in [-5, 5]$:
    $f(\mathbf{x}) = \frac{\prod_{j=1}^{n_x}(x_j + 1)}{\prod_{j=1}^{n_x}\left((x_j + 1)^2 + 1\right)}$  (A4)
  • Brown for $x_j \in [-1, 1]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1} \left[ \left(x_j^2\right)^{\left(x_{j+1}^2 + 1\right)} + \left(x_{j+1}^2\right)^{\left(x_j^2 + 1\right)} \right]$  (A5)
  • Cosine Mixture for $x_j \in [-1, 1]$:
    $f(\mathbf{x}) = -0.1\sum_{j=1}^{n_x}\cos(5\pi x_j) + \sum_{j=1}^{n_x} x_j^2$  (A6)
  • Cross Leg Table for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = -\left( \left| e^{\left|100 - \frac{\sqrt{\sum_{j=1}^{n_x} x_j^2}}{\pi}\right|} \prod_{j=1}^{n_x}\sin(x_j) \right| + 1 \right)^{-0.1}$  (A7)
  • Deflected Corrugated Spring for $x_j \in [0, 2\alpha]$, $K = 5$, $\alpha = 5$:
    $f(\mathbf{x}) = 0.1\sum_{j=1}^{n_x}(x_j - \alpha)^2 - \cos\left( K\sqrt{\sum_{j=1}^{n_x}(x_j - \alpha)^2} \right)$  (A8)
  • Discuss for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = 10^6 x_1^2 + \sum_{j=2}^{n_x} x_j^2$  (A9)
  • Drop Wave for $x_j \in [-5.12, 5.12]$:
    $f(\mathbf{x}) = -\frac{1 + \cos\left(12\sqrt{\sum_{j=1}^{n_x} x_j^2}\right)}{2 + 0.5\sum_{j=1}^{n_x} x_j^2}$  (A10)
  • Egg Crate for $x_j \in [-5, 5]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x} x_j^2 + 24\sum_{j=1}^{n_x}\sin^2(x_j)$  (A11)
  • Egg Holder for $x_j \in [-512, 512]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left[ -(x_{j+1} + 47)\sin\sqrt{\left| x_{j+1} + x_j/2 + 47 \right|} - x_j\sin\sqrt{\left| x_j - (x_{j+1} + 47) \right|} \right]$  (A12)
  • Elliptic for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x} \left(10^6\right)^{\frac{j-1}{n_x-1}} x_j^2$  (A13)
  • Exponential for $x_j \in [-1, 1]$:
    $f(\mathbf{x}) = -e^{-0.5\sum_{j=1}^{n_x} x_j^2}$  (A14)
  • Giunta for $x_j \in [-1, 1]$:
    $f(\mathbf{x}) = 0.6 + \sum_{j=1}^{2}\left[ \sin\left(\tfrac{16}{15}x_j - 1\right) + \sin^2\left(\tfrac{16}{15}x_j - 1\right) + \tfrac{1}{50}\sin\left(4\left(\tfrac{16}{15}x_j - 1\right)\right) \right]$  (A15)
  • Holder Table 1 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = -\left| \prod_{j=1}^{n_x}\cos(x_j)\, e^{\left|1 - \frac{\sqrt{\sum_{j=1}^{n_x} x_j^2}}{\pi}\right|} \right|$  (A16)
  • Lanczos 3 for $x_j \in [-20, 20]$:
    $f(\mathbf{x}) = \prod_{j=1}^{n_x} \operatorname{sinc}(x_j)\,\operatorname{sinc}(x_j / k)$, where $\operatorname{sinc}(x) = \frac{\sin(x)}{x}$  (A17)
  • Levy 3 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sin^2(\pi y_1) + \sum_{j=1}^{n_x-1}(y_j - 1)^2\left[1 + 10\sin^2(\pi y_{j+1})\right] + (y_{n_x} - 1)^2$, where $y_j = 1 + \frac{x_j - 1}{4}$  (A18)
  • Levy–Montalvo 2 for $x_j \in [-5, 5]$:
    $f(\mathbf{x}) = 0.1\sin^2(3\pi x_1) + 0.1\sum_{j=1}^{n_x-1}(x_j - 1)^2\left[\sin^2(3\pi x_{j+1}) + 1\right] + 0.1(x_{n_x} - 1)^2\left[\sin^2(2\pi x_{n_x}) + 1\right]$  (A19)
  • Michalewicz for $x_j \in [0, \pi]$:
    $f(\mathbf{x}) = -\sum_{j=1}^{n_x}\sin(x_j)\sin^{2m}\!\left(\frac{j x_j^2}{\pi}\right)$, where $m = 10$  (A20)
  • Mishra 1 for $x_j \in [0, 1]$:
    $f(\mathbf{x}) = \left(1 + n_x - \sum_{j=1}^{n_x-1} x_j\right)^{\left(n_x - \sum_{j=1}^{n_x-1} x_j\right)}$  (A21)
  • Mishra 4 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sqrt{\left|\sin\sqrt{\left|\sum_{j=1}^{n_x} x_j^2\right|}\right|} + 0.01\sum_{j=1}^{n_x} x_j$  (A22)
  • Needle Eye for $x_j \in [-10, 10]$ and $eye = 0.0001$:
    $f(\mathbf{x}) = \begin{cases} 1 & \text{if } |x_j| < eye\ \forall j \\ \sum_{j=1}^{n_x}\left(100 + |x_j|\right) & \text{if } |x_j| > eye \\ 0 & \text{otherwise} \end{cases}$  (A23)
  • Norwegian for $x_j \in [-1.1, 1.1]$:
    $f(\mathbf{x}) = \prod_{j=1}^{n_x}\cos\left(\pi x_j^3\right)\left(\frac{99 + x_j}{100}\right)$  (A24)
  • Pathological for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\frac{\sin^2\sqrt{100 x_j^2 + x_{j+1}^2} - 0.5}{0.5 + 0.001\left(x_j - x_{j+1}\right)^4}$  (A25)
  • Penalty 1 for $x_j \in [-50, 50]$:
    $f(\mathbf{x}) = \frac{\pi}{30}\left\{ 10\sin^2(\pi y_1) + \sum_{j=1}^{n_x-1}(y_j - 1)^2\left[1 + 10\sin^2(\pi y_{j+1})\right] + (y_{n_x} - 1)^2 \right\} + \sum_{j=1}^{n_x} u(x_j, 10, 100, 4)$  (A26)
    where $y_j = 1 + \frac{1}{4}(x_j + 1)$ and
    $u(x_j, a, k, m) = \begin{cases} k(x_j - a)^m & \text{if } x_j > a \\ 0 & \text{if } -a \le x_j \le a \\ k(-x_j - a)^m & \text{if } x_j < -a \end{cases}$
  • Penalty 2 for $x_j \in [-50, 50]$:
    $f(\mathbf{x}) = 0.1\left\{ \sin^2(3\pi x_1) + \sum_{j=1}^{n_x-1}(x_j - 1)^2\left[1 + \sin^2(3\pi x_{j+1})\right] + (x_{n_x} - 1)^2\left[1 + \sin^2(2\pi x_{n_x})\right] \right\} + \sum_{j=1}^{n_x} u(x_j, 5, 100, 4)$  (A27)
    with $u(x_j, a, k, m)$ as defined for Penalty 1
  • Periodic for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = 1 + \sum_{j=1}^{n_x}\sin^2(x_j) - 0.1\,e^{-\sum_{j=1}^{n_x} x_j^2}$  (A28)
  • Pinter for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x} j x_j^2 + \sum_{j=1}^{n_x} 20 j \sin^2 A + \sum_{j=1}^{n_x} j \log_{10}\left(1 + j B^2\right)$  (A29)
    where $A = x_{j-1}\sin x_j + \sin x_{j+1}$ and $B = x_{j-1}^2 - 2x_j + 3x_{j+1} - \cos x_j + 1$
  • Price 2 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = 1 + \sum_{j=1}^{n_x}\sin^2(x_j) - 0.1\,e^{-\sum_{j=1}^{n_x} x_j^2}$  (A30)
  • Qings for $x_j \in [-500, 500]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\left(x_j^2 - j\right)^2$  (A31)
  • Quadric for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\left(\sum_{l=1}^{j} x_l\right)^2$  (A32)
  • Quintic for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\left| x_j^5 - 3x_j^4 + 4x_j^3 + 2x_j^2 - 10x_j - 4 \right|$  (A33)
  • Rana for $x_j \in [-500, 500]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left[ (x_{j+1} + 1)\cos(t_2)\sin(t_1) + x_j\cos(t_1)\sin(t_2) \right]$  (A34)
    where $t_1 = \sqrt{\left| x_{j+1} + x_j + 1 \right|}$ and $t_2 = \sqrt{\left| x_{j+1} - x_j + 1 \right|}$
  • Rastrigin for $x_j \in [-5.12, 5.12]$:
    $f(\mathbf{x}) = 10 n_x + \sum_{j=1}^{n_x}\left[ x_j^2 - 10\cos(2\pi x_j) \right]$  (A35)
  • Ripple 25 for $x_j \in [0, 1]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x} -e^{-2\ln 2 \left(\frac{x_j - 0.1}{0.8}\right)^2}\sin^6(5\pi x_j)$  (A36)
  • Rosenbrock for $x_j \in [-30, 30]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left[ 100\left(x_{j+1} - x_j^2\right)^2 + \left(x_j - 1\right)^2 \right]$  (A37)
  • Salomon for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = -\cos\left(2\pi\sqrt{\sum_{j=1}^{n_x} x_j^2}\right) + 0.1\sqrt{\sum_{j=1}^{n_x} x_j^2} + 1$  (A38)
  • Schaffer 4 for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left[ 0.5 + \frac{\cos^2\left(\sin\left(x_j^2 - x_{j+1}^2\right)\right) - 0.5}{\left(1 + 0.001\left(x_j^2 + x_{j+1}^2\right)\right)^2} \right]$  (A39)
  • Schubert 4 for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\sum_{k=1}^{5} k\cos\left((k+1)x_j + k\right)$  (A40)
  • Schwefel 1 for $x_j \in [-100, 100]$ and $\alpha = \sqrt{\pi}$:
    $f(\mathbf{x}) = \left(\sum_{j=1}^{n_x} x_j^2\right)^{\alpha}$  (A41)
  • Sine Envelope for $x_j \in [-100, 100]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left[ 0.5 + \frac{\sin^2\sqrt{x_j^2 + x_{j+1}^2} - 0.5}{\left(1 + 0.001\left(x_j^2 + x_{j+1}^2\right)\right)^2} \right]$  (A42)
  • Sinusoidal for $x_j \in [0, 180]$:
    $f(\mathbf{x}) = -\left[ A\prod_{j=1}^{n_x}\sin\left(x_j - z\right) + \prod_{j=1}^{n_x}\sin\left(B\left(x_j - z\right)\right) \right]$  (A43)
  • Step Function 3 for $x_j \in [-5.12, 5.12]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\left\lfloor x_j^2 \right\rfloor$  (A44)
  • Stretched V Sine Wave for $x_j \in [-10, 10]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x-1}\left(x_j^2 + x_{j+1}^2\right)^{0.25}\left[ \sin^2\left(50\left(x_j^2 + x_{j+1}^2\right)^{0.1}\right) + 0.1 \right]$  (A45)
  • Trid for $x_j \in [-20, 20]$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\left(x_j - 1\right)^2 - \sum_{j=2}^{n_x} x_j x_{j-1}$  (A46)
  • Trigonometric for $x_j \in [0, \pi]$:
    $f(\mathbf{x}) = \sum_{i=1}^{n_x}\left[ n_x - \sum_{j=1}^{n_x}\cos(x_j) + i\left(1 - \cos(x_i)\right) - \sin(x_i) \right]^2$  (A47)
  • Wavy for $x_j \in [-\pi, \pi]$:
    $f(\mathbf{x}) = 1 - \frac{1}{n_x}\sum_{j=1}^{n_x}\cos(k x_j)\,e^{-\frac{x_j^2}{2}}$  (A48)
  • Weierstrass for $j_{max} = 20$, $a = 0.5$, $b = 3$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\sum_{k=0}^{j_{max}} a^k\cos\left(2\pi b^k \left(x_j + 0.5\right)\right) - n_x\sum_{k=0}^{j_{max}} a^k\cos\left(\pi b^k\right)$  (A49)
  • Vincent for $x_j \in [0.25, 10]$:
    $f(\mathbf{x}) = -\sum_{j=1}^{n_x}\sin\left(10\log(x_j)\right)$  (A50)
  • Xin-She Yang 1 for $\epsilon_j \sim U(0, 1)$:
    $f(\mathbf{x}) = \sum_{j=1}^{n_x}\epsilon_j\left|x_j\right|^{j}$  (A51)
  • Xin-She Yang 2 for $x_j \in [-2\pi, 2\pi]$:
    $f(\mathbf{x}) = \left(\sum_{j=1}^{n_x}\left|x_j\right|\right) e^{-\sum_{j=1}^{n_x}\sin\left(x_j^2\right)}$  (A52)
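To illustrate how the functions above are evaluated in code, the sketch below implements two of them, Ackley 1 (Equation (A1)) and Rastrigin (Equation (A35)), for arbitrary dimensionality. It is an example implementation only, not the evaluation harness used in the experiments.

```python
import numpy as np

def ackley_1(x: np.ndarray) -> float:
    """Ackley 1 (Equation (A1)), defined for x_j in [-32, 32]."""
    n = len(x)
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / n))
    term2 = -np.exp(np.sum(np.cos(2.0 * np.pi * x)) / n)
    return float(term1 + term2 + 20.0 + np.e)

def rastrigin(x: np.ndarray) -> float:
    """Rastrigin (Equation (A35)), defined for x_j in [-5.12, 5.12]."""
    return float(10 * len(x) + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x)))

# Both functions have their global minimum of 0 at the origin, e.g.
# ackley_1(np.zeros(30)) ~ 0.0 and rastrigin(np.zeros(30)) == 0.0.
```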
Table A1. Function characteristics (C = continuous, NC = non-continuous, D = differentiable, ND = non-differentiable, S = separable, NS = non-separable, MM = multi-modal, UM = unimodal).

| Function | Equation | Cont. | Diff. | Sep. | Mod. |
|---|---|---|---|---|---|
| Ackley 1 | (A1) | C | D | NS | MM |
| Alpine 1 | (A2) | C | ND | S | MM |
| Bohachevsky 1 | (A3) | C | D | S | MM |
| Bonyadi-Michalewicz | (A4) | C | D | NS | MM |
| Brown | (A5) | C | D | NS | UM |
| Cosine Mixture | (A6) | C | D | S | MM |
| Deflected Corrugated Spring | (A8) | C | D | S | MM |
| Discuss | (A9) | C | D | S | UM |
| Drop Wave | (A10) | C | D | NS | MM |
| Egg Crate | (A11) | C | D | S | MM |
| Egg Holder | (A12) | C | ND | NS | MM |
| Elliptic | (A13) | C | D | S | UM |
| Exponential | (A14) | C | D | NS | UM |
| Giunta | (A15) | C | D | S | MM |
| Holder Table 1 | (A16) | C | ND | NS | MM |
| Levy 3 | (A18) | C | D | NS | MM |
| Levy–Montalvo 2 | (A19) | C | D | NS | MM |
| Mishra 1 | (A21) | C | D | NS | MM |
| Mishra 4 | (A22) | C | ND | NS | MM |
| Needle Eye | (A23) | C | ND | S | MM |
| Norwegian | (A24) | C | D | NS | MM |
| Pathological | (A25) | C | D | NS | MM |
| Penalty 1 | (A26) | C | D | NS | MM |
| Penalty 2 | (A27) | C | D | NS | MM |
| Periodic | (A28) | C | D | S | MM |
| Pinter 2 | (A29) | C | D | NS | MM |
| Price 2 | (A30) | C | D | NS | MM |
| Qings | (A31) | C | D | S | MM |
| Quadric | (A32) | C | D | NS | UM |
| Quintic | (A33) | C | ND | S | MM |
| Rana | (A34) | C | ND | NS | MM |
| Rastrigin | (A35) | C | D | S | MM |
| Ripple 25 | (A36) | C | D | S | MM |
| Rosenbrock | (A37) | C | D | NS | MM |
| Salomon | (A38) | C | D | NS | MM |
| Schubert 4 | (A40) | C | D | S | MM |
| Schwefel 1 | (A41) | C | D | S | UM |
| Sinusoidal | (A43) | C | D | NS | MM |
| Step Function 3 | (A44) | NC | ND | S | MM |
| Trid | (A46) | C | D | NS | MM |
| Trigonometric | (A47) | C | D | NS | MM |
| Vincent | (A50) | C | D | S | MM |
| Weierstrass | (A49) | C | D | S | MM |
| Xin-She Yang 1 | (A51) | NC | ND | S | MM |
| Xin-She Yang 2 | (A52) | C | ND | NS | MM |
| **Cross Leg Table** | (A7) | NC | ND | NS | MM |
| **Lanczos 3** | (A17) | C | D | NS | MM |
| **Michalewicz** | (A20) | C | D | S | MM |
| **Schaffer 4** | (A39) | C | D | NS | MM |
| **Sine Envelope** | (A42) | C | D | NS | MM |
| **Stretched V Sine Wave** | (A45) | C | D | NS | MM |
| **Wavy** | (A48) | C | D | S | UM |

References

  1. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
  2. Beielstein, T.; Parsopoulos, K.E.; Vrahatis, M.N. Tuning PSO Parameters Through Sensitivity Analysis; Technical Report Interner Bericht des Sonderforschungsbereichs (SFB) 531 Computational Intelligence No.CI-124/02; Universitätsbibliothek Dortmund: Dortmund, Germany, 2002. [Google Scholar]
  3. Van den Bergh, F.; Engelbrecht, A.P. A study of particle swarm optimization particle trajectories. Inf. Sci. 2006, 176, 937–971. [Google Scholar] [CrossRef]
  4. Bonyadi, M.R.; Michalewicz, Z. Impacts of coefficients on movement patterns in the particle swarm optimization algorithm. IEEE Trans. Evol. Comput. 2016, 21, 378–390. [Google Scholar] [CrossRef]
  5. Bratton, D.; Kennedy, J. Defining a standard for particle swarm optimization. In Proceedings of the IEEE Swarm Intelligence Symposium, Honolulu, HI, USA, 1–5 April 2007; pp. 120–127. [Google Scholar]
  6. Jiang, M.; Luo, Y.; Yang, S. Stochastic convergence analysis and parameter selection of the standard particle swarm optimization algorithm. Inf. Process. Lett. 2007, 102, 8–16. [Google Scholar] [CrossRef]
  7. Harrison, K.; Engelbrecht, A.P.; Ombuki-Berman, B. Self-adaptive particle swarm optimization: A review and analysis of convergence. Swarm Intell. 2018, 12, 187–226. [Google Scholar] [CrossRef]
  8. Harrison, K.; Engelbrecht, A.; Ombuki-Berman, B. Inertia Control Strategies for Particle Swarm Optimization: Too Much Momentum, Not Enough Analysis. Swarm Intell. 2016, 10, 267–305. [Google Scholar] [CrossRef]
  9. Engelbrecht, A.P. Roaming Behavior of Unconstrained Particles. In Proceedings of the BRICS Congress on Computational Intelligence and 11th Brazilian Congress on Computational Intelligence, Ipojuca, Brazil, 8–11 September 2013; pp. 104–111. [Google Scholar]
  10. Harrison, K.R.; Engelbrecht, A.P.; Ombuki-Berman, B.M. Optimal parameter regions and the time-dependence of control parameter values for the particle swarm optimization algorithm. Swarm Evol. Comput. 2018, 41, 20–35. [Google Scholar] [CrossRef]
  11. Ratnaweera, A.; Halgamuge, S.; Watson, H. Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients. IEEE Trans. Evol. Comput. 2004, 8, 240–255. [Google Scholar] [CrossRef]
  12. Leonard, B.J.; Engelbrecht, A.P. On the optimality of particle swarm parameters in dynamic environments. In Proceedings of the IEEE Congress on Evolutionary Computation, Cancun, Mexico, 20–23 June 2013; pp. 1564–1569. [Google Scholar]
  13. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
  14. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  15. Barsce, J.C.; Palombarini, J.A.; Martínez, E.C. Automatic tuning of hyper-parameters of reinforcement learning algorithms using Bayesian optimization with behavioral cloning. arXiv 2021, arXiv:2112.08094. [Google Scholar]
  16. Talaat, F.M.; Gamel, S.A. RL based hyper-parameters optimization algorithm (ROA) for convolutional neural network. J. Ambient Intell. Humaniz. Comput. 2022, 13, 3389–3402. [Google Scholar] [CrossRef]
  17. Liu, X.; Wu, J.; Chen, S. Efficient hyperparameters optimization through model-based reinforcement learning with experience exploiting and meta-learning. Soft Comput. 2023, 27, 7051–7066. [Google Scholar] [CrossRef]
  18. Wauters, T.; Verbeeck, K.; De Causmaecker, P.; Vanden Berghe, G. Boosting Metaheuristic Search Using Reinforcement Learning. In Hybrid Metaheuristics; Talbi, E.G., Ed.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 433–452. [Google Scholar]
  19. Shi, Y.; Eberhart, R. A modified particle swarm optimizer. In Proceedings of the IEEE International Conference on Evolutionary Computation, Anchorage, AK, USA, 4–9 May 1998; Volume 6, pp. 69–73. [Google Scholar]
  20. Clerc, M.; Kennedy, J. The Particle Swarm-Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Trans. Evol. Comput. 2002, 6, 58–73. [Google Scholar] [CrossRef]
  21. Sermpinis, G.; Theofilatos, K.; Karathanasopoulos, A.; Georgopoulos, E.F.; Dunis, C. Forecasting foreign exchange rates with adaptive neural networks using radial-basis functions and Particle Swarm Optimization. Eur. J. Oper. Res. 2013, 225, 528–540. [Google Scholar] [CrossRef]
  22. Poli, R. Mean and Variance of the Sampling Distribution of Particle Swarm Optimizers During Stagnation. IEEE Trans. Evol. Comput. 2009, 13, 712–721. [Google Scholar] [CrossRef]
  23. Poli, R.; Broomhead, D. Exact Analysis of the Sampling Distribution for the Canonical Particle Swarm Optimiser and Its Convergence During Stagnation. In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, London, UK, 7–11 July 2007; Association for Computing Machinery: New York, NY, USA, 2007; pp. 134–141. [Google Scholar]
  24. von Eschwege, D.; Engelbrecht, A. A Cautionary Note on Poli’s Stability Condition for Particle Swarm Optimization. In Proceedings of the IEEE Swarm Intelligence Symposium, Mexico City, Mexico, 5–8 December 2023. [Google Scholar]
  25. Oldewage, E.T.; Engelbrecht, A.P.; Cleghorn, C.W. The merits of velocity clamping particle swarm optimisation in high dimensional spaces. In Proceedings of the IEEE Symposium Series on Computational Intelligence, Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8. [Google Scholar]
  26. Li, X.; Fu, H.; Zhang, C. A Self-Adaptive Particle Swarm Optimization Algorithm. In Proceedings of the International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 5, pp. 186–189. [Google Scholar]
  27. Dong, C.; Wang, G.; Chen, Z.; Yu, Z. A Method of Self-Adaptive Inertia Weight for PSO. In Proceedings of the International Conference on Computer Science and Software Engineering, Wuhan, China, 12–14 December 2008; Volume 1, pp. 1195–1198. [Google Scholar]
  28. Xu, G. An Adaptive Parameter Tuning of Particle Swarm Optimization Algorithm. Appl. Math. Comput. 2013, 219, 4560–4569. [Google Scholar] [CrossRef]
  29. Hashemi, A.; Meybodi, M. A note on the learning automata based algorithms for adaptive parameter selection in PSO. Appl. Soft Comput. 2011, 11, 689–705. [Google Scholar] [CrossRef]
  30. Haarnoja, T.; Tang, H.; Abbeel, P.; Levine, S. Reinforcement Learning with Deep Energy-Based Policies. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1352–1361. [Google Scholar]
  31. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018. [Google Scholar]
  32. Ziebart, B.D.; Maas, A.; Bagnell, J.A.; Dey, A.K. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence, Washington, DC, USA, 2008; Volume 3, pp. 1433–1438. [Google Scholar]
  33. Maei, H.R.; Szepesvári, C.; Bhatnagar, S.; Precup, D.; Silver, D.; Sutton, R.S. Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation. In Proceedings of the 22nd International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2009; pp. 1204–1212. [Google Scholar]
  34. Kullback, S. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  35. Van der Stockt, S.A.; Engelbrecht, A.P. Analysis of selection hyper-heuristics for population-based meta-heuristics in real-valued dynamic optimization. Swarm Evol. Comput. 2018, 43, 127–146. [Google Scholar] [CrossRef]
  36. Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V.S.S. Alternative hyper-heuristic strategies for multi-method global optimization. In Proceedings of the IEEE Congress on Evolutionary Computation, Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar]
  37. Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V. Multi-method algorithms: Investigating the entity-to-algorithm allocation problem. In Proceedings of the IEEE Congress on Evolutionary Computation, Cancun, Mexico, 20–23 June 2013; pp. 570–577. [Google Scholar]
  38. Grobler, J.; Engelbrecht, A.P.; Kendall, G.; Yadavalli, V. Heuristic space diversity control for improved meta-hyper-heuristic performance. Inf. Sci. 2015, 300, 49–62. [Google Scholar] [CrossRef]
  39. Nareyek, A. Choosing Search Heuristics by Non-Stationary Reinforcement Learning. In Metaheuristics: Computer Decision-Making; Springer: Boston, MA, USA, 2004; pp. 523–544. [Google Scholar]
  40. Burke, E.K.; Kendall, G.; Soubeiga, E. A Tabu-Search Hyperheuristic for Timetabling and Rostering. J. Heuristics 2003, 9, 451–470. [Google Scholar] [CrossRef]
  41. Wirth, C.; Fürnkranz, J. EPMC: Every Visit Preference Monte Carlo for Reinforcement Learning. In Proceedings of the 5th Asian Conference on Machine Learning, Canberra, Australia, 13–15 November 2013; Volume 29, pp. 483–497. [Google Scholar]
  42. Rummery, G.A.; Niranjan, M. On-Line Q-Learning Using Connectionist Systems; Technical Report CUED/F-INFENG/TR 166; Department of Engineering, University of Cambridge: Cambridge, UK, 1994. [Google Scholar]
  43. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  44. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  45. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.P.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on International Conference on Machine Learning, ICML’16, New York, NY, USA, 20–22 June 2016; Volume 48, pp. 1928–1937. [Google Scholar]
  46. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. In Proceedings of the 31st International Conference on Machine Learning, Lille, France, 11 July 2015. [Google Scholar]
  47. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  48. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  49. Fujimoto, S.; van Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 1582–1591. [Google Scholar]
  50. Olorunda, O.; Engelbrecht, A.P. Measuring exploration/exploitation in particle swarms using swarm diversity. In Proceedings of the IEEE Congress on Evolutionary Computation, Hong Kong, China, 1–6 June 2008; pp. 1128–1134. [Google Scholar]
  51. Cleghorn, C.W.; Engelbrecht, A. Particle swarm optimizer: The impact of unstable particles on performance. In Proceedings of the IEEE Swarm Intelligence Symposium, Athens, Greece, 6–9 December 2016; pp. 1–7. [Google Scholar]
  52. Goodfellow, I.J.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  53. Engelbrecht, A. Stability-Guided Particle Swarm Optimization. In Swarm Intelligence (ANTS); Springer: Cham, Switzerland, 2022. [Google Scholar]
  54. Harrison, K.R.; Engelbrecht, A.P.; Ombuki-Berman, B.M. An adaptive particle swarm optimization algorithm based on optimal parameter regions. In Proceedings of the IEEE Swarm Intelligence Symposium, Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–8. [Google Scholar]
Figure 1. Comparison of reward landscapes. (a) Absolute reward (Equation (23)). (b) Relative reward (Equation (25)). (c) Relative change (Equation (26)).
Figure 2. Control parameter values for baseline. (a) Constant CP (note that the plots for c_1 and c_2 coincide); (b) Time-variant CP; (c) Random CP.
Figure 3. Stable particles for baseline.
Figure 4. Infeasible particles for baseline.
Figure 5. Particle velocity vectors for baseline.
Figure 6. Swarm diversity for baseline.
Figure 7. Control parameter values for SAC.
Figure 8. Stable particles for SAC.
Figure 9. Infeasible particles for SAC.
Figure 10. Particle velocities for SAC.
Figure 12. Control parameter values for SAC (velocity clamped).
Figure 13. Stable particles for SAC (velocity clamped).
Figure 14. Infeasible particles for SAC (velocity clamped).
Figure 15. Particle velocities for SAC (velocity clamped).
Figure 16. Swarm diversity (log) for SAC (velocity clamped).
Table 1. Comparison of RL algorithms. (Discr. = discrete action space, Cont. = continuous action space).

| Algorithm | Definition | Action | State | Model | Policy |
|---|---|---|---|---|---|
| MC [41] | Every-Visit Monte Carlo | Discr. | Discr. | Model free | Either |
| SARSA [42] | State-action-reward-state-action | Discr. | Discr. | Model free | On-policy |
| Q-learning [43] | State-action-reward-state | Discr. | Discr. | Model free | Off-policy |
| DQN [44] | Deep Q-Network | Discr. | Cont. | Model free | Off-policy |
| A3C [45] | Asynchronous Advantage Actor-Critic | Cont. | Cont. | Model free | On-policy |
| TRPO [46] | Trust Region Policy Optimisation | Cont. | Cont. | Model free | On-policy |
| PPO [47] | Proximal Policy Optimisation | Cont. | Cont. | Model free | On-policy |
| DDPG [48] | Deep Deterministic Policy Gradient | Cont. | Cont. | Model free | Off-policy |
| TD3 [49] | Twin Delayed Deep Deterministic Policy Gradient | Cont. | Cont. | Model free | Off-policy |
| SAC [14] | Soft Actor-Critic | Cont. | Cont. | Model free | Off-policy |
Table 2. SAC hyperparameters.

| Hyperparameter | Value |
|---|---|
| Discount factor γ | 1 |
| Target smoothing coefficient τ | 0.005 |
| Reward scale factor f_r | 1 |
| Learning rate α | 0.0001 |
| Actor layer size | 256 |
| Replay buffer size | 10^6 |
| Critic layer size | 256 |
| Training steps | 2 × 10^5 |
Table 3. Baseline normalised global best solution quality.

| Configuration | Training μ | Training σ | Testing μ | Testing σ |
|---|---|---|---|---|
| baseline_tvac | 0.0883 | 0.1894 | 0.2926 | 0.2037 |
| baseline_timevariant | 0.0888 | 0.1991 | 0.3287 | 0.3052 |
| baseline_constant | 0.0893 | 0.1856 | 0.3802 | 0.2822 |
| baseline_random | 0.1044 | 0.2032 | 0.3812 | 0.3125 |
Table 4. SAC-SAPSO normalised global best solution quality.

| Configuration | Training μ | Training σ | Testing μ | Testing σ |
|---|---|---|---|---|
| sac_n_t = 10 | 0.0589 | 0.1338 | 0.2585 | 0.2602 |
| sac_n_t = 125 | 0.0628 | 0.1299 | 0.1811 | 0.192 |
| sac_n_t = 25 | 0.0687 | 0.1475 | 0.2415 | 0.1712 |
| sac_n_t = 100 | 0.0693 | 0.1599 | 0.2363 | 0.1905 |
| sac_n_t = auto | 0.0693 | 0.1529 | 0.2373 | 0.1675 |
| sac_n_t = 50 | 0.0737 | 0.1609 | 0.2342 | 0.1998 |
Table 5. SAC-SAPSO normalised global best solution quality (velocity clamped).

| Configuration | Training μ | Training σ | Testing μ | Testing σ |
|---|---|---|---|---|
| vc_sac_n_t = 125 | 0.0428 | 0.1115 | 0.0919 | 0.0892 |
| vc_sac_n_t = 50 | 0.0431 | 0.103 | 0.0689 | 0.0575 |
| vc_sac_n_t = 100 | 0.0473 | 0.1143 | 0.1089 | 0.0828 |
| vc_sac_n_t = auto | 0.049 | 0.109 | 0.228 | 0.1628 |
| vc_sac_n_t = 10 | 0.0554 | 0.1219 | 0.1931 | 0.1902 |
| vc_sac_n_t = 25 | 0.0559 | 0.1239 | 0.1443 | 0.1373 |
Table 6. Comparison of normalised global best solution quality.

| Configuration | Training μ | Training σ | Testing μ | Testing σ |
|---|---|---|---|---|
| vc_sac_n_t = 125 | 0.0428 | 0.1115 | 0.0919 | 0.0892 |
| vc_sac_n_t = 50 | 0.0431 | 0.103 | 0.0689 | 0.0575 |
| vc_sac_n_t = 100 | 0.0473 | 0.1143 | 0.1089 | 0.0828 |
| vc_sac_n_t = auto | 0.049 | 0.109 | 0.228 | 0.1628 |
| vc_sac_n_t = 10 | 0.0554 | 0.1219 | 0.1931 | 0.1902 |
| vc_sac_n_t = 25 | 0.0559 | 0.1239 | 0.1443 | 0.1373 |
| sac_n_t = 10 | 0.0589 | 0.1338 | 0.2585 | 0.2602 |
| sac_n_t = 125 | 0.0628 | 0.1299 | 0.1811 | 0.192 |
| sac_n_t = 25 | 0.0687 | 0.1475 | 0.2415 | 0.1712 |
| sac_n_t = 100 | 0.0693 | 0.1599 | 0.2363 | 0.1905 |
| sac_n_t = auto | 0.0693 | 0.1529 | 0.2373 | 0.1675 |
| sac_n_t = 50 | 0.0737 | 0.1609 | 0.2342 | 0.1998 |
| baseline_tvac | 0.0883 | 0.1894 | 0.2926 | 0.2037 |
| baseline_timevariant | 0.0888 | 0.1991 | 0.3287 | 0.3052 |
| baseline_constant | 0.0893 | 0.1856 | 0.3802 | 0.2822 |
| baseline_random | 0.1044 | 0.2032 | 0.3812 | 0.3125 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
