1. Introduction
The renewable energy landscape is evolving substantially with the advancement of energy policies, the development of infrastructure, and the integration of continually improving power technologies. Renewable energy sources (RES), such as photovoltaics (PVs), wind energy conversion systems (WECSs), hydroelectric power stations (HPSs), and fuel cells (FCs), combined with various forms of energy storage devices, such as batteries and flywheels, are becoming increasingly dominant components of the power network. Power generation using distributed energy resources (DERs) requires optimal deployment of extraction and control techniques for large-scale integration and transmission. Improving the stability and operational reliability of the resulting microgrids is a topic of significant interest among researchers worldwide [1,2].
Microgrids (MGs) generally operate in two main modes: islanded (standalone) and grid-connected. In islanded mode, centralized or decentralized control can be used to provide voltage and frequency set points, for example through droop control or master–slave methods [3]. However, when coupled to the upstream electricity network, the grid-connected inverter is programmed to follow the voltage and frequency of the grid using control methods such as voltage-oriented control [2]. Faults in power systems can generally be categorized as symmetrical or asymmetrical. An asymmetrical fault occurs when the voltage phases of the grid are unequally affected, creating an imbalance in the system and resulting in different fault currents in each phase. Due to their unbalanced nature, asymmetrical faults are more challenging to model and analyze. In contrast, a symmetrical fault affects the three phases of the grid voltage equally, causing the fault current in all three phases to be identical in magnitude and phase angle. Symmetrical faults are typically the most severe type and generate the highest fault currents, but they occur less frequently than asymmetrical faults.
Most grid-connected MGs are not of special concern for the conventional power network, as they cannot influence its operating parameters (voltage and frequency). However, when several MGs are integrated into a low- or medium-voltage network, the stability and reliability of the larger grid can be compromised [4]. To ensure standard operation, virtual inertias are often incorporated into droop control loops, allowing the voltage and frequency magnitudes to be adjusted according to the active and reactive power generated by the inverter using low-bandwidth communication. This approach can effectively prevent voltage sagging while giving flexibility to transition between islanded and grid-connected operation through a phase-locked loop (PLL). Droop control is simple to implement; however, it has a few drawbacks, such as a slow transient response, circulating currents between converters, and imprecise power distribution during grid faults [5]. Furthermore, most standalone industrial grid-interfaced power converters are now required to have multiple fault ride-through (FRT) capabilities [6]. For instance, IEEE Std. 2800-2022—“IEEE Standard for Interconnection and Interoperability of Inverter-Based Resources (IBRs) Interconnecting with Associated Transmission Electric Power Systems”—clearly specifies performance requirements for dynamic active and reactive power support under abnormal frequency or voltage, negative sequence current injection, and low- or high-voltage ride-through operation. As such, the IBR plant must be able to switch between multiple modes of operation when required [7].
Consequently, in the event of voltage imbalance faults at a particular node, the effect is transmitted to power transformers, converters, and the larger microgrid, eventually affecting the synchronization of inverters and sensitive loads. The protection system of MGs, whether in grid-connected or islanded operation, must respond to abnormal grid conditions. For comparison, the fault current in grid-connected mode is 10–50 times the full-load current (in per unit) due to the low impedance of the utility network, whereas in islanded mode it is only 1.2–5 times higher [8]. As an ancillary service, negative sequence currents need to be injected to compensate for any voltage imbalances at the node. A complex proportional resonant controller can be designed to achieve this by sharing negative sequence currents between participating inverters connected to an AC microgrid using a communication link that transmits at 10 Hz [9]. However, the communication link is prone to failure, and low-bandwidth data transfer can increase communication delays, resulting in deteriorated robustness during faults. In [10], an improved inverter control technique based on grid impedance estimation is proposed using a Newton–Raphson algorithm implemented in the stationary reference frame that takes positive sequence phasors as input. This method iteratively adjusts the virtual impedance and stabilizes the inverter under unbalanced and harmonic-distorted grid voltages. Similarly, another method employs a positive–negative sequence synchro-converter capable of limiting output currents to reduce power oscillations when subjected to grid disturbances [11]. When the grid voltages abruptly deviate from the nominal operating set points, however, these control approaches exhibit poor transient behavior and long settling times.
Moreover, to leverage the operation of smart inverters, low-voltage ride-through (LVRT) capability is essential. An autonomous model predictive controller (AMPC), which is an improvement over classical MPC, can be adopted to adjust the injected active and reactive powers of a grid-connected quasi-Z-source inverter during such scenarios. Compared with conventional MPC, the trial-and-error design stage of the weight factor is replaced with an auto-tuning objective normalization function. An additional benefit of this method is that it allows a smooth transition between maximum power point tracking (MPPT) and FRT operation based on the command of the grid operator or the grid condition [12]. Intrinsically, it can sometimes be difficult to reduce inverter currents during LVRT, as the controller adjusts the reactive current to support the grid voltage [13]. As such, [14] proposes a method to limit fault currents using phase angle adjustment and to achieve improved voltage support by embedding the network impedance in the current control loop. In other works, a bridge-type fault current limiter (BFCL) is connected between the distributed energy resources and the main grid at the point of common coupling (PCC) to limit fault currents and provide reactive power support through cooperative control with the voltage source converter [15]. For a switched resistance wind generator (SRWG), an optimal proportional–integral (PI) voltage controller tuning method using the elephant herd algorithm (EHA) is another way to provide LVRT capabilities according to the grid code [16]. Further analysis of other existing grid fault ride-through techniques is presented in Table 1.
Recent advances in computational power density have led to numerous applications of machine learning (ML) and artificial intelligence (AI) in demand-side energy management, distribution networks, and improving the stability of power converters in MGs [24]. The Markovian jump stochastic neural network provides a framework particularly suited to modeling abrupt variations in system architecture and parameters in the context of stability and immunity to input disturbances [25]. The analysis of the Cohen–Grossberg bidirectional associative memory neural network with time delays can likewise be adapted to inverter control under uncertainties [26]. In addition, synchronization of control systems using sampled data subject to additive delays has also been investigated in detail [27]. Such approaches can be applied to improve the reliability and synchronization of multiple grid-connected inverters for optimal power distribution. Furthermore, fast and accurate fault detection often relies on appropriate data selection and feature extraction methods. High-quality hybrid feature selection based on correlated items, applied in the context of classification, can be used to identify critical aspects of inverter performance by detecting voltage variations or unpredictable changes in load [28,29,30]. Through careful evaluation and reduction of the feature space with a focus on the most relevant data points, the computational cost of training and deploying sophisticated algorithms, such as the adaptive neuro-fuzzy inference system for strategic decision-making in the energy sector, can be reduced, improving overall reliability in dynamic environments [31,32].
In hierarchical control, machine learning-based approaches have brought significant improvements in power system performance at all levels (primary, secondary, and tertiary) and have been shown to be effective in load forecasting, generation prediction, and fault diagnosis [33]. In a low-voltage grid, LVRT and constant reactive current injection during voltage sags can be achieved through online detection of grid abnormalities based on wavelet transforms and shallow neural networks, with the resulting adjustments applied to the power references. The fault identification technique reported a classification precision of 98.3% with PCC voltage and frequency as input data [34]. Terminal voltage regulation is crucial during transitions between grid-connected and standalone operations. A distributed generation (DG) inverter can be controlled using a deep neural network (DNN) trained offline in conjunction with a proportional–integral–derivative (PID) controller to ensure a smooth transition. Additionally, a feedforward loop can be incorporated to mitigate harmonics for nonlinear loads [35]. Battery energy storage systems (BESSs) in MGs are commonly used for voltage support and fault compensation. To address the issue of short-circuit-induced voltage sags in power lines using a BESS in a hybrid MG, a recurrent wavelet Petri fuzzy neural network (RWPFNN) can be used in the control loop for LVRT and fast voltage restoration [36]. Other studies have employed neural networks with an extended Kalman filter to handle the control of doubly fed induction generators during sensor failure on controlled variables of back-to-back power converters [37].
In addition, reinforcement learning (RL) techniques have been extensively utilized to design better secondary control (power and frequency) of power inverters. For unstructured MGs with tidal power units and vehicle-to-grid connections, load frequency control is extremely important to restore stability. A fractional gradient descent-based fuzzy logic controller can be implemented as the main frequency regulator, with a deep deterministic policy gradient (DDPG) agent composed of an actor–critic architecture producing additional control signals for frequency stabilization [38]. In [39], a quantum neural network (QNN) is formulated to cooperatively control the frequency of an isolated MG. It combines deep reinforcement learning (DRL) and quantum machine learning to minimize the number of parameters and the training requirements for a 13-bus network. A data set containing the frequency response of unoptimized linear controllers is initially required to train the network through supervised learning. In [40], a frequency regulation approach is proposed using DRL, considering both the frequency performance and the economic operational limits of the MG. The twin delayed deep deterministic policy gradient (TD3PG) algorithm is implemented based on historical data for adaptive selection of the best commands to ensure frequency stability.
Moreover, a virtual inertia emulation technique is established in [41] using the TD3PG reinforcement learning algorithm for frequency regulation in weak grids with reduced inertia and damping capability. Verification shows the ability of such controllers to better stabilize frequency in grid-connected and standalone MGs. In another study, a Kuramoto-based consensus algorithm is used to implement multi-agent reinforcement learning to handle frequency deviations. In [42], an adaptive controller based on TD3PG is proposed to replace virtual synchronous generators (VSGs) in modular multilevel converter control. This approach reported better system strength and higher resistance to disturbances than conventional VSGs. Reinforcement learning algorithms can also be used to design the parameters of existing controllers such as proportional–integral, proportional–resonant, and fractional-order proportional–integral regulators. In [43], a deep Q network is adopted to adjust the parameters of the proportional–integral controllers for a dual active bridge converter according to the modulation phase-shift angles. This allows real-time mapping of phase-shift angles to the necessary control parameters for stability.
Furthermore, reinforcement learning has been applied to design Q-learning and TD3PG current controllers for the rotor- and grid-side inverters of variable-speed doubly fed induction generators [44]. The performance of both agents is evaluated under dynamic conditions, leading to the conclusion that Q-learning agents may not perform well under significant disturbances compared to TD3 agents. In another study, a TD3PG agent is utilized to attain superior performance of permanent magnet synchronous machines (PMSMs) governed by the classical field-oriented control (FOC) scheme. The agent works in tandem with extended state observer (ESO)-type speed observers, which allows sensorless FOC through online estimation of the machine speed. Multiple neural networks are used to obtain the load torque, with the TD3 agent performing a real-time correction of the estimated speed. The results indicate better reference tracking capability with the use of the RL TD3PG agent [45]. DDPG agents have also been implemented in a similar way and trained to work in parallel with an outer-loop speed controller designed using sliding mode control (SMC) to improve the load torque rejection capability of PMSMs [46]. In this work, it is shown that RL agents are able to directly manipulate control signals to obtain the desired performance without a high computational or mathematical burden. Although there are extensive applications of RL agents and other intelligent control methods in machine control [47], there is limited literature focusing on the design of robust and adaptive AI-based controllers that allow inverters to withstand adverse grid disturbances.
With the increasing penetration of renewable energy into the electrical network and the growing demand for resilient inverter-based technology, there is a pressing need to develop advanced control methods to stabilize MGs under various dynamic conditions. The expanding incorporation of DERs with utility and standard ancillary practices makes the operation and control of voltage, frequency, and FRT mechanisms more challenging. AI has the capability to open a new frontier in power engineering, with promising advances in processing density and improved deep learning algorithms [48]. If implemented appropriately, a self-governing, highly intelligent controller capable of withstanding various grid abnormalities is certainly feasible in MG applications. Although the above-mentioned model-free control methods deliver promising results, most depend on the type of data used during the training process. As such, deep learning methods that do not require an exact MG model for efficient FRT and that allow transfer learning are desirable [49]. Evidently, during grid faults, such as voltage swells, dips, unbalances, frequency disturbances, harmonics, and asymmetrical or symmetrical faults, the current control loop of IBRs is important in regulating power quality and ensuring stability. Most of the methods available in the literature target compensation of one or a few types of fault using a particular approach, while some techniques require switching between multiple control modes depending on the induced fault. Some classical FRT controllers perform additional computations to generate negative sequence components for fault compensation; however, this requires exact plant modeling, which makes the design process tedious.
In addition, the current loop with cross-coupling terms in conventional voltage-oriented control in inverter-based MGs causes instabilities during abnormal grid conditions, which are attributed to poor proportional–integral controller dynamics and saturation limits. Based on the aforementioned constraints and inspired by reinforcement learning techniques, a novel twin delayed deep deterministic policy gradient-based reinforcement learning control approach is proposed to provide effective FRT operation and ensure optimal current injection from RES. The main contributions of this research article are as follows.
A novel twin delayed deep deterministic policy gradient agent is formulated and trained to generate the direct- and quadrature-axis voltages in real time for resilient operation of the inverter under abnormal grid conditions, based on a set of observed states from the environment developed in MATLAB/Simulink®.
A framework for training and deploying fully AI-based current controllers has been established to achieve the optimal response of grid-interfaced inverters using a model-free Markov decision process guided through a suitable reward function, thus eliminating the need to design controllers based on uncertain operating parameters and an accurate system model.
Initially, the agent is trained for a single asymmetric phase-to-phase fault and then applied, through transfer learning, to improve the inverter response for other asymmetric and symmetric grid faults. This feature allows the controller to continuously adapt to variations in operating conditions and remain resilient despite deviations in grid voltages, without requiring complex deep neural network architectures, thereby minimizing computational and training costs.
Such control approaches have not been studied in depth in the field of inverter fault-ride operation. Therefore, this article focuses on developing the actor–critic-based data-driven control strategy, which has been thoroughly verified under asymmetric faults (L-L and L-L-G) and symmetric faults (L-L-L). The proposed fault ride-through approach is also compared with the standard positive sequence voltage-oriented control technique.
The organization of this paper is as follows.
Section 1 provides a detailed background and an extensive review of the literature on the subject.
Section 2 covers the description of the power conversion system, while
Section 3 describes the traditional positive sequence voltage-oriented control method. The novel design process for the twin delayed deep deterministic policy gradient agent, intended for grid fault ride-through, is detailed in
Section 4. This includes specifics on the architecture of the deep neural network, the formulation of the reward function, and the training procedure.
Section 5 presents a comparative evaluation of the proposed fault ride-through approach against the conventional control method. The research considers three distinct fault scenarios, with the DC link voltage response, post-fault recovery time, and cumulative integral-time error analysis serving as primary evaluation metrics.
Section 6 discusses the limitations of the study and future research opportunities, and the article is concluded in
Section 7.
2. System Description
The IEEE 13-bus network is a benchmark test system used in research to implement new concepts in electricity generation, transmission, and distribution. For modeling purposes, single AC generators are represented as a swing voltage source with a fixed X/R ratio of 10. The classical 13-bus system has multiple predetermined single- and three-phase loads connected to all buses. For this study, an inverter is connected to an existing 4160/415 V transformer between buses 633 and 634 to facilitate the integration of a 100 kW solar system, as shown in Figure 1.

In the primary power conversion circuit (Figure 2), a three-phase two-level inverter is connected to the IEEE 13-bus network between buses 633 and 634 through an LCL-type current harmonic filter. A DC/DC boost converter is used to step up the voltage of the PV panels to charge the DC link capacitors. The scheme follows a centralized inverter topology in which a single inverter encompasses the entire power conversion circuit and controls the power using a single DC link [50]. The incremental conductance (INC) maximum power point tracking algorithm is used for optimum power extraction. INC requires measurement of the array voltage and current to calculate the instantaneous conductance and update the reference duty cycle accordingly. It has the benefit of faster adaptation to rapidly changing climate conditions and fewer oscillations when attaining the MPP. The perturbation voltage step is kept small to minimize tracking error after reaching the desired operating point. Other parameters of the power conversion circuit are stated in Table 2.
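For illustration, a minimal Python sketch of the incremental-conductance decision logic is given below. The function name, fixed perturbation step, and sign convention for the duty-cycle update are assumptions for clarity and do not reproduce the exact Simulink implementation used in this study.

```python
def inc_mppt(v, i, v_prev, i_prev, d_prev, step=1e-3):
    """Incremental-conductance MPPT: compare the incremental conductance
    dI/dV with the negative instantaneous conductance -I/V and perturb the
    boost converter duty cycle accordingly. Step size and sign convention
    are illustrative."""
    dv, di = v - v_prev, i - i_prev
    if dv == 0:
        if di == 0:
            return d_prev                     # at the MPP, hold the duty cycle
        return d_prev + step if di > 0 else d_prev - step
    g_inc, g_inst = di / dv, -i / v
    if g_inc == g_inst:
        return d_prev                         # dP/dV = 0: MPP reached
    return d_prev + step if g_inc > g_inst else d_prev - step
```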
For this study, as a prerequisite, voltages and currents are normalized to per-unit values. The use of per-unit (p.u.) measurements significantly enhances scalability, making it easier to design, analyze, and operate systems of varying sizes and complexities with various control techniques. Using a base apparent power ($S_{base}$) of 100 kVA and a nominal AC phase-to-phase voltage ($V_{LL}$) of 415 V, the voltage and current bases are defined in Equation (1). Thus, the instantaneous grid voltages and currents of each phase measured at the PCC are independently converted to their per-unit equivalents using Equation (2). The transformed voltages and currents are then fed to a phase-locked loop to extract their d and q counterparts in normalized form.
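The following minimal sketch illustrates the normalization and the amplitude-invariant Park transformation described above, assuming peak-value voltage and current bases; the exact base definitions of Equations (1) and (2) may differ in the original formulation, and all names are illustrative.

```python
import numpy as np

# Assumed base definitions (peak-value convention).
S_BASE = 100e3          # base apparent power [VA]
V_LL = 415.0            # nominal line-to-line RMS voltage [V]
V_BASE = np.sqrt(2.0 / 3.0) * V_LL                        # peak phase voltage [V]
I_BASE = np.sqrt(2.0) * S_BASE / (np.sqrt(3.0) * V_LL)    # peak phase current [A]

def to_per_unit(v_abc, i_abc):
    """Normalize instantaneous phase voltages/currents measured at the PCC."""
    return np.asarray(v_abc) / V_BASE, np.asarray(i_abc) / I_BASE

def abc_to_dq(x_abc, theta):
    """Amplitude-invariant Park transformation using the estimated grid angle."""
    a, b, c = x_abc
    d = (2.0 / 3.0) * (a * np.cos(theta)
                       + b * np.cos(theta - 2.0 * np.pi / 3.0)
                       + c * np.cos(theta + 2.0 * np.pi / 3.0))
    q = -(2.0 / 3.0) * (a * np.sin(theta)
                        + b * np.sin(theta - 2.0 * np.pi / 3.0)
                        + c * np.sin(theta + 2.0 * np.pi / 3.0))
    return d, q
```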
Grid synchronization is crucial to maintaining quality power injection into the grid despite its volatile nature. As such, the double second-order generalized integrator phase-locked loop (DSOGI-PLL) is often used to synchronize the inverter voltages with the main grid. The DSOGI-PLL has established superior performance in stabilizing the inverter under numerous grid disturbances [51,52]. It is based on an instantaneous symmetric component method with two in-built quadrature signal generators (QSGs) that filter out additional harmonics present in the grid voltages in the first stage. Each QSG produces two voltages in quadrature (90° apart) that can be recombined to form the positive and negative stationary reference frame grid components using Clarke’s transformation of the grid voltage at the PCC ($v_{\alpha}$ and $v_{\beta}$). Since the grid frequency is constant under normal operating conditions, the filter bandwidth essentially depends on the damping factor ($\zeta$), which is generally chosen as $1/\sqrt{2}$. This also makes the structure suitable for variable-frequency applications [53].
Taking the two in-phase and two quadrature outputs of the dual QSG, the decoupled positive and negative sequence components of the grid voltage can be obtained using the sequence calculation in Equation (3). The corresponding positive and negative voltages on the d and q axes can then be easily obtained by Park’s transformation of the respective stationary reference frame voltages, with $\hat{\theta}$ as the estimated grid angle, according to Equations (4) and (5).
The positive sequence d-axis voltage gives the amplitude of the input voltage, while the q-axis voltage gives information on the phase error. A PI controller is used as a loop filter to drive the q-axis voltage to zero, and its output is sent to the voltage-controlled oscillator, represented by the integrator, thus allowing the phase angle of the input signal to be estimated [54]. Transformation of the three-phase voltages and currents from their natural (abc) states into synchronous rotating frame (dq) DC quantities allows decoupled control of grid-connected inverters in both the positive and negative sequence domains.
The approach can also be applied to the grid currents measured at the PCC to obtain their positive and negative sequence components. With the stationary reference frame currents $i_{\alpha}$ and $i_{\beta}$ as inputs to the dual QSG, the corresponding in-phase and quadrature outputs can be recombined to produce the positive and negative sequence current components required for control purposes.
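As an illustrative sketch of this recombination step, the instantaneous symmetric component calculation commonly paired with the DSOGI-QSG can be written as follows; variable names are assumptions, and the paper's Equations (3)–(5) should be consulted for the exact formulation.

```python
import numpy as np

def sequence_components(v_alpha, qv_alpha, v_beta, qv_beta):
    """Instantaneous symmetric component (ISC) recombination of the dual
    SOGI-QSG outputs into positive- and negative-sequence alpha-beta signals.
    qv_* denotes the in-quadrature (90-degree lagged) output of each QSG."""
    v_alpha_pos = 0.5 * (v_alpha - qv_beta)
    v_beta_pos = 0.5 * (qv_alpha + v_beta)
    v_alpha_neg = 0.5 * (v_alpha + qv_beta)
    v_beta_neg = 0.5 * (-qv_alpha + v_beta)
    return (v_alpha_pos, v_beta_pos), (v_alpha_neg, v_beta_neg)

def alphabeta_to_dq(v_alpha, v_beta, theta):
    """Rotate stationary-frame components into the synchronous dq frame."""
    d = v_alpha * np.cos(theta) + v_beta * np.sin(theta)
    q = -v_alpha * np.sin(theta) + v_beta * np.cos(theta)
    return d, q
```

The same two functions can be reused for the measured PCC currents, yielding the positive and negative sequence current components needed by the controller.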
3. Conventional Voltage-Oriented Control
Positive sequence voltage-oriented control (PSVOC) is the most commonly used scheme for interfacing photovoltaic inverters with the grid. It has two cascaded control loops: the inner loop regulates the inverter currents on the d and q axes, and the outer loop ensures the stability of the DC link voltage, as shown in Figure 3. Normally, the DC link voltage reference is kept constant, and a PI controller is used to generate the direct-axis current reference that corresponds to the net active power to be injected by the inverter into the grid [55]. Transient conditions and variations in the DC bus voltage are regulated by the capacitor’s charge-and-discharge process. In a grid-connected situation, the voltage can fluctuate due to changes in the solar irradiation or temperature of the photovoltaic array, as well as oscillations in the AC power caused by grid imbalances.
Injection of active power into the grid is indicated by an increase in the three-phase currents at the PCC, assuming that the grid voltage remains steady during normal operation. The change in DC link voltage is determined by the power balance between the solar panels and the currents injected into the grid. The reactive power supply can be commanded directly by specifying the required amount and calculating the corresponding quadrature-axis current. The errors of both currents are processed independently by inner-loop dual PI controllers. Typically, the transfer function of the interconnecting filter is used to tune the proportional ($K_p$) and integral ($K_i$) controller gains. The final inverter reference voltage is the combined output of the current PI controllers, the grid voltages, and the cross-coupling terms [56].
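A compact sketch of this inner loop is given below. The PI class, the per-unit coupling term, and the sign conventions are illustrative assumptions that follow the usual synchronous-frame formulation rather than the exact Simulink model.

```python
class PIController:
    """Discrete PI with clamped integrator (simple anti-windup)."""
    def __init__(self, kp, ki, dt, limit=1.0):
        self.kp, self.ki, self.dt, self.limit = kp, ki, dt, limit
        self.integ = 0.0

    def update(self, error):
        self.integ = max(-self.limit,
                         min(self.limit, self.integ + self.ki * error * self.dt))
        return max(-self.limit, min(self.limit, self.kp * error + self.integ))

def psvoc_inner_loop(id_ref, iq_ref, i_d, i_q, v_gd, v_gq, w_l_pu, pi_d, pi_q):
    """Decoupled current control: PI action on each axis plus grid-voltage
    feedforward and the cross-coupling terms (omega * L in per unit)."""
    v_d_ref = pi_d.update(id_ref - i_d) + v_gd - w_l_pu * i_q
    v_q_ref = pi_q.update(iq_ref - i_q) + v_gq + w_l_pu * i_d
    return v_d_ref, v_q_ref
```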
Decoupling allows independent control of the active and reactive powers of grid-connected inverters. According to instantaneous power theory, the active and reactive powers on the AC side in per unit can be evaluated using $p = v_d i_d + v_q i_q$ and $q = v_q i_d - v_d i_q$, respectively. It is evident that the active and reactive power produced by the inverter is directly proportional to the direct- and quadrature-axis currents in the synchronous reference frame. Therefore, for unity power factor operation, the reactive power reference ($q^{*}$) can be kept at zero. The current control operates with a fixed-frequency free-running carrier signal, resulting in well-bounded current total harmonic distortion (THD) and constant-frequency inverter operation. This aids in instantaneous takeover of current control and has the advantage of eliminating the effect of DC-side ripple on the inverter phase currents.
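A short sketch of these power expressions, assuming the amplitude-invariant dq transform used above, is given below for reference.

```python
def instantaneous_pq(v_d, v_q, i_d, i_q):
    """Per-unit active and reactive power in the synchronous reference frame
    (instantaneous power theory; amplitude-invariant dq transform assumed)."""
    p = v_d * i_d + v_q * i_q
    q = v_q * i_d - v_d * i_q
    return p, q

# With the d axis aligned to the grid voltage (v_q ~ 0), p ~ v_d * i_d and
# q ~ -v_d * i_q, so a zero quadrature-axis current reference yields
# unity-power-factor operation.
```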
The third harmonic injection pulse width modulation (THIPWM) strategy is highly desirable for producing the IGBT gate pulses, as it can help improve inverter performance under low DC link voltage conditions. In THIPWM, a sine wave at three times the fundamental frequency is added to the modulating signals, reducing the peak of the resulting modulating signal of each phase. With the amplitudes of the per-phase modulating signals and the amplitude of the third harmonic defined, the positive sequence inverter modulating signals with the embedded third harmonic are given in Equation (10). The amplitude of the third harmonic component is usually chosen between 0.15 and 0.2 of the fundamental [57]. The resulting modulating signals are compared with a higher-frequency triangular carrier to produce the logic pulses for the semiconductor devices.
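A minimal sketch of the third-harmonic injection and carrier comparison is shown below; the default injection ratio of 1/6 and the function names are illustrative assumptions consistent with the 0.15–0.2 range quoted above.

```python
import numpy as np

def thipwm_references(m1, theta, k3=1.0 / 6.0):
    """Third-harmonic-injection modulating signals: a zero-sequence third
    harmonic (amplitude k3 * m1) is added to each phase reference, lowering
    the peak of the resulting envelope."""
    third = k3 * m1 * np.sin(3.0 * theta)
    m_a = m1 * np.sin(theta) + third
    m_b = m1 * np.sin(theta - 2.0 * np.pi / 3.0) + third
    m_c = m1 * np.sin(theta + 2.0 * np.pi / 3.0) + third
    return m_a, m_b, m_c

def gate_signals(m_abc, carrier):
    """Compare each modulating signal with a high-frequency triangular
    carrier (both in [-1, 1]) to obtain the upper-switch logic pulses."""
    return tuple(m >= carrier for m in m_abc)
```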
The gains of the PI controllers in such control approaches directly affect the stability of the inverter and its response to abrupt changes in operating conditions. Although there are many methods for designing these controller parameters, such as Ziegler–Nichols tuning and pole placement, many cannot guarantee the best performance during grid disturbances and nonlinear loading conditions. For a more robust operation, nature-inspired metaheuristic algorithms have gained popularity in inverter control applications [58,59,60,61]. In this article, the widely used genetic algorithm (GA) is implemented to iteratively determine the best set of PI values for the regulation of the DC link voltage and the inverter currents. The GA has been adopted because it guarantees convergence given certain constraint bands and does not require knowledge of derivatives, as the algorithm can operate on input–output mapping alone [62]. For the inner current loop, the PI values depend on the transfer function of the LCL filter. With a cost function based on the integral time absolute error (ITAE), a population size of 10, and lower/upper limits for the PI values of [0, 500], 52 generations are sufficient for convergence to the optimal values. Similarly, the DC link transfer function established in [63] is utilized for optimization with the same settings; in this case, a total of 107 generations are required to achieve convergence to the optimum values. The resulting best-fit proportional and integral gains are adopted for the current and DC link PI controllers, and all controllers are equipped with anti-windup to reduce the effect of saturation and prevent large overshoots or oscillations.
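The sketch below outlines the GA-based ITAE tuning loop in simplified form. A first-order lag is used as a stand-in plant, and the selection, crossover, and mutation operators are generic choices; the study itself optimizes against the actual LCL-filter and DC-link transfer functions, so the code is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def itae_cost(gains, t_end=0.2, dt=1e-4):
    """ITAE of a unit step response for a candidate (Kp, Ki) pair.
    A first-order lag stands in for the actual filter/DC-link plant."""
    kp, ki = gains
    tau = 5e-3                      # assumed plant time constant [s]
    y, integ, cost = 0.0, 0.0, 0.0
    for k in range(int(t_end / dt)):
        t = k * dt
        e = 1.0 - y                 # unit step reference
        integ += e * dt
        u = np.clip(kp * e + ki * integ, -500.0, 500.0)
        y += dt * (u - y) / tau     # first-order plant update
        cost += t * abs(e) * dt     # ITAE accumulation
    return cost

def ga_tune(pop_size=10, generations=60, bounds=(0.0, 500.0)):
    """Minimal genetic algorithm: tournament-style selection, blend
    crossover, Gaussian mutation, and elitism."""
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, 2))
    for _ in range(generations):
        fit = np.array([itae_cost(ind) for ind in pop])
        order = np.argsort(fit)
        elite = pop[order[0]].copy()
        new_pop = [elite]
        while len(new_pop) < pop_size:
            a, b = pop[rng.choice(order[:5], size=2, replace=False)]
            child = 0.5 * (a + b) + rng.normal(0.0, 10.0, size=2)
            new_pop.append(np.clip(child, lo, hi))
        pop = np.array(new_pop)
    return pop[0]                   # best individual carried by elitism

best_kp, best_ki = ga_tune()
```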
4. Reinforcement Learning Agent Design for Grid Fault Ride-Through
Fault ride-through requirements state that all IBRs must remain connected to the grid for a certain duration and wait for clearance to confirm that the fault is temporary. From the inverter’s perspective, a low-voltage event is simply a dip or sag in the voltage profile at the PCC, causing a drop in the computed d- and q-axis components. Depending on the depth and duration of the dip, the currents injected by the inverter may increase considerably. During faults, conventional PI controllers suffer from poor performance and are unable to track the designated current references in the synchronous reference frame. Prolonged symmetric or asymmetric faults can cause DC link spikes, overcurrents, and deviations of the active and reactive powers. The dynamic response of the feedforward decoupled control approach also deteriorates during fluctuations in grid voltage due to the presence of the d- and q-axis voltage terms in the cross-coupling.

Although the controller gains can be optimized, obtaining knowledge of the grid condition and incorporating it into the design process is difficult due to the grid’s dynamic nature. Additionally, PI controllers are insensitive to changes in grid operating conditions and struggle to maintain inverter stability during abnormal grid conditions because of their predefined saturation limits. Because the grid voltage appears in the cross-coupling of the inner control loop, the PI controllers tend to increase the inverter modulating voltages, causing an increase in the supplied currents. Prolonged divergence of the inverter currents drives the PI controller outputs to their maximum thresholds, and the inverter thus enters overmodulation.
Most other types of controllers, such as proportional–resonant, deadbeat, and fractional-order PI controllers, are inherently designed to provide a good steady-state response and may not respond rapidly to sudden variations during grid faults. Nonlinearities are often introduced into the system by disturbances in grid voltage and frequency, producing responses that standard controllers struggle to handle effectively. Other advanced control techniques, such as sliding mode and model predictive control, also encounter issues during grid transients due to the chattering phenomenon, sensitivity to uncertain parameter variations, tuning complexities, and dependency on accurate system models. As such, a twin delayed deep deterministic policy gradient (TD3PG) current controller is formulated and trained to provide effective current control of inverter-based resources under various severities of grid disturbance. The improved voltage-oriented control based on reinforcement learning for effective fault management of grid-connected inverters is depicted in Figure 4.
Reinforcement learning is an effective tool that can greatly improve the design process of power converter controllers. It provides various benefits, particularly in intricate and dynamic environments where conventional control methods may face difficulties. RL algorithms are capable of continually learning and adapting to changing operational conditions, such as load variations or fluctuations in input voltage. This is particularly advantageous in the control of grid-feeding inverters, where the power network can be highly unstable. Inverters frequently demonstrate nonlinear behavior, especially under fault conditions. RL agents are expected to handle such nonlinearity well without requiring explicit modeling, enabling more robust and flexible control mechanisms. Traditional control techniques typically depend on precise mathematical models of the inverter and its operational context. In contrast, the model-free RL approach derives the control strategy directly through interaction with the environment, making it ideal for systems where precise models are hard to develop and enhancing fault resilience.
The twin delayed deep deterministic policy gradient is an advanced reinforcement learning algorithm that serves as an upgrade to the deep deterministic policy gradient algorithm. RL can be applied to control systems to train deep neural networks through interaction with an environment, enabling tasks that are difficult to achieve using traditional control methods. The actor–critic agent is widely used in these applications, where the actor neural network determines the actions to be taken and the critic network assesses the effectiveness of those actions based on the rewards received from the environment (Figure 5).
The Markov decision process (MDP) is a mathematical modeling tool used to address decision-making problems involving an RL agent and its interaction with the environment. It provides a framework for modeling sequential decision processes or a continuous action space under uncertain states. Optimizing an agent involves finding the best policy that maximizes the expected cumulative reward using a suitable algorithm such as Q-learning or policy gradient. In an MDP, the states (s) represent all the information an agent receives from the environment, the actions (a) are the array of numeric inputs (continuous or discrete) applied to the environment as determined by the actor, and the reward (r) is the scalar feedback obtained for the applied action. During training, the actor maps states to actions, eventually converging to a deterministic policy after numerous iterations [64].
4.1. Deep Neural Networks
To denote the primary indicators for selecting actions in the subspace, a set of measured synchronous-frame current signals is chosen as the observation vector for the agent, while the d- and q-axis modulation voltages of the inverter are the two actions required from the actor network to provide enhanced stability to the inverter. The critic network has two feature input layers (FILs): one for the states and the other for the corresponding actions. The information measured via sensors is passed through two fully connected layers (FCLs), and the outputs are summed together, forming a single feedforward path. A rectified linear unit (ReLU) activation function performs a threshold operation, setting all negative values to zero, before the data are input to the next FCL. The final FCL outputs the required Q-value at each time step.
In a similar manner, the states are fed to the actor network through a FIL and passed through an FCL with 64 outputs, which feeds two separate but identical paths of two FCLs each. The outputs are summed and propagated through two more FCLs, and the final modulation voltages are produced. In the actor network, the hyperbolic tangent (tanh) activation function is employed between FCLs due to its good performance; it naturally normalizes signals to the range [−1, 1], which acts as the saturation limit for the direct and quadrature voltages. The critic and actor deep neural networks are shown in Figure 6. The Adam learning algorithm is used to train the dual critic and actor networks, which have a total of 3100 and 5800 learnable parameters, respectively. In general, both networks follow a similar pattern in which the number of outputs of the FCLs is progressively reduced towards the output layer. This prevents overfitting and ensures more stable training convergence.
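A PyTorch analogue of the described critic and actor topologies is sketched below. The observation size, layer widths, and activation placement are assumptions chosen to mirror the textual description rather than the exact MATLAB/Simulink® networks, so the parameter counts will not match the reported 3100 and 5800 exactly.

```python
import torch
import torch.nn as nn

N_OBS, N_ACT = 4, 2   # illustrative sizes; the paper's exact state vector may differ

class Critic(nn.Module):
    """Two input paths (states, actions) merged by addition, followed by a
    ReLU and a final fully connected layer producing the Q-value."""
    def __init__(self, n_obs=N_OBS, n_act=N_ACT, width=50):
        super().__init__()
        self.state_path = nn.Sequential(nn.Linear(n_obs, width), nn.ReLU(),
                                        nn.Linear(width, width))
        self.action_path = nn.Sequential(nn.Linear(n_act, width), nn.ReLU(),
                                         nn.Linear(width, width))
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(width, 1))

    def forward(self, s, a):
        return self.head(self.state_path(s) + self.action_path(a))

class Actor(nn.Module):
    """Shared 64-unit layer, two identical parallel branches summed together,
    then two more layers with a tanh output bounding v_d and v_q in [-1, 1]."""
    def __init__(self, n_obs=N_OBS, n_act=N_ACT, width=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_obs, 64), nn.Tanh())
        self.branch_a = nn.Sequential(nn.Linear(64, width), nn.Tanh(),
                                      nn.Linear(width, width))
        self.branch_b = nn.Sequential(nn.Linear(64, width), nn.Tanh(),
                                      nn.Linear(width, width))
        self.head = nn.Sequential(nn.Tanh(), nn.Linear(width, width), nn.Tanh(),
                                  nn.Linear(width, n_act), nn.Tanh())

    def forward(self, s):
        h = self.shared(s)
        return self.head(self.branch_a(h) + self.branch_b(h))
```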
4.2. Reward Function
It is necessary to use a reward function that allows the agent to learn the best possible policy by encapsulating an objective that minimizes the deviation of the inverter currents from the nominal set points at each discrete time step during abrupt grid-side disturbances. The reward function is derived on the premise of effectively controlling the direct- and quadrature-axis currents based on the observed states. It is a combination of the measured d- and q-axis current errors with respect to the reference, the integral square error, and the negative sequence currents of the inverter, all of which must be minimized when any imbalance is induced in the grid voltage. The total episode reward is the sum of all instantaneous rewards at each time step until the execution is terminated.
The reward function assigns different weights to different aspects of the reward, represented by the arbitrary gain values M, N, and O. In this case, N and O are set to −1 and M to −0.5. The weights are selected to guarantee quick convergence during training. The adjustable gains permit customization of the significance assigned to each part of the reward function. For example, the squared errors of the positive and negative sequence currents are assigned the same gains so that both errors are equally minimized as the injected inverter currents deviate during faults. In contrast, the cumulative error increase over time is assigned a lower weight to ensure that it does not affect the real-time inverter regulation under the usual grid conditions. In addition, an average reward is calculated based on the length of the score averaging window. To simulate the agent’s adaptation, a short-circuit fault is introduced between phases B and C at bus 633. The agent is expected to adjust its behavior based on the defined reward function to minimize current deviations in the inverter and improve the resilience of the inverter. All measurements are taken as per unit values.
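A hedged sketch of such a step reward is shown below. Only the weight values M, N, and O are taken from the text; the exact grouping of the error terms is an assumption.

```python
def step_reward(err_d, err_q, int_sq_err, i_neg_d, i_neg_q,
                M=-0.5, N=-1.0, O=-1.0):
    """Instantaneous reward: squared positive-sequence current tracking errors
    and squared negative-sequence currents are penalized with equal weight
    (N, O), while the accumulated integral-square error carries a smaller
    weight (M) so that it does not dominate normal-condition regulation."""
    return (N * (err_d ** 2 + err_q ** 2)
            + O * (i_neg_d ** 2 + i_neg_q ** 2)
            + M * int_sq_err)

# The episode return is the sum of step_reward over all time steps until
# the simulation terminates.
```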
4.3. TD3PG Training Process
The agent training process involves systematically updating the parameters of the actor–critic network. In TD3PG, two critic networks are present that help mitigate the issue of overestimation bias of the DDPG agent and also incorporate delayed updates of the target networks to stabilize convergence during training. The output of the critic networks, known as the Q-value, is an important component of the training process and acts as an indicator for the estimated cumulative reward given a set of observations and actions data influenced by certain policies. The complete agent training process is shown in
Figure 7, and the general agent training pseudocode is presented as Algorithm 1.
Algorithm 1: Pseudocode for training the TD3PG agent for inverter regulation.
Denoting the parameters of the actor network as $\phi$, the weights of the first critic network ($Q_{\theta_1}$) as $\theta_1$, the weights of the second critic network ($Q_{\theta_2}$) as $\theta_2$, and the parameters of the corresponding target networks as $\phi'$, $\theta_1'$, and $\theta_2'$, the actor is expected to maximize the expected cumulative long-term reward by minimizing the negative Q-value objective given as
$$J(\phi) = -\,\mathbb{E}\left[\,Q_{\theta_1}\big(s,\mu_{\phi}(s)\big)\right].$$
$\mathbb{E}$ represents the expectation operator associated with the stochastic properties of the transition probability and the policy $\mu_{\phi}$. Assuming $\mu_{\phi}(s_t)$ is the output of the actor network for the state at time step $t$ with the current weights, and $\nabla_{a}Q_{\theta_1}(s_t,a)$ is the gradient of the first critic network with respect to the action, gradient ascent can be used to update the parameters of the actor network by applying the chain rule to the expected return objective with respect to the actor parameters [65]. Therefore, the new weights of the actor network are calculated from the return objective and the actor learning rate ($\alpha_a$) as follows:
$$\phi \leftarrow \phi + \alpha_a\,\nabla_{a}Q_{\theta_1}(s_t,a)\big|_{a=\mu_{\phi}(s_t)}\,\nabla_{\phi}\,\mu_{\phi}(s_t).$$
Furthermore, the critic networks ($Q_{\theta_1}$ and $Q_{\theta_2}$) are trained together with the target critic networks $Q_{\theta_1'}$ and $Q_{\theta_2'}$. The target Q-value is obtained by taking the minimum of the two target-network Q-values predicted for the next state and adding the discounted reward, as commonly calculated using the Bellman equation; the minimum operation ensures reduced overestimation bias. The discount factor ($\gamma$), generally between 0 and 1, determines the weight given to future rewards: a high $\gamma$ gives more importance to compounded future rewards, whereas a small value prioritizes immediate rewards, creating a trade-off between long-term and instant rewards.
The critics aim to reduce the difference between the target Q-value (from the Bellman equation) and the estimated Q-values, known as the temporal difference error (TDE), using the loss function stated in Equation (8). Given $M$ as the size of the mini-batch of state–action data sampled from the stored experiences in memory at each time step, the critic networks obtain a better approximation of the true Q-value by minimizing this loss function. Eventually, the parameters of both critic networks are updated using gradient descent with a reasonable learning rate ($\alpha_c$).
The target networks ($\phi'$, $\theta_1'$, and $\theta_2'$) are updated through a soft update process in which the target parameters slowly track the current policy and critic network parameters. A predefined target smoothing factor ($\tau$) is used to provide a more stable learning process.
During the training process, exploration noise is often added to the actions to allow more robust policy development. Sampled Gaussian noise characterized by a normal distribution with a reasonable mean ($\mu_n$) and standard deviation ($\sigma_n$) can be used to modify the selected actions before they are applied. In this way, the agent can explore various regions of the action space through controlled randomness, which prevents it from settling on a suboptimal policy and is especially crucial in discovering effective continuous deterministic actions. However, high values of $\mu_n$ and $\sigma_n$ lead to more exploration but may cause overfitting, resulting in less deterministic behavior; it is therefore important to maintain a balance without compromising training or policy stability. In some cases, a standard deviation decay rate is used, allowing a controlled reduction in the amount of added noise as the agent takes progressive steps and helping to shift from exploration to exploitation.
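The following sketch summarizes one TD3 update step, combining the clipped double-Q target, the temporal-difference critic loss, the delayed actor update, the soft target update, and target-policy smoothing noise, together with the exploration noise applied when interacting with the environment. Hyperparameter values, the joint critic optimizer, and function handles are illustrative assumptions and do not correspond to the trained agent's settings.

```python
import torch
import torch.nn.functional as F

GAMMA, TAU, POLICY_DELAY = 0.99, 5e-3, 2   # illustrative hyperparameters

def exploratory_action(actor, s, sigma=0.1):
    """Exploration noise added when interacting with the environment
    (distinct from the target-policy smoothing noise used in the update)."""
    with torch.no_grad():
        a = actor(s)
        return (a + sigma * torch.randn_like(a)).clamp(-1.0, 1.0)

def td3_update(batch, actor, critic1, critic2, t_actor, t_critic1, t_critic2,
               actor_opt, critic_opt, step, act_noise=0.1, noise_clip=0.5):
    """One TD3 update from a replay mini-batch (s, a, r, s2, done).
    critic_opt is assumed to jointly optimize both critic networks."""
    s, a, r, s2, done = batch

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(a) * act_noise).clamp(-noise_clip, noise_clip)
        a2 = (t_actor(s2) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q target reduces overestimation bias.
        q_target = r + GAMMA * (1.0 - done) * torch.min(
            t_critic1(s2, a2), t_critic2(s2, a2))

    # Critic update: minimize the temporal-difference (Bellman) error.
    critic_loss = F.mse_loss(critic1(s, a), q_target) + \
                  F.mse_loss(critic2(s, a), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Delayed actor and soft target updates.
    if step % POLICY_DELAY == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for target, online in ((t_actor, actor), (t_critic1, critic1),
                               (t_critic2, critic2)):
            for p_t, p in zip(target.parameters(), online.parameters()):
                p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```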
Furthermore, the TD3PG agent offers several notable advantages that make it highly suitable for improving the effectiveness of such control tasks. It is specifically designed to handle problems with continuous action spaces, such as the regulation of power converters. This design choice eliminates the need for sampling from a probability distribution, as the agent can directly output continuous actions. Consequently, the complexity of decision making is reduced. Additionally, the use of the minimum operation in the Bellman equation, combined with clipped double Q-learning critic networks, ensures a more stable training process by mitigating issues related to overestimated Q-values and training divergence. Moreover, delayed updates of the actor and target networks, in comparison to the critics, result in decorrelated target Q-values, which contributes to reducing variance and facilitates efficient training [41].
6. Limitations and Future Works
Despite the impressive fault-resistant capabilities of the TD3VOC, it should be noted that achieving the optimal policy for effective real-time inverter control still requires approximately 20–25 h of episodic iteration. The duration of training is predominantly influenced by the processor power, the number of CPU cores, and the available memory. In addition, the environment plays a significant role in determining the required computational power. In this study, we developed a highly detailed grid-interfaced photovoltaic inverter environment in MATLAB/Simulink® with which the RL agent interacted. An average model could also be used to train the agent; however, it is uncertain whether the same agent would perform well when faced with a more elaborate model that includes switching noise and variability in feedback signals.
The results presented in this article are obtained using the twin delayed deep deterministic policy gradient agent; however, other continuous-action agents such as soft actor–critic, trust region policy optimization, and model-based policy optimization could also be explored. A comparative study could be conducted to determine the best candidate for developing data-driven controllers in terms of training time, the number of hyperparameters to select, and adaptability to various grid conditions. It would also be interesting to compare the formulated fault ride-through approach with other established methods such as sliding mode and model reference adaptive controllers.
The proposed agent-based current controller can be trained and implemented for various other grid-feeding inverter control applications in weak microgrids that require a robust fault ride-through mechanism, such as wind energy conversion systems, grid-forming battery energy storage systems, fuel cell systems, and other inverter control schemes featuring an inner current loop. This direct design method can also be applied to develop robust data-driven controllers for electrical machines that are controlled using cascaded field-oriented control or the direct torque control scheme to provide better torque handling capabilities, smoother acceleration, and increased energy efficiency.
In addition, the experimental validation of such advanced control techniques remains a subject of particular interest to numerous researchers today. Implementing RL agents for optimal regulation of the inverter during dynamic grid conditions will require an embedded system that combines advanced hardware and software features. It will be necessary to utilize high-performance microcontroller units (MCUs), digital signal processors (DSPs), or systems on chips (SoCs) capable of handling both the RL algorithms and real-time inverter control with appropriate communication interfaces. Xilinx field-programmable gate array (FPGA) boards, OPAL-RT real-time simulators, or the dSPACE MicroLabBox are suggested due to their compatibility with MATLAB/Simulink®, customizable software options, and expandable hardware. There is immense potential for RL-based current controllers to enhance the performance and reliability of inverter systems in various domains, particularly in the context of developing modern and intelligent energy systems.