Haotian Zhang (장호천)1, Chen Wang (왕천)1, Minju Lee (이민주)2, Myoung Hoon Lee (이명훈)††, Jun Moon (문준)†
               
- 1 Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea
- 2 KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea
                        
 
            
            
            
            
            
            
            
               
                  
Key words: Deep reinforcement learning, energy dispatch, topology control, emergency load shedding
             
            
          
         
            
                  1. Introduction       	
With the large-scale adoption of renewable energy, the global energy structure is undergoing profound changes. Fossil energy sources are gradually being depleted, while advances in clean energy technologies have made renewable sources such as solar and wind progressively more economical than traditional energy sources. Innovative models such as microgrids and decentralized energy systems have facilitated renewable energy penetration, accelerated the energy transition [1], and promoted low-carbon development [2]. Modern energy systems are evolving toward multi-energy complementarity and smart, low-carbon operation, with the deep coupling of electricity, natural gas, hydrogen, and other energy carriers, together with enhanced energy storage and demand-side flexibility [3], driving the reconfiguration of traditional energy systems. However, this transformation also brings great challenges, especially the volatility and instability of renewable energy. With the large-scale integration of variable renewable energy (VRE) sources such as wind and solar, the power system faces greater challenges in balancing supply and demand [4], especially when the traditional grid struggles to adapt to growing electricity demand and the increased penetration of renewable energy, which exacerbates the risk of supply-demand imbalance [5]. In addition, the complexity of urban energy systems places higher demands on energy efficiency and policy assessment; current modeling of urban energy systems still faces challenges in technical design, building design, urban climate, system design, and policy assessment, such as model complexity, data quality, and uncertainty [6].
               
In this context, ensuring the security, stability, and efficiency of the power system has become an urgent problem. Energy dispatch, topology control, and emergency load shedding are the three core tasks for keeping the power grid operating stably. However, traditional model-based control methods for these three tasks show clear limitations when dealing with complex dynamic changes [7]. This is because power systems increasingly exhibit nonlinearities, uncertainties, and stochasticity, which make it difficult for physical models to capture actual operating conditions. Moreover, under rapidly changing power demand and energy supply conditions, traditional methods may respond too slowly to achieve optimal control, which affects the overall efficiency and reliability of the power grid [8]. For example, in [9], the parameter configuration and siting of a hybrid energy storage system are optimized by constructing a simplified frequency response (SFR) model and combining it with an explicit gradient calculation method. In [10], a modified SFR model is constructed and the inertia control and droop control parameters of the wind turbine are optimized to enhance primary frequency regulation performance. In [11], a two-stage fractional-order PID-fractional-order PI (FOPID-FOPI) controller is proposed for direct power control of DFIG-based wind energy systems. All of [9-11] achieve stable grid operation through individual scheduling and direct control, but these methods still rely on physical modeling. In the face of these limitations, there is an urgent need for more flexible and efficient control and optimization tools to achieve stable power system operation [12] and to meet the challenges of the three core tasks of energy dispatch, topology control, and emergency load shedding [13,14].
               
Among these three tasks, energy dispatch requires precise generation planning to adapt to fluctuating electricity demand and an unstable renewable energy supply; topology control optimizes the grid structure and dynamically adjusts power flow to prevent transmission line overloads and improve system security [15]; and emergency load shedding serves as the last line of defense for grid security by intelligently and selectively disconnecting some loads in extreme situations to prevent cascading failures from triggering large-scale blackouts [16]. These tasks are crucial for power system stability and security, but traditional methods struggle to cope with the high dimensionality, dynamic changes, and complex coupling relationships involved. Therefore, there is an urgent need for intelligent optimization methods that provide more flexible, real-time solutions under uncertainty, ensuring the safe and efficient operation of power grids under complex conditions.
               
Deep reinforcement learning (DRL) [17] has emerged as a key research focus for addressing power system challenges. DRL algorithms can deal with complex dynamic environments in the power system and overcome the limitations of traditional physical models in nonlinear and stochastic problems through a data-driven approach [18,19]. It is worth noting that [18] mainly reviewed the application of DRL in power system optimization and control, emphasized the advantages of DRL over traditional physical model-based methods in dealing with complexity and uncertainty, and summarized the research progress of DRL in fields such as smart grids, demand-side management, and power markets. In contrast, this paper focuses on the application of DRL in power dispatch, topology switching, and emergency load shedding, concentrating on its optimization policies and technological breakthroughs in high-dimensional dynamic environments and analyzing how it can enhance the flexibility and real-time decision-making capability of power grids. In addition, DRL not only learns scheduling and control policies automatically but also responds to system changes in real time, significantly improving the operational efficiency and responsiveness of the power system [20-22]. Specifically, [20] introduces a multi-agent DRL-based volt-VAR optimization (VVO) algorithm that optimizes scheduling and control policies, improving operational efficiency and responsiveness. [21] presents a data-driven, model-free DRL-based load frequency control (LFC) method, achieving faster response, stronger adaptability, and enhanced frequency regulation under renewable energy uncertainties. [22] develops adaptive DRL-based emergency control schemes, reinforcing grid security and resilience through robust policies for generator dynamic braking and under-voltage load shedding. Moreover, optimizing energy dispatch, topology control, and emergency load shedding with DRL effectively balances supply-demand fluctuations while flexibly managing grid dynamics to ensure stable system operation [23].
               
               
                     1.1 Main Contributions
The related review papers [18,23] and [24] all provide a systematic overview of DRL applications in power systems, covering its basic principles, algorithm classification, and research progress in the fields of power dispatch, demand response, power markets, and operation control. Compared with this paper, [18,23] and [24] focus more on the basic theory of DRL, its algorithmic development, and its overall application prospects.
                  
Different from existing review studies, this paper systematically surveys the current state of DRL research in power systems and its applications, and thoroughly discusses the three key control means for stable power system operation: energy dispatch, topology control, and emergency load shedding. By comprehensively analyzing existing research results and the technological evolution, this paper summarizes the advantages of DRL in power system optimization, reveals the limitations and challenges of current research, and further proposes possible future research directions and improvement strategies to promote the practical application and development of DRL in smart grids. The specific contributions are as follows:
                  
                  1. Relevant methods of DRL and their applications in power systems are systematically
                     summarized, covering three control tools for the stable operation of power systems:
                     energy dispatch, topology control, and emergency load shedding, providing researchers
                     with a clear overview of the current state of research.
                  
2. Future research directions in the areas of energy dispatch, topology control, and emergency load shedding are explored, with a focus on key challenges such as multi-task coordination, renewable energy integration, safety constraints, and Sim2Real transfer. This paper proposes enhancing task optimization through multi-task reinforcement learning and hierarchical reinforcement learning, improving DRL adaptation to renewable energy uncertainty by combining meta-learning and probabilistic modeling, and ensuring grid security by using constraint-based reinforcement learning. Meanwhile, transfer learning and online model-free DRL are emphasized to bridge the gap between simulation and the actual grid and to promote the secure deployment and application of DRL.
                  
                  In the remainder of this paper, Section 2 will provide a comprehensive overview of
                     the fundamentals of DRL and advanced techniques. Section 3 will describe the applications
                     of DRL to three power system problems: energy dispatch, topology control, and emergency
                     load shedding. Section 4 will discuss several potential future research directions.
                     Finally, we conclude the paper in Section 5.
                  
                
             
            
                  2. Review of Deep Reinforcement Learning 	
               This section establishes the basic formulation of reinforcement learning (RL) problems
                  and introduces key concepts such as the Q-function and the Bellman equation. These
                  foundational concepts provide support for understanding subsequent algorithms. Next,
                  we discuss classical RL algorithms, categorizing them into value-based and policy-based
                  methods. Finally, we will introduce several advanced RL techniques, including DRL,
                  Deterministic Policy Gradient, Actor-Critic methods, hierarchical RL, and Safe RL.
               
               
                     2.1 Reinforcement Learning
                  RL is a branch of machine learning that focuses on how an agent can make sequential
                     decisions in uncertain environments to maximize cumulative rewards. Mathematically,
                     the decision-making problem can be modeled as a Markov decision process (MDP), which
consists of a state space $S$, an action space $A$, a transition probability function $P(\bullet | s,\: a)$ that maps each state-action pair $(s,\: a)\in S\times A$ to a probability distribution over the state space, and a reward function $r(s,\: a): S\times A → R$.
                  
                  In the MDP setting, the environment begins from an initial state $s_{0}\in S$. At
each time step $t\in\{0,\: 1,\: ...\}$, given the current state $s_{t}\in S$, the agent selects an action $a_{t}\in A$ and, based on the current state-action pair $(s_{t},\: a_{t})$, receives a corresponding reward $r(s_{t},\: a_{t})$. Subsequently, the next state $s_{t+1}$ is randomly generated according to the transition probability $P(s_{t+1}| s_{t},\: a_{t})$. The agent's policy $\pi(a | s)$ is a mapping from a state $s$ to a distribution over the action space $A$, which specifies the actions that should be taken in a given state $s$. The goal of the agent is to find an optimal policy $\pi^{*}$, though this policy may not be unique. The MDP process is illustrated
                     in Fig. 1.
                  
                  
                        
                        
Fig. 1. Decision process of Markov decision process.
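For illustration, the following minimal Python sketch steps an agent through the interaction loop above using a Gymnasium-style interface; the environment name and the random policy are placeholders rather than a power-system model.

```python
# A minimal sketch of the agent-environment loop above, using a Gymnasium-style
# interface. "CartPole-v1" stands in for a power-system environment; the random
# policy is purely illustrative.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset()                    # initial state s_0
gamma, discount, ret = 0.99, 1.0, 0.0     # discount factor and running return R_0

for t in range(200):
    action = env.action_space.sample()    # a_t ~ pi(a | s_t)
    state, reward, terminated, truncated, _ = env.step(action)  # s_{t+1}, r(s_t, a_t)
    ret += discount * reward              # accumulate the discounted return
    discount *= gamma
    if terminated or truncated:
        break
```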
                      
                  
                        2.1.1 Value Function and Optimal Policy
                     To maximize long-term cumulative rewards after the current time $t$, the return $R_{t}$
over a finite time horizon $T$ can be expressed as:

$$R_{t}=\sum_{k=0}^{T}\gamma^{k}\, r(s_{t+k},\: a_{t+k})$$

where the discount factor $\gamma\in[0,\: 1]$ is a parameter used to discount future rewards.
                     
To find the optimal policy, some algorithms rely on the value function $V_{\pi}(s)$, which represents the expected return obtained when starting from a given state $s$ and following the agent's policy $\pi$:

$$V_{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s\right]$$
Similarly, the action-value function $Q$ represents the value of taking action $a$ in state $s$ and thereafter following policy $\pi$:

$$Q_{\pi}(s,\: a)=\mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s,\: a_{t}=a\right]$$
                     The value function $V$ and action-value function $Q$ can be expressed iteratively
through the Bellman equation:

$$V_{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\left[r(s,\: a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,\: a)}\left[V_{\pi}(s')\right]\right]$$

$$Q_{\pi}(s,\: a)=r(s,\: a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,\: a),\: a'\sim\pi(\cdot\mid s')}\left[Q_{\pi}(s',\: a')\right]$$
                     The optimal policy $\pi^{*}$ is the policy that maximizes the long-term cumulative
return, which can be expressed as:

$$\pi^{*}=\arg\max_{\pi}V_{\pi}(s),\quad\forall s\in S$$
                   
                
               
                     2.2 Deep Reinforcement Learning
                  The journey from RL to DRL has undergone a long developmental process. In classical
                     tabular RL such as Q-learning, the state and action spaces are usually small, allowing
                     the approximate value function to be represented as a Q-value table. In this case,
                     these methods are often able to find the exact optimal value function and policy [25,26]. However, in many real-world problems, the state and action spaces are large or continuous,
                     and the system dynamics are highly complex. Therefore, value-based RL struggles to
                     compute or store a huge Q-value table for all state-action pairs.
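As a point of reference for the tabular setting described above, the sketch below stores Q-values in a small table and applies the standard Q-learning update with an epsilon-greedy behaviour policy; the state/action counts and hyperparameters are illustrative.

```python
# Tabular Q-learning sketch for the small, discrete setting described above.
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))       # the Q-value table
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise act greedily w.r.t. the table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def q_learning_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (TD target - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```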
                  
                  To address this problem, researchers developed function approximation methods, using
                     parameterized function classes such as linear functions or polynomial functions to
                     approximate the Q function. For policy-based RL, finding an appropriate policy class
                     to achieve optimal control is also crucial in high-dimensional complex tasks. With
                     advances in deep learning, the use of artificial neural networks (ANN) for function
                     approximation or policy parameterization has become increasingly popular in DRL. The
                     theoretical foundations and historical evolution of deep learning [27], its breakthroughs in representation learning [28], and optimization techniques for training deep models [29] have all contributed to the widespread adoption of ANN in this field. Specifically,
                     DRL can be implemented in the following ways:
                  
                  
                  (1) Value-based methods
                  
                  
                  
In temporal difference (TD) learning [30] and Q-learning [31], a Q-network can be used to approximate the Q function. For TD learning, the update rule for the parameters $\omega$ is:

$$\omega\leftarrow\omega+\alpha\left[r_{t}+\gamma Q_{\omega}(s_{t+1},\: a_{t+1})-Q_{\omega}(s_{t},\: a_{t})\right]\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t})$$

where $\alpha$ is the learning rate and the gradient $\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t})$ can be efficiently computed using backpropagation. In Q-learning, approximating the Q function with nonlinear function approximators (e.g., ANNs) often results in instability and divergence during training. To address these issues, researchers developed Deep Q-Networks (DQN) [32], which significantly improved the stability of Q-learning through two key techniques.
                  
▪ Experience replay: Instead of training on consecutive episodes, a widely used technique is to store transition experiences $(s_{t},\: a_{t},\: r_{t},\: s_{t+1})$ in a database called the replay buffer $D$. At each step, a batch of transitions is randomly sampled from the replay buffer $D$ for Q-learning updates. This method improves data efficiency by recycling past experiences and reduces the variance in learning updates. More importantly, uniformly sampling from the replay buffer breaks the temporal correlation between samples, enhancing the stability and convergence of Q-learning.
                  
▪ Target network: Another technique is to introduce a target network $Q_{\omega^{-}}(s,\: a)$, which is a clone of the Q-network $Q_{\omega}(s,\: a)$. The parameters $\omega^{-}$ of the target network remain frozen during training and are only updated periodically. Specifically, a batch of transitions $\{(s_{i},\: a_{i},\: r_{i},\: s_{i+1})\}$ is sampled from the replay buffer $D$ for training, and the Q-network is updated by minimizing the following loss:

$$L(\omega)=\mathbb{E}_{(s_{i},\: a_{i},\: r_{i},\: s_{i+1})\sim D}\left[\left(r_{i}+\gamma\max_{a'}Q_{\omega^{-}}(s_{i+1},\: a')-Q_{\omega}(s_{i},\: a_{i})\right)^{2}\right]$$

This optimization process can be seen as finding an approximate solution to the Bellman optimality equation [33]. The key is to use the target network $Q_{\omega^{-}}$ to compute the maximum action value, rather than using the Q-network $Q_{\omega}$ directly. After a fixed number of updates, the target parameters $\omega^{-}$ are replaced with the newly learned $\omega$. This technique mitigates instability and prevents short-term oscillations during training.
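A compact PyTorch sketch of how the two techniques fit together is given below; the network sizes, buffer capacity, and hyperparameters are illustrative assumptions rather than the settings used in the cited works.

```python
# Compact sketch combining experience replay and a target network.
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())       # clone of the Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)                 # stores (s, a, r, s', done)
gamma = 0.99

def dqn_update(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)  # uniform sampling breaks correlation
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # frozen target network
        target = r.float() + gamma * target_net(s_next.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def sync_target():
    """Called every fixed number of updates to refresh the target network."""
    target_net.load_state_dict(q_net.state_dict())
```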
                  
                  Moreover, several notable DQN variants can further improve performance, such as Double
                     DQN [34] and Dueling DQN [35].
                  
Double DQN addresses the overestimation bias in DQN by decoupling action selection from action evaluation: one Q-network is used to select the action and the other to evaluate its value. Dueling DQN introduces a novel network architecture that decomposes the estimation of the Q-value into two parts: one part estimates the state value function $V(s)$, and the other estimates the state-dependent action advantage function $A(s,\: a)$. Through this decomposition, the dueling network allows the agent to better assess the importance of different states during learning, even in states where the choice of action has little effect on the outcome. This architecture helps improve learning efficiency, particularly when the Q-values of different actions differ only slightly in certain states.
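The dueling decomposition can be written as a small PyTorch module, sketched below with illustrative layer sizes; the mean-advantage subtraction is the usual trick for keeping $V$ and $A$ identifiable.

```python
# Sketch of the dueling architecture: shared features feed a V(s) head and an
# A(s, a) head, which are recombined into Q(s, a).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)              # V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.features(s)
        v, adv = self.value_head(h), self.advantage_head(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + adv - adv.mean(dim=1, keepdim=True)
```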
                  
                  
                  (2) Policy-based methods
                  
                  
                  
Due to their strong generalization capabilities, artificial neural networks are widely
                     used to parameterize control policies, especially when state and action spaces are
                     continuous. The resulting policy network $NN(a | s ;\theta)$ takes the state as input
                     and outputs the probabilities of selecting actions. In Actor-Critic methods, both
                     a Q-network $NN(s ,\:  a ;\omega)$ and a policy network $NN(a | s ;\theta)$ are typically
used, where the "actor" updates the policy parameters $\theta$ using the policy gradient, and the "critic" updates the parameters $\omega$ to estimate the Q-values. The gradient of an ANN can be efficiently computed using backpropagation [29].
                  
                  When using function approximation, theoretical analyses of value-based and policy-based
                     RL methods are relatively scarce and are usually limited to linear function approximations.
                     Furthermore, one major challenge for value-based methods when dealing with large or
                     continuous action spaces is the difficulty of executing the maximization step. For
                     instance, when using deep artificial neural networks to approximate the Q function,
                     finding the optimal action $a$ is not trivial due to the nonlinearity and complexity
                     of $Q_{\omega}(s,\: a)$.
                  
                
               
                     2.3 DRL Related Algorithms
                  In this subsection, we will explore several advanced DRL algorithms, including Deep
                     Deterministic Policy Gradient (DDPG), Actor-Critic (AC), Proximal Policy Optimization
                     (PPO), Hierarchical Reinforcement Learning (HRL), and Safe Reinforcement Learning
                     (SRL). These algorithms demonstrate exceptional performance in handling complex environments
                     and tasks, driving the application and development of DRL. The framework structure
                     of the DRL algorithm is shown in Fig. 2.
                  
                  
                        
                        
Fig. 2. Classification of Deep Reinforcement Learning Algorithms by Framework: Value-based,
                           Policy-based, and Actor-Critic-based.
                        
                      
                  
                        2.3.1 Deterministic Policy Gradient (DPG)
                     Most reinforcement learning algorithms focus on stochastic policies $a\sim\pi_{\theta}(a
                        | s)$, but deterministic policies $a =\pi_{\theta}(s)$ [36] are more suitable for many real-world control problems with continuous state and
                        action spaces. This is because, on the one hand, many existing controllers for physical
                        systems (such as PID and robust control) are deterministic, making deterministic policies
                        a better match for practical control architectures, especially in power system applications.
                        On the other hand, deterministic policies are more sample-efficient, as their policy
                        gradient only integrates over the state space, while the gradient of stochastic policies
                        integrates over both the state and action spaces. Similar to stochastic policies,
deterministic policies also have a policy gradient theorem [37], which expresses the gradient as:

$$\nabla_{\theta}J(\pi_{\theta})=\mathbb{E}_{s\sim\rho^{\pi}}\left[\nabla_{\theta}\pi_{\theta}(s)\,\nabla_{a}Q_{\pi}(s,\: a)\big|_{a=\pi_{\theta}(s)}\right]$$

where $\rho^{\pi}$ denotes the discounted state distribution induced by the policy.
                     A key issue with deterministic policies is the lack of exploration due to deterministic
                        action selection. A common approach to address this is to apply exploration noise
                        to the deterministic policy, such as adding Gaussian noise $\xi$ to the policy $a
                        =\pi_{\theta}(s)+\xi$.
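A minimal sketch of this exploration scheme is shown below, assuming a small deterministic actor network; the state/action dimensions, noise scale, and action bounds are illustrative.

```python
# Gaussian exploration around a deterministic policy a = pi_theta(s).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

def act(state: torch.Tensor, noise_std: float = 0.1, a_max: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        a = actor(state)                        # deterministic action pi_theta(s)
    a = a + noise_std * torch.randn_like(a)     # additive Gaussian exploration noise xi
    return a.clamp(-a_max, a_max)               # keep the action within its bounds
```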
                     
                   
                  
                        2.3.2 Actor Critic Methods
                     Actor critic algorithms [38] combine the advantages of policy gradient and value iteration. In this framework,
                        the actor and critic networks perform different functions. Specifically, the actor
                        network is responsible for policy optimization, outputting the action $a\sim\pi_{\theta}(a
                        | s)$ for a given state by directly generating actions using the parameterized policy
                        $\pi_{\theta}$. The Critic network estimates the state-action value function $Q_{\pi}(s,\:
                        a)$ or advantage function $A_{\pi}(s,\:  a)$, providing feedback to guide the actor's
                        optimization. During policy updates, the actor adjusts its parameters based on feedback
from the critic, with the update following the policy gradient theorem defined as:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\, Q_{\pi}(s,\: a)\right]$$

where $\nabla_{\theta}\log\pi_{\theta}(a | s)$ is the gradient of the log-policy, and $Q_{\pi}(s,\: a)$, estimated by the critic, guides the actor's action adjustments.
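The sketch below illustrates one actor-critic update in PyTorch, using a learned state-value critic and a TD-error advantage as the feedback signal; the network shapes, learning rates, and the TD-target construction are illustrative assumptions rather than a specific published variant.

```python
# One actor-critic update step (sketch). Assumes batched tensors:
# s, s_next: [B, 4], a: [B] (discrete actions), r, done: [B, 1].
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next, done):
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done)
    advantage = (td_target - v_s).squeeze(1).detach()   # critic feedback for the actor

    # Critic: regress V(s) toward the TD target.
    critic_loss = nn.functional.mse_loss(v_s, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend grad_theta log pi(a|s) * advantage (via the negative loss).
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```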
                     
                     Despite the success of Actor-Critic methods in many complex tasks, they are prone
                        to issues like high variance, slow convergence, and local optima. Therefore, various
                        variants have been developed to improve their performance, including:
                     
Advantage Actor-Critic (A2C): A2C [39] introduces the advantage function $A(s,\: a)=Q_{\pi_{\theta}}(s,\:  a)- V(s)$, where the Q-function in the policy gradient is replaced by the difference between the state-action value and the state value function $V(s)$, reducing the variance of the policy gradient and improving stability.
                     
Asynchronous Advantage Actor-Critic (A3C): A3C [40] improves sample efficiency and stability by training multiple agents in parallel. Each agent interacts with its own copy of the environment using a different exploration policy, and the agents update the shared global parameters asynchronously, enhancing convergence speed and performance.
                     
                     Soft Actor Critic (SAC): SAC [41] operates under the maximum entropy RL framework, using stochastic policies and incorporating
                        an entropy term $H(\pi_{\theta}(\bullet | s_{t}))$ in the objective function to encourage
                        exploration and improve policy robustness, while maintaining efficient learning.
                     
                   
                  
                        2.3.3 Proximal Policy Optimization (PPO)
                     PPO [42] is a policy gradient-based optimization method that balances stability and sample
                        efficiency, widely applied in tasks involving both continuous and discrete action
                        spaces. The core idea of PPO is to limit the magnitude of policy updates, preventing
                        the policy from collapsing due to large updates and ensuring a smoother training process.
                        The objective function in PPO is based on clipping, which limits the magnitude of
changes between the old and new policies. The objective function is:

$$L^{CLIP}(\theta)=\mathbb{E}_{t}\left[\min\left(r_{t}(\theta)\hat{A_{t}},\:\mathrm{clip}\left(r_{t}(\theta),\: 1-\epsilon,\: 1+\epsilon\right)\hat{A_{t}}\right)\right]$$
                     where $r_{t}(\theta)=\dfrac{\pi_{\theta}(a | s)}{\pi_{\theta_{old}}(a | s)}$ is the
                        probability ratio between the new and old policies, $\hat{A_{t}}$ is the estimated
                        advantage function, and $\epsilon$ is a hyperparameter controlling the update range.
                        By using this clipping mechanism, PPO ensures that the magnitude of policy updates
                        remains within a specified range, preventing the risk of policy collapse.
                     
                     Despite PPO's strong performance, policy exploration remains a challenge in high-dimensional
                        and complex tasks. To ensure a broad exploration, entropy regularization can be introduced
                        to PPO, maintaining the randomness of the policy and preventing premature convergence
                        to local optima.
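For reference, the clipped surrogate with the optional entropy bonus can be written compactly as below; the clipping range and entropy coefficient are illustrative defaults, not values prescribed by [42].

```python
# Clipped surrogate loss with an optional entropy bonus.
import torch

def ppo_loss(log_prob_new, log_prob_old, advantage, entropy,
             clip_eps: float = 0.2, entropy_coef: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(log_prob_new - log_prob_old)                     # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    surrogate = torch.min(unclipped, clipped).mean()                   # clipped objective
    return -(surrogate + entropy_coef * entropy.mean())                # minimise the negative
```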
                     
                   
                  
                        2.3.4 Hierarchical Reinforcement Learning (HRL)
                     HRL [43] improves learning efficiency and addresses complex problems by decomposing tasks
                        into multiple hierarchical sub-tasks. The key idea of HRL is to introduce agents at
                        different levels, where higher-level agents select abstract sub-tasks or "options,"
                        while lower-level agents focus on executing these sub-tasks. This hierarchical structure
                        effectively reduces the decision space, speeds up convergence, and performs well in
                        long-term decision-making problems.
                     
                     One significant advantage of HRL is that it allows agents to learn at different levels,
                        improving the generalization ability of the policy. This structure is particularly
                        suitable for complex tasks in power systems, such as topology control and emergency
                        load shedding. The high-level policy determines whether to adjust the topology or
                        shed load during grid failures, while the low-level policy focuses on specific actions,
                        such as shutting down transmission lines or disconnecting loads. By task decomposition,
                        HRL significantly reduces the complexity of learning and enhances the system's response
                        efficiency and stability.
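A schematic two-level decision step for such a scheme might look as follows; high_level_policy, low_level_policies, and their methods are hypothetical names used purely for illustration, not an existing API.

```python
# Hypothetical two-level decision step for the hierarchical scheme described above.
def hierarchical_step(state, high_level_policy, low_level_policies):
    # High level: pick an abstract option, e.g. "adjust_topology" or "shed_load".
    option = high_level_policy.select_option(state)
    # Low level: the chosen option's sub-policy outputs a concrete grid action,
    # e.g. switching a specific line or disconnecting a specific load.
    action = low_level_policies[option].select_action(state)
    return option, action
```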
                     
                     However, HRL faces challenges such as defining and designing appropriate hierarchies
                        and sub-tasks. Overly complex hierarchies may destabilize the training process, while
                        overly simple hierarchies may not fully leverage the advantages of hierarchical policies.
                        Coordination between high-level and low-level policies is also a challenge, ensuring
                        effective collaboration between different levels of policies in long-term and short-term
                        goals. Furthermore, HRL typically requires longer training times, making training
                        efficiency a bottleneck.
                     
                   
                  
                        2.3.5 Safe Reinforcement Learning (SRL)
                     SRL [44] focuses on ensuring system safety during the RL process, particularly in high-risk
                        fields like power system control, where unsafe decisions must be avoided. In traditional
                        RL, agents explore and interact with the environment, trying different policies to
                        maximize cumulative rewards. However, this free exploration can lead to unsafe behaviors,
                        especially in power systems, where unsafe actions may cause instability, equipment
                        damage, or serious service disruptions. SRL aims to optimize long-term returns while
                        constraining agent behavior, ensuring that safety limits are not violated during learning
                        and in the final policy.
                     
                     SRL methods often introduce constraints to ensure safety. A common approach is to
                        embed safety constraints into the optimization objective, forming a constrained RL
                        problem. The agent must not only maximize rewards but also satisfy specific safety
                        constraints. These constraints can be enforced through penalty mechanisms, where the
                        agent is penalized for taking unsafe actions, forcing it to optimize behavior to avoid
                        future constraint violations.
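As an illustration of such a penalty mechanism, the reward-shaping sketch below subtracts a weighted measure of line-overload and voltage-limit violations from the task reward; the thresholds and penalty weight are assumptions.

```python
# Reward shaping with safety penalties: constraint violations (line overloads,
# voltage-limit breaches) reduce the reward.
def shaped_reward(base_reward, line_loadings, voltages,
                  load_limit=1.0, v_min=0.95, v_max=1.05, penalty_weight=10.0):
    overload = sum(max(0.0, rho - load_limit) for rho in line_loadings)
    voltage_violation = sum(max(0.0, v_min - v) + max(0.0, v - v_max) for v in voltages)
    return base_reward - penalty_weight * (overload + voltage_violation)
```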
                     
                     
Another method is to adopt a safe exploration policy, limiting the agent's action space during exploration to ensure that dangerous behaviors are not executed. For example, model-based SRL methods build an environment model to predict potential outcomes under different policies and avoid executing high-risk actions in advance.
                     
                     SRL methods offer significant advantages in many applications. First, they ensure
                        that agents follow safety constraints during both learning and policy execution, avoiding
                        dangerous behaviors. This is particularly useful in high-risk fields, where SRL can
                        effectively reduce system failures and losses by balancing policy optimization and
                        safety constraints. Moreover, SRL's safe exploration mechanisms restrict unsafe operations
                        during training, preventing system damage from improper exploration. SRL also improves
                        the robustness of agents, helping them maintain stable performance in uncertain and
                        unexpected environments.
                     
                     However, implementing SRL also poses challenges. Designing appropriate safety constraints
                        is a key challenge—constraints that are too strict can limit the agent's learning
                        space, while overly loose constraints may lead to safety risks. SRL also faces computational
                        pressures in high-dimensional dynamic environments, especially in complex systems
                        where computing resources and training time become bottlenecks. Furthermore, safety
                        constraints can limit the agent's exploration ability, affecting the efficiency of
                        policy optimization and convergence speed. Therefore, balancing exploration and exploitation
                        while ensuring safety remains a key challenge in SRL implementation.
                     
                   
                
             
            
                  3. Application in Power System	
In recent years, DRL has been widely used to maintain power system stability. DRL techniques enable more efficient execution of energy dispatch, topology control, and emergency load shedding, which are key measures for ensuring power system stability. Compared with traditional control methods, DRL can provide more accurate control and significantly improve system stability and response efficiency when dealing with complex grid environments. In the following, we introduce the specific applications of DRL in each of these tasks. Table 1 in the Appendix summarizes the applications of DRL algorithms in power systems.
               
               
                     3.1 Energy Dispatch
                  The growth of distributed energy and electric vehicles makes balancing power supply
                     and demand more critical, while grid scaling increases uncertainty. DRL, as a data-driven
                     method, offers adaptability by optimizing energy dispatch without relying on precise
                     models, learning through interaction with the grid. Fig. 3 visually illustrates the energy dispatch problem in a small power system, covering
                     key elements such as renewable energy sources, generating stations, loads, and energy
                     storage devices. The core objective of using DRL in this system is to ensure a balance
                     between supply and demand and to minimize energy losses. Fig. 4 further details the operational flow of the DRL methodology for optimizing the energy
                     dispatch problem and thereby reducing energy losses.
                  
                  
                        
                        
Fig. 3. Power system energy dispatch of [Fig. 7, 14].
                        
                      
                  
                        
                        
Fig. 4. DRL-based methods for optimal energy dispatch in power systems [Fig. 1, 48].
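To give a flavour of how such an objective can be encoded for a DRL agent, the reward sketch below penalizes supply-demand imbalance and network losses; the weighting coefficients and signal names are illustrative assumptions, not the formulations used in the cited works.

```python
# Illustrative dispatch reward: penalise supply-demand imbalance and network losses.
def dispatch_reward(generation_mw, demand_mw, losses_mw,
                    balance_weight=1.0, loss_weight=0.1):
    imbalance = abs(sum(generation_mw) - demand_mw)    # supply-demand mismatch
    return -(balance_weight * imbalance + loss_weight * losses_mw)
```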
                        
                      
Literature [45] models the power scheduling process as a dynamic sequential control problem and proposes an optimized coordinated scheduling policy to cope with wind and demand perturbations, modeling the problem as a Markov decision process and combining Monte Carlo methods with the Q-learning RL algorithm to reduce long-term operation and maintenance costs. Literature [46] employs an improved DRL technique, deep deterministic policy gradient (DDPG), to address the limitations of existing dispatch schemes that rely on forecasting and modeling; by modeling the dynamic dispatch problem as an MDP, it achieves an adaptive response to the uncertainty of renewable energy generation and demand fluctuations in an integrated energy system (IES). Literature [47] proposes a robust and scalable deep Q-network (DQN) based DRL optimization algorithm to balance economic cost and environmental emissions in renewable-energy-integrated electricity scheduling, decomposing the power scheduling problem into MDPs and training DRL models in a multi-agent simulation environment. Literature [48] proposes a soft actor-critic (SAC) based autonomous control approach to address the challenges posed by large-scale renewable energy integration for active power scheduling in modern power systems; by introducing Lagrange multipliers and imitation learning, it significantly improves the renewable energy consumption (utilization) rate and the robustness of the algorithm. It is worth noting that literature [45]-[48] does not compare against traditional control methods, although their performance in grid control is still excellent. To demonstrate the advantages of DRL more directly, literature [49] and [14] are compared and analyzed against traditional methods. Literature [49] proposes an optimal scheduling method based on the asynchronous advantage actor-critic (A3C) DRL algorithm, which copes with the uncertainty of renewable energy sources and users' energy demand in the IES by constructing the state space, action space, and reward function, thus realizing economic scheduling of the system and the complementarity of multiple energy sources. It is worth noting that literature [49] compares against traditional methods and achieves the same performance as mathematical programming methods. Literature [14] adopts a SAC-based DRL method to solve the energy dispatch problem in distributed grids by optimizing a reward function related to energy dispatch to find the optimal policy. Experimental results show that the method achieves results similar to the traditional model predictive control (MPC) algorithm on a 6-bus power system.
                  
                  
                        
                        
Table 1 Task types and applications of DRL algorithms in power systems (Energy dispatch:
                           ED / Topology control: TC / Emergency load shedding: ELS).
                        
                     
                     
                           
                              
| Literature | Field | Algorithm | Type | Objective | Improvement |
| [45] | ED | Q-learning | Value-based | Cost reduction | In contrast to traditional short-term cost optimization, the model provides long-term optimal scheduling. |
| [46] | ED | DDPG | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The method is more adaptable than traditional methods, as it does not rely on predictive information or knowledge of the system's uncertainty distribution. |
| [47] | ED | DQN | Value-based | Cost reduction | The model can handle multiple objectives to accommodate the complexity of modern power systems. |
| [48] | ED | SAC | Actor-Critic | Cost reduction | The model enhances the robustness of the system and improves the consumption (utilization) rate of renewable energy. |
| [49] | ED | A3C | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model outputs dispatch policies in real time, avoiding reliance on accurate source-load forecasts and effectively coping with source-load uncertainty and volatility. |
| [14] | ED | SAC | Actor-Critic | Cost reduction | The model strengthens the DRL policy and enhances the security of the resulting policy. |
| [50] | TC | DDDQN | Value-based | Safe operation of the power grid | The study suggests applying the DRL algorithm to larger, more constrained power systems and incorporating more control variables (e.g., line switching, transformer regulation, generation scheduling). |
| [51] | TC | Cross-Entropy | Policy-based | Safe operation of the power grid | The model further analyzes the topology control behavior of the agents and shows that generalization can be improved while maintaining the simplicity of the approach. |
| [52] | TC | AC | Actor-Critic | Improve topology control accuracy | The model effectively addresses the problems of sparse rewards and high-dimensional state spaces in the grid environment. |
| [53] | TC | SAC | Actor-Critic | Improve topology control accuracy | The model improves the accuracy of the DRL algorithm by introducing an attention mechanism. |
| [54] | TC | SAC | Actor-Critic | Safe operation of the power grid | The model designs a pre-training scheme for the SAC algorithm to improve the robustness and efficiency of the algorithm. |
| [55] | ELS | PARS | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model exploits the derivative-free nature and parallelism of the proposed algorithm to substantially improve training efficiency. |
| [56] | ELS | DDDQN | Value-based | Improve emergency load shedding accuracy | The model effectively extracts the topological features of the grid, optimizes the emergency control policy, and improves the stability and economy of the grid under frequent topological changes. |
| [57] | ELS | DDPG | Actor-Critic | Safe operation of the power grid | The model improves the training process and enhances the generalization of the control policy by introducing voltage information into the reward function. |
| [58] | ELS | DQN | Value-based | Improve emergency load shedding accuracy | The model improves the system's adaptability and online operational performance in unseen scenarios through spatio-temporal information modeling and the applied control policies. |
| [59] | ELS | PPO | Policy-based | Improving adaptive responses to uncertainty in the grid | The model enhances the frequency recovery capability of the system through a DRL-optimized control policy and avoids triggering the system safety constraints. |
| [60] | ELS | DDQN | Value-based | Improve emergency load shedding accuracy | The model improves training efficiency and decision quality through knowledge-enhanced DRL. |
                        
                     
                   
                  In summary, the application of DRL in power system scheduling significantly improves
                     the adaptive capability and robustness of the system, enabling it to effectively cope
                     with fluctuations in renewable energy sources and demand uncertainty. By optimizing
                     the scheduling policy and reducing the reliance on complex mathematical models, DRL
                     provides new solutions for achieving cost-effective energy management. 
                  
                
               
                     3.2 Topology Control
                  Topology control is crucial for maintaining power system stability by dynamically
                     adjusting transmission line connections, optimizing current distribution, and  reducing
                     load on specific lines or nodes. The challenge lies in quickly finding the optimal
                     solution within the complex topology while meeting real-time regulation and stability
                     requirements. Fig. 5 illustrates a topology control problem for a small power system consisting of substations,
                     loads, generators, and energy storage devices. Topology control aims to optimize system
                     operation by adjusting the configuration of power lines or substations to improve
power supply reliability and efficiency. Fig. 6 further illustrates how DRL operates in the topology control task, where the topology is mainly optimized by dynamically adjusting the substation configuration.
                  
                  
                        
                        
Fig. 5. Power system topology control of [Fig. 1, 50]. 
                        
                      
                  
                        
                        
Fig. 6. DRL-based methods for topology control problems in power systems   [Fig. 2, 52].
                        
                      
Literature [50] uses a DRL method based on dueling double deep Q-learning (DDDQN) with prioritized experience replay to achieve secure power system operation through autonomous topology adjustment, addressing the difficulty traditional methods face in coping with complex grid control. Literature [51] uses a Cross-Entropy Method RL approach to train intelligent agents to control power flow in the power grid through topology switching operations, addressing the stability problem of power grid operation under uncertain generation and demand conditions.
                     Facing the problem of high-dimensional topological space in power systems, literature
                     [52] proposes a HRL-based method for grid topology regulation. The method extends the
                     actor-critic model in DRL to a hierarchical structure, where the upper layer generates
                     a desired topology configuration scheme based on the current state of the grid, and
                     the lower layer is responsible for executing specific policies to achieve this goal.
                     With this hierarchical architecture, the complexity of the high-dimensional state-action
space in topology control is effectively mitigated, which improves the accuracy and efficiency of the control policy. Literature [53] proposes a SAC-based DRL method incorporating an attention mechanism, aiming to manage the power system by adjusting the grid topology. By assigning different weights to the input features, the attention mechanism allows the neural network to focus on the features most relevant to the current task, improving the robustness and computational efficiency of the model. Literature [54] proposes a DRL-based approach to achieve stable autonomous control of power systems
                     through autonomous topology optimization control using the SAC algorithm, while introducing
                     an imitation learning (IL)-based pre-training scheme to cope with the huge action
                     space in topology switching and the vulnerability of DRL agents in power systems.
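As a schematic illustration of how a value-based agent chooses among an enumerated set of discrete topology actions (line switchings, substation/bus reconfigurations, or "do nothing"), consider the hypothetical sketch below; the feature size, action count, and network shape are assumptions, not the configurations used in [50]-[54].

```python
# Hypothetical greedy selection over an enumerated set of discrete topology actions.
import torch
import torch.nn as nn

n_features, n_topology_actions = 128, 200
q_net = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                      nn.Linear(256, n_topology_actions))

def select_topology_action(observation_vector: torch.Tensor) -> int:
    """Return the index of the topology action with the highest estimated Q-value."""
    with torch.no_grad():
        q_values = q_net(observation_vector)
    return int(q_values.argmax())
```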
                  
                  In summary, recent studies have demonstrated the great potential of DRL in power system
                     topology control. Aiming at the complex and huge topology space problem, which is
                     difficult to cope with by traditional methods, these emerging methods seek the optimal
                     policy through autonomous decision-making and dynamic adjustment, which not only improves
                     the stability and reliability of the power system but also enhances its ability to
                     cope with unexpected situations.
                  
                
               
                     3.3 Emergency Load Shedding
                  Emergency load shedding is a key measure for  maintaining power system stability during
                     faults, overloads, or unexpected events. It reduces grid stress and protects equipment,
                     but the challenge lies in making quick decisions, assessing load priorities, coordinating
                     equipment communication, and managing diverse load characteristics for efficient shedding.
                     As shown in Fig. 7, the emergency load shedding problem in a medium-sized power system is illustrated,
                     where Bus 4, Bus 7, and Bus 18 are heavily loaded areas. During an overload event,
                     shedding loads in these areas helps alleviate system stress. Fig. 8 demonstrates that during an emergency load shedding task, voltage initially drops
                     due to the fault but gradually recovers as the load is reduced. The curve represents
                     the voltage recovery standard, indicating that voltage should remain above the curve
                     throughout the recovery process. To optimize load-shedding decisions, Fig. 9 presents a flowchart of using DRL to search for an effective shedding policy.
                  
In [55], an accelerated DRL algorithm named "PARS" was developed to address the low computational efficiency and poor scalability of existing load shedding methods for power system voltage stability control, improving power system stability under uncertainty and rapidly changing operating conditions through efficient and fast adaptive control. Literature [56] proposed a contingency control scheme for under-voltage load shedding (UVLS) based on the GraphSAGE-DDDQN method, aiming to solve the insufficient adaptability and generalization ability of existing UVLS techniques in coping with topology changes in power networks, so as to improve the reliability and economy of the control policy. Literature [57] proposes a load shedding control policy based on the DDPG deep reinforcement learning algorithm to address the challenge of realizing autonomous voltage control in the event of power system faults; it effectively improves the stable operation of the power system by constructing a network training dataset and establishing a reward function that reflects the operational characteristics of the power grid. Literature [58] proposes a load shedding policy based on a deep Q-network (DQN-LS) and a convolutional long short-term memory network (ConvLSTM), aiming to improve the stability and voltage restoration capability of large-scale power systems in dynamic load shedding problems through real-time, fast, and accurate load shedding decisions, especially under varied and uncertain power system fault conditions. Literature [59] proposes a data-driven, DRL-based emergency load curtailment approach that transforms the load curtailment policy into an MDP and optimizes it using the proximal policy optimization (PPO) algorithm, addressing the model complexity and matching risk faced by traditional event-driven load curtailment policies in renewable energy systems and thus improving the system's adaptability and efficiency under multi-fault scenarios. Literature [60] proposes a knowledge-enhanced DDQN DRL approach for intelligent event-driven load shedding (ELS), which addresses the shortcomings of traditional methods in efficiency and timeliness by building an MDP based on transient stability simulation and incorporating knowledge that removes repetitive and negative actions, thereby improving training efficiency and the quality of decision-making for the effective formulation of load shedding measures.
                  
                  
                        
                        
Fig. 7. Power system emergency load shedding [Fig. 13, 14].
                        
                      
                  
                        
                        
Fig. 8. Power system emergency load shedding voltage recovery curve [Fig. 4, 14].
                        
                      
                  
                        
                        
Fig. 9. DRL-based methods for emergency load shedding in power systems [Fig. 3, 61].
                        
                      
In summary, recent studies have demonstrated the potential of a variety of DRL-based load shedding control methods for dealing with power system contingencies. Through innovative algorithm design and flexible decision-making mechanisms, these methods effectively enhance the stability and responsiveness of the power system in the face of faults and uncertainties, while improving the accuracy of load shedding.
                  
                
             
            
                  4. Challenges and Future Directions	
Although DRL has achieved many successes in enhancing power system stability through the three main control means of energy dispatch, topology control, and emergency load shedding, it still faces several challenges: current research mainly focuses on single-task optimization and rarely considers multi-task coordination; the uncertainty and diversity of renewable energy sources complicate their integration; and DRL must be deployed reliably while safeguarding grid security. In addition, the gap between the simulation environment and real grid operation (Sim2Real) limits the wide application of DRL in real power systems. Therefore, this section systematically discusses these key challenges and explores feasible solutions to further promote the research and development of DRL in power systems.
               
               
                     4.1 Multi-task Coordination
In current studies, energy dispatch, topology control, and emergency load shedding are usually modeled and optimized separately as individual tasks. In the operation of real power systems, however, these tasks are highly interdependent, and their interactions may lead to locally rather than globally optimal decisions. Traditional methods often rely on heuristic search or staged optimization, which lack a global perspective and therefore struggle to achieve efficient synergy among energy dispatch, topology control, and emergency load shedding. At the same time, most existing DRL studies adopt single-task learning and ignore the intrinsic connections among the three tasks, producing policies that are difficult to generalize to multi-task scenarios. Existing reinforcement learning methods such as DQN and PPO are mainly designed for single-objective optimization, which makes it difficult to coordinate multiple control means in a complex power system environment. Future studies should therefore explore a unified multi-task coordination framework, as shown in Fig. 10, in which DRL can dynamically switch between tasks and perform comprehensive optimization from a global perspective, thereby enhancing the security and efficiency of the entire power grid. Multi-task learning (MTL) can be used to develop DRL models that simultaneously optimize energy dispatch, topology control, and emergency load shedding, leveraging information sharing across tasks. Meanwhile, hierarchical reinforcement learning (HRL) can improve the flexibility and adaptability of decision-making by constructing high-level and low-level policies, where the high-level policy intelligently selects the appropriate control means and the low-level policies execute the corresponding optimization tasks. In addition, multi-objective reinforcement learning (MORL) can adaptively adjust the priorities of the three control means by introducing weighting factors, achieving integrated scheduling optimization that avoids local optima and promotes global stability. For constructing global features, graph neural networks (GNNs), self-attention, and related methods can help DRL learn the dynamic changes of the grid structure and can be combined with optimization policies such as energy storage management to improve system responsiveness to unexpected events. Such multi-task coordination not only enhances the flexibility of the grid but also improves the robustness and reliability of decision-making under uncertainty, promoting the development of intelligent power system regulation toward greater efficiency and security.
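As a minimal sketch of the MTL/HRL idea described above (and not a published framework), the following PyTorch snippet uses one shared grid-state encoder with separate heads for dispatch, topology control, and load shedding, plus a simple high-level selector that chooses which control means to apply; the layer sizes, action counts, and input dimension are illustrative assumptions.

    # Minimal multi-task policy sketch: shared encoder + task-specific heads.
    import torch
    import torch.nn as nn

    class MultiTaskGridPolicy(nn.Module):
        def __init__(self, obs_dim=128, n_gen=10, n_topo_actions=50, n_shed_actions=20):
            super().__init__()
            # Shared representation of the grid state, reused by all three tasks (MTL)
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
            )
            # Task-specific heads
            self.dispatch_head = nn.Linear(256, n_gen)            # continuous generator set-point adjustments
            self.topology_head = nn.Linear(256, n_topo_actions)   # discrete switching actions
            self.shedding_head = nn.Linear(256, n_shed_actions)   # discrete shedding actions
            # Simple high-level selector in the spirit of HRL: which control means to use now
            self.task_selector = nn.Linear(256, 3)

        def forward(self, obs):
            z = self.encoder(obs)
            return {
                "task_logits": self.task_selector(z),
                "dispatch": torch.tanh(self.dispatch_head(z)),
                "topology_logits": self.topology_head(z),
                "shedding_logits": self.shedding_head(z),
            }

    policy = MultiTaskGridPolicy()
    out = policy(torch.randn(1, 128))
    chosen_task = out["task_logits"].argmax(dim=-1)   # high-level choice of control means

In such a design, the per-task losses would be weighted and back-propagated through the shared encoder, so that experience gathered on one control task regularizes the representations used by the others.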
                  
                  
                        
                        
Fig. 10. DRL-based multi-task coordination framework.
                      
                
               
                     4.2 Renewable Energy Integration
As renewable energy sources such as wind and solar are increasingly integrated into modern power systems, managing their variability and uncertainty has become a key challenge for DRL research. As shown in Fig. 11, current research has focused on optimizing renewable energy consumption in specific scenarios, while future research should address how to effectively integrate multiple renewable energy sources and optimize their dynamic allocation in the grid. Different types of renewable energy (e.g., wind, photovoltaic, and hydropower) differ significantly in output patterns, temporal characteristics, and spatial distribution, yet standard DRL methods tend to assume fixed input characteristics of the power system and ignore the change in system dynamics caused by varying shares of renewable generation; as a result, existing methods lack adaptability and generalization when facing the multi-source integration problem. Future research should therefore consider how to enable DRL to quickly adapt to different renewable energy environments and improve its ability to cope with grid volatility. For example, DRL methods based on meta-learning can enable the agent to learn and adapt quickly in different types of renewable energy environments and improve its ability to handle changes in system dynamics. In addition, uncertainty modeling techniques such as Bayesian DRL or probabilistic graphical models (PGMs) can allow DRL to handle the uncertainty of renewable energy sources more effectively, thus improving the robustness of scheduling decisions. Some studies have attempted to optimize the co-dispatch of wind and PV systems using DRL combined with energy storage management, but challenges in generalization and adaptability remain. Future work should therefore focus on improving the stability of DRL under uncertain environments so that it can effectively integrate different renewable energy resources under dynamic grid conditions and ultimately improve the overall stability and operational efficiency of the grid.
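To illustrate the meta-learning direction mentioned above, the sketch below shows a first-order (Reptile-style) meta-update across synthetic "renewable-mix" scenarios: a copy of the meta-parameters is adapted to each scenario for a few gradient steps, and the meta-parameters are then nudged toward the adapted ones. The scenario generator, the regression objective, and all hyperparameters are stand-ins for a real dispatch-policy training loop and are not drawn from the cited works.

    # Minimal first-order meta-learning (Reptile-style) sketch over synthetic scenarios.
    import copy
    import torch
    import torch.nn as nn

    def make_scenario(wind_share):
        """Synthetic task: map a 24-h net-load pattern to a target that depends on the wind share."""
        x = torch.randn(64, 24)
        y = (1.0 - wind_share) * x.mean(dim=1, keepdim=True) + 0.1 * torch.randn(64, 1)
        return x, y

    meta_model = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 1))
    meta_lr, inner_lr, inner_steps = 0.1, 1e-2, 5

    for meta_iter in range(100):
        wind_share = torch.rand(1).item()          # sample a renewable-mix "environment"
        x, y = make_scenario(wind_share)

        # Inner loop: adapt a copy of the meta-parameters to this scenario
        fast_model = copy.deepcopy(meta_model)
        opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            loss = nn.functional.mse_loss(fast_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Outer (Reptile) update: move meta-parameters toward the adapted parameters
        with torch.no_grad():
            for p_meta, p_fast in zip(meta_model.parameters(), fast_model.parameters()):
                p_meta += meta_lr * (p_fast - p_meta)

The same inner/outer structure could wrap a DRL update instead of the toy regression loss, so that a dispatch agent meta-trained across many renewable mixes adapts within a few episodes to an unseen one.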
                  
                  
                        
                        
Fig. 11. Challenges of DRL in integrating multiple renewable energies.
                      
                
               
                     4.3 Safety Constraints and Sim2Real
Security constraints are crucial for the practical application of DRL in power systems, but existing DRL methods may explore infeasible or even dangerous decisions during training, such as over-adjusting the topology, which increases the vulnerability of the grid, or adopting extreme load shedding policies, which leads to power flow imbalance or even violates grid security standards. Therefore, as shown in Fig. 12, future research needs to explore how to incorporate power system domain knowledge into the DRL decision framework and introduce physical security constraints during training, so that the policies learned by the model always comply with grid security requirements. For example, constrained DRL or barrier function methods can enable DRL to adaptively avoid security risks during learning and ensure that its control policy does not destabilize the power grid. In addition, the application of DRL in power systems faces the Sim2Real (simulation-to-reality) problem: the difference between the simulation environment and actual grid operation may cause a trained model to fail in the real environment. Existing research mainly relies on simulation environments for DRL training, but deviations between the simulation model and the physical characteristics of the real grid mean that a DRL policy that performs well in simulation may lack the generalization ability to cope with the complex operating conditions of the real grid once deployed. Future research should therefore explore how to narrow this gap, for example by using transfer learning and imitation learning to make policy transfer between environments more stable, or by adopting model-free online DRL so that the agent can continuously learn and optimize its control policy directly on the real grid, thereby improving practical adaptability. Finally, the interpretability of DRL remains a key challenge limiting its application in grid control: its decision logic is often hard to understand, making power engineers hesitant to trust its policies and thus hindering deployment in safety-critical systems. To address this, approaches such as attention mechanisms and causal inference can enhance interpretability, making decision-making more transparent and providing a visualizable basis for decisions, thereby improving the trustworthiness and usefulness of DRL in power systems.
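The two safety mechanisms discussed above can be illustrated with a short sketch: a projection step that clips a raw DRL action into an assumed operating envelope before it reaches the grid, and a Lagrangian penalty whose multiplier grows while constraint violations persist. The limits, the violation budget, and the learning rate are made-up values for illustration; this is not tied to a specific published method or to a real grid model.

    # Minimal sketch of a safety projection layer and a Lagrangian constraint penalty.
    import numpy as np

    MAX_SHED_FRACTION = 0.2        # never shed more than 20% of any bus load (assumed limit)
    LINE_LIMIT_MW = 100.0          # assumed thermal limit used by the constraint signal
    VIOLATION_BUDGET_MW = 5.0      # tolerated total overload before the penalty tightens

    def project_action(raw_shed_fractions):
        """Clip the agent's proposed per-bus shedding fractions into the safe envelope."""
        return np.clip(raw_shed_fractions, 0.0, MAX_SHED_FRACTION)

    class LagrangianPenalty:
        """Tracks a dual variable lambda and shapes the reward with the constraint cost."""
        def __init__(self, lr=0.01):
            self.lam = 0.0
            self.lr = lr

        def shaped_reward(self, task_reward, line_flows_mw):
            violation = np.clip(np.abs(line_flows_mw) - LINE_LIMIT_MW, 0.0, None).sum()
            # Dual ascent: lambda grows while violations exceed the budget, relaxes otherwise
            self.lam = max(0.0, self.lam + self.lr * (violation - VIOLATION_BUDGET_MW))
            return task_reward - self.lam * violation

    penalty = LagrangianPenalty()
    safe_action = project_action(np.array([0.05, 0.35, 0.10]))   # -> [0.05, 0.2, 0.1]
    r = penalty.shaped_reward(task_reward=1.0, line_flows_mw=np.array([90.0, 120.0]))

In a constrained DRL training loop, the projected action would be what the environment executes, while the shaped reward (or a separate cost critic) steers the policy away from operating points that repeatedly violate the limits.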
                  
                  
                        
                        
Fig. 12. Safety learning and Sim2Real gap minimization.
                      
                
             
            
                  5. Conclusion	
               Amid global decarbonization efforts, modern power systems are becoming increasingly
                  complex due to the large-scale integration of renewable energy, posing significant
                  challenges to system stability and operational efficiency. DRL has emerged as a promising
                  solution to address these challenges, offering adaptive learning and decision-making
                  capabilities that surpass traditional optimization methods in high-dimensional and
                  dynamic environments. This paper provides a systematic overview of DRL applications
                  in power systems, with a particular focus on its optimization strategies for energy
                  dispatch, topology control, and emergency load shedding.
               
               Our findings highlight the significant advancements of DRL in optimizing these control
                  measures, demonstrating its potential to enhance power system stability, flexibility,
                  and resilience. However, key challenges remain, including multi-task coordination,
                  renewable energy integration, safety constraints, and Sim2Real transferability. Addressing
                  these challenges will ensure the practical deployment and effectiveness of DRL-based
                  solutions in real-world power systems.
               
               As power grids continue to evolve, the insights provided in this paper establish a
                  foundation for future research, guiding the development of more robust, efficient,
                  and safe DRL frameworks. Advancing these research directions will not only drive innovation
                  in power system control but also play a crucial role in supporting the transition
                  toward a more sustainable and intelligent energy infrastructure.
               
             
          
         
            
                  Acknowledgements
               
This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).
                  
                  			
               
             
            
                  
                     References
                  
                     
                        
                        R. Detchon and R. Van Leeuwen, “Policy: Bring sustainable energy to the developing
                           world,” Nature, vol. 508, no. 7496, pp. 309–311, 2014. DOI:10.1038/508309a

 
                     
                        
                        H. Hu, N. Xie, D. Fang and X. Zhang, “The role of renewable energy consumption and
                           commercial services trade in carbon dioxide reduction: Evidence from 25 developing
countries,” Applied Energy, vol. 211, pp. 1229–1244, 2018. DOI:10.1016/j.apenergy.2017.12.019

 
                     
                        
                        J. Wu, J. Yan, H. Jia, N. Hatziargyriou, N. Djilali and H. Sun, “Integrated energy
systems,” Applied Energy, vol. 167, pp. 155–157, 2016. DOI:10.1016/j.apenergy.2016.02.075

 
                     
                        
                        B. Kroposki, “Integrating high levels of variable renewable energy into electric power
                           systems,” Journal of Modern Power Systems and Clean Energy, vol. 5, no. 6, pp. 831–837,
                           2017. DOI:10.1007/s40565-017-0339-3

 
                     
                        
                        M. L. Tuballa and M. L. Abundo, “A review of the development of smart grid technologies,”
                           Renewable and Sustainable Energy Reviews, vol. 59, pp. 710–725, 2016. DOI:10.1016/j.rser.2016.01.011

 
                     
                        
                        J. Keirstead, M. Jennings and A. Sivakumar, “A review of urban energy system models:
                           Approaches, challenges and opportunities,” Renewable and Sustainable Energy Reviews,
                           vol. 16, no. 6, pp. 3847–3866, 2012. DOI:10.1016/j.rser.2012.02.047

 
                     
                        
                        M. F. Zia, E. Elbouchikhi and M. Benbouzid, “Microgrids energy management systems:
A critical review on methods, solutions, and prospects,” Applied Energy, vol. 222,
                           pp. 1033–1055, 2018. DOI:10.1016/j.apenergy.2018.04.103

 
                     
                        
                        S. Impram, S. V. Nese and B. Oral, “Challenges of renewable energy penetration on
                           power system flexibility: A survey,” Energy Strategy Reviews, vol. 31,   no. 100539,
                           pp. 1-12, 2020. DOI:10.1016/j.esr.2020.100539

 
                     
                        
                        D. Liu, Q. Yang, Y. Chen, X. Chen and J. Wen, “Optimal parameters and placement of
                           hybrid energy storage systems for frequency stability improvement,” Protection and
                           Control of Modern Power Systems, vol. 10, no. 2, pp. 40–53, 2025. DOI:10.23919/PCMP.2023.000259

 
                     
                        
                        K. Liu, Z. Chen, X. Li and Y. Gao, “Analysis and control parameters optimization of
                           wind turbines participating in power system primary frequency regulation with the
                           consideration of secondary frequency drop,” Energies, vol. 18, no. 6, pp. 1–19, 2025.
                           DOI:10.3390/en18061317

 
                     
                        
                        M. Dahane, A. Benali, H. Tedjini, A. Benhammou, M. A. Hartani and H. Rezk, “Optimized
                           double-stage fractional order controllers for dfig-based wind energy systems: A comparative
                           study,” Results in Engineering, vol. 25, no. 104584, pp. 1-17, 2025. DOI:10.1016/j.rineng.2025.104584

 
                     
                        
                        L. Cheng and T. Yu, “A new generation of ai: A review and perspective on machine learning
                           technologies applied to smart energy and electric power systems,” International Journal
                           of Energy Research, vol. 43, no. 6, pp. 1928–1973, 2019. DOI:10.1002/er.4333

 
                     
                        
                        M. M. Gajjala and A. Ahmad, “A survey on recent advances in transmission congestion
                           management,” International Review of Applied Sciences and Engineering, vol. 13, no.
                           1, pp. 29–41, 2021. DOI:10.1556/1848.2021.00286

 
                     
                        
                        H. Zhang, X. Sun, M. H. Lee and J. Moon, “Deep reinforcement learning based active
                           network management and emergency load-shedding control for power systems,” IEEE Transactions
                           on Smart Grid, vol. 15, no. 2, pp. 1423-1437, 2023. DOI:10.1109/TSG.2023.3302846

 
                     
                        
                        S. M. Mohseni-Bonab, I. Kamwa, A. Rabiee and C. Chung, “Stochastic optimal transmission
                           switching: A novel approach to enhance power grid security margins through vulnerability
                           mitigation under renewables uncertainties,” Applied Energy, vol. 305, no. 117851,
pp. 1-14, 2022. DOI:10.1016/j.apenergy.2021.117851

 
                     
                        
                        D. Michaelson, H. Mahmood and J. Jiang, “A predictive energy management system using
                           pre-emptive load shedding for islanded photovoltaic microgrids,” IEEE Transactions
                           on Industrial Electronics, vol. 64, no. 7, pp. 5440–5448, 2017. DOI:10.1109/TIE.2017.2677317

 
                     
                        
                        R. S. Sutton, “Reinforcement learning: An introduction,” A Bradford Book, 2018. DOI:10.1017/S0263574799271172

 
                     
                        
                        D. Cao, W. Hu, J. Zhao, G. Zhang, B. Zhang, Z. Liu, Z. Chen and F. Blaabjerg, “Reinforcement
                           learning and its applications in modern power and energy systems: A review,” Journal
of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029–1042, 2020. DOI:10.35833/MPCE.2020.000552

 
                     
                        
                        E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu and J.
                           G. Slootweg, “On-line building energy optimization using deep reinforcement learning,”
IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698–3708, 2018. DOI:10.1109/TSG.2018.2834219

 
                     
                        
                        Y. Zhang, X. Wang, J. Wang and Y. Zhang, “Deep reinforcement learning based volt-var
                           optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol.
                           12, no. 1, pp. 361–371, 2020. DOI:10.1109/TSG.2020.3010130

 
                     
                        
                        Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems:
                           A deep reinforcement learning method with continuous action search,” IEEE Transactions
                           on Power Systems, vol. 34, no. 2, pp. 1653–1656, 2018. DOI:10.1109/TPWRS.2018.2881359

 
                     
                        
                        Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan and Z. Huang, “Adaptive power system emergency
                           control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol.
                           11, no. 2, pp. 1171–1182, 2019. DOI:10.1109/TSG.2019.2933191

 
                     
                        
                        Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications:
                           An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225,
                           2019. DOI:10.17775/CSEEJPES.2019.00920

 
                     
                        
                        Q. Li, T. Lin, Q. Yu, H. Du, J. Li and X. Fu, “Review of deep reinforcement learning
                           and its application in modern renewable power system control,” Energies, vol. 16,
                           no. 10, pp. 1–23, 2023. DOI:10.3390/en16104143

 
                     
                        
                        J. N. Tsitsiklis, “Asynchronous stochastic approximation and q-learning,” Machine
                           learning, vol. 16, pp. 185–202, 1994. DOI:10.1007/BF00993306

 
                     
                        
                        A. Agarwal, S. M. Kakade, J. D. Lee and G. Mahajan, “Optimality and approximation
                           with policy gradient methods in markov decision processes,” in Conference on Learning
Theory. PMLR, vol. 125, pp. 64–66, 2020. https://proceedings.mlr.press/v125/agarwal20a.html

 
                     
                        
                        H. Wang and B. Raj, “On the origin of deep learning,” arXiv preprint arXiv:1702.07800,
                           2017. DOI:10.48550/arXiv.1702.07800

 
                     
                        
Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
                           436–444, 2015. DOI:10.1038/nature14539

 
                     
                        
                        R. Sun, “Optimization for deep learning: theory and algorithms,” arXiv preprint arXiv:1912.08957,
                           2019. DOI:10.48550/arXiv.1912.08957

 
                     
                        
J. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function
                           approximation,” Advances in neural information processing systems, vol. 9, pp. 1-7,
                           1996. https://proceedings.neurips.cc/paper_files/paper/1996/file/e00406144c1e7e35240afed70f34166a-Paper.pdf

 
                     
                        
                        C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, pp. 279–292, 1992.
                           DOI:10.1007/BF00992698

 
                     
                        
                        V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
                           M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through
deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236

 
                     
                        
                        J. Fan, Z. Wang, Y. Xie and Z. Yang, “A theoretical analysis of deep q-learning,”
                           in Learning for dynamics and control. PMLR, vol. 120, pp. 486–489, 2020. https://proceedings.mlr.press/v120/yang20a.html

 
                     
                        
                        H. Van Hasselt, A. Guez and D. Silver, “Deep reinforcement learning with double q-learning,”
                           in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1,
                           pp. 2094-2100, 2016. DOI:10.1609/aaai.v30i1.10295

 
                     
                        
                        Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot and N. Freitas, “Dueling network
                           architectures for deep reinforcement learning,” in International conference on machine
learning. PMLR, vol. 48, pp. 1995–2003, 2016. https://proceedings.mlr.press/v48/wangf16.html

 
                     
                        
                        D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, “Deterministic
                           policy gradient algorithms,” in International conference on machine learning. PMLR,
                           vol. 32, no. 1, pp. 387–395, 2014. https://proceedings.mlr.press/v32/silver14.html

 
                     
                        
                        R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, “Policy gradient methods for
                           reinforcement learning with function approximation,” Advances in neural information
                           processing systems, vol. 12, pp. 1-7, 1999. https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

 
                     
                        
                        T. Degris, M. White and R. S. Sutton, “Off-policy actor-critic,” arXiv preprint arXiv:1205.4839,
                           2012. DOI:10.48550/arXiv.1205.4839

 
                     
                        
                        S. Li, S. Bing and S. Yang, “Distributional advantage actor-critic,” arXiv preprint
                           arXiv:1806.06914, 2018. DOI:10.48550/arXiv.1806.06914

 
                     
                        
                        V. Mnih, “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783,
                           2016. DOI:10.48550/arXiv.1602.01783

 
                     
                        
                        T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu,
                           A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv
                           preprint arXiv:1812.05905, 2018. DOI:10.48550/arXiv.1812.05905

 
                     
                        
                        J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization
                           algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347

 
                     
                        
                        S. Pateria, B. Subagdja, A.-h. Tan and C. Quek, “Hierarchical reinforcement learning:
                           A comprehensive survey,” ACM Computing Surveys (CSUR), vol. 54, no. 5, pp. 1–35, 2021.
                           DOI:10.1145/3453160

 
                     
                        
                        S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang and A. Knoll, “A review of safe
                           reinforcement learning: Methods, theories and applications,” IEEE Transactions on
                           Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11216–11235, 2024.
                           DOI:10.1109/TPAMI.2024.3457538

 
                     
                        
                        F. Meng, Y. Bai and J. Jin, “An advanced real-time dispatching strategy for a distributed
                           energy system based on the reinforcement learning algorithm,” Renewable energy, vol.
                           178, pp. 13–24, 2021. DOI:10.1016/j.renene.2021.06.032

 
                     
                        
                        T. Yang, L. Zhao, W. Li and A. Y. Zomaya, “Dynamic energy dispatch strategy for integrated
                           energy system based on improved deep reinforcement learning,” Energy, vol. 235, no.
                           121377, pp. 1-15, 2021. DOI:10.1016/j.energy.2021.121377

 
                     
                        
                        A. S. Ebrie and Y. J. Kim, “Reinforcement learning-based optimization for power scheduling
                           in a renewable energy connected grid,” Renewable Energy, vol. 230, no. 120886, pp.
                           1-27, 2024. DOI:10.1016/j.renene.2024.120886

 
                     
                        
                        X. Han, C. Mu, J. Yan and Z. Niu, “An autonomous control technology based on deep
                           reinforcement learning for optimal active power dispatch,” International Journal of
                           Electrical Power & Energy Systems, vol. 145, no. 108686, pp. 1-10, 2023. DOI:10.1016/j.ijepes.2022.108686

 
                     
                        
                        X. Zhou, J. Wang, X. Wang and S. Chen, “Optimal dispatch of integrated energy system
                           based on deep reinforcement learning,” Energy Reports, vol. 9, pp. 373–378, 2023.
                           DOI:10.1016/j.egyr.2023.09.157

 
                     
                        
                        I. Damjanović, I. Pavić, M. Puljiz and M. Brcic, “Deep reinforcement learning-based
                           approach for autonomous power flow control using only topology changes,” Energies,
                           vol. 15, no. 19, pp. 1-16, 2022. DOI:10.3390/en15196920

 
                     
                        
                        M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid
                           topology reconfiguration using a simple deep reinforcement learning approach,” in
                           2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879

 
                     
                        
                        Z. Yang, Z. Qiu, Y. Wang, C. Yan, X. Yang and G. Deconinck, “Power grid topology regulation
                           method based on hierarchical reinforcement learning,” in 2024 Second International
                           Conference on Cyber-Energy Systems and Intelligent Energy (ICCSIE), pp. 1–6, 2024.
                           DOI:10.1109/ICCSIE61360.2024.10698617

 
                     
                        
                        Z. Qiu, Y. Zhao, W. Shi, F. Su and Z. Zhu, “Distribution network topology control
                           using attention mechanism-based deep reinforcement learning,” in 2022 4th International
                           Conference on Electrical Engineering and Control Technologies (CEECT), pp. 55–60,
                           2022. DOI:10.1109/CEECT55960.2022.10030642

 
                     
                        
                        X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous
                           control approach for power system topology optimization,” in 2022 41st Chinese Control
                           Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073

 
                     
                        
                        R. Huang, Y. Chen, T. Yin, X. Li, A. Li, J. Tan, W. Yu, Y. Liu and Q. Huang, “Accelerated
                           deep reinforcement learning based load shedding for emergency voltage control,” arXiv
                           preprint arXiv:2006.12667, 2020. DOI:10.48550/arXiv.2006.12667

 
                     
                        
                        Y. Pei, J. Yang, J. Wang, P. Xu, T. Zhou and F. Wu, “An emergency control strategy
                           for undervoltage load shedding of power system: A graph deep reinforcement learning
                           method,” IET Generation, Transmission & Distribution, vol. 17, no. 9, pp. 2130–2141,
                           2023. DOI:10.1049/gtd2.12795

 
                     
                        
                        J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency
                           state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems,
                           vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120

 
                     
                        
                        J. Zhang, Y. Luo, B. Wang, C. Lu, J. Si and J. Song, “Deep reinforcement learning
                           for load shedding against short-term voltage instability in large power systems,”
                           IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 4249–4260,
                           2021. DOI:10.1109/TNNLS.2021.3121757

 
                     
                        
                        H. Chen, J. Zhuang, G. Zhou, Y. Wang, Z. Sun and Y. Levron, “Emergency load shedding
                           strategy for high renewable energy penetrated power systems based on deep reinforcement
                           learning,” Energy Reports, vol. 9, pp. 434–443, 2023. DOI:10.1016/j.egyr.2023.03.027

 
                     
                        
                        Z. Hu, Z. Shi, L. Zeng, W. Yao, Y. Tang and J. Wen, “Knowledge-enhanced deep reinforcement
                           learning for intelligent event-based load shedding,” International Journal of Electrical
                           Power & Energy Systems, vol. 148, no. 108978, pp. 1-11, 2023. DOI:10.1016/j.ijepes.2023.108978

 
                     
                        
                        Y. Zhang, M. Yue and J. Wang, “Adaptive load shedding for grid emergency control via
                           deep reinforcement learning,” in 2021 IEEE Power & Energy Society General Meeting
                           (PESGM). pp. 1-5, 2021. DOI:10.1109/PESGM46819.2021.9638058

 
                   
                
             
Biographies
            
            Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University
               of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea,
               in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang
               University, Seoul, South Korea. His research interests include optimal control, smart
               grid, deep reinforcement learning, and their applications.
            
            
            Chen Wang received the B.S. degree in electronics and computer engineering and the
               M.S. degree in electronic computer engineering from Chonnam National University, South
               Korea, in 2020 and 2022. He is currently pursuing the Ph.D. degree in electrical engineering
               at Hanyang University, Seoul, South Korea. His research interests include smart grid,
               deep reinforcement learning, and their applications.
            
            
            Minju Lee received the B.S. degree in climate and energy systems engineering from
               Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing
               the degree with the Department of Climate and Energy Systems Engineering. Her research
               interests include short-term wind power forecasting and the probabilistic estimation
               of transmission congestion for grid integration.
            
            
            Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook
               National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical
               engineering from the Ulsan National Institute of Science and Technology, Ulsan, South
               Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the
               Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul,
               South Korea. He is currently an Assistant Professor with the Department of Electrical
               Engineering, Incheon National University, Incheon, South Korea. His research interests
               include decentralized optimal control, mean field games, deep reinforcement learning,
               and their applications.
            
            
            Jun Moon is currently an Associate Professor in the Department of Electrical Engineering
               at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical
               and computer engineering, and the M.S. degree in electrical engineering from Hanyang
               University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D.
               degree in electrical and computer engineering from University of Illinois at Urbana-Champaign,
               USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development
               (ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and
               Computer Engineering, Ulsan National Institute of Science and Technology (UNIST),
               South Korea, as an assistant professor. From 2019 to 2020, he was with the School
               of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate
               professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research
               interests include stochastic optimal control and filtering, reinforcement learning,
               data-driven control, distributed control, networked control systems, and mean field
               games.