Haotian Zhang (장호천)1, Chen Wang (왕천)1, Minju Lee (이민주)2, Myoung Hoon Lee (이명훈)††, Jun Moon (문준)†

1 Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea.
2 KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea.
Key words
Deep reinforcement learning, energy dispatch, topology control, emergency load shedding
1. Introduction
With the large-scale popularization of renewable energy, the global energy structure
is undergoing profound changes. Fossil energy sources are gradually being depleted,
while advances in clean energy technologies have made renewable energy sources such
as solar and wind energy progressively more economical than traditional energy sources.
Innovative models such as microgrids and decentralized energy systems have facilitated renewable energy penetration, accelerated the energy transition [1], and promoted low-carbon development [2]. Modern energy systems are evolving towards multi-energy complementarity and smart, low-carbon operation, with the deep coupling of electricity, natural gas, hydrogen, and other energy carriers, together with enhanced energy storage and demand-side flexibility [3], driving the reconfiguration of traditional energy systems. However, this transformation
also brings great challenges, especially the volatility and instability of renewable
energy. With large-scale access to variable renewable energy (VRE) sources such as
wind and solar, the power system faces greater challenges in balancing supply and
demand [4], especially when the traditional grid struggles to adapt to the growing electricity
demand and the increased penetration of renewable energy sources, which exacerbates
the risk of supply-demand imbalance [5]. In addition, the complexity of urban energy systems places higher demands on energy efficiency and policy assessment; current urban energy system models still face challenges spanning technology design, building design, urban climate, system design, and policy assessment, such as model complexity, data quality, and uncertainty [6].
In this context, how to ensure the security, stability and efficiency of the power
system has become an urgent problem. Among them, energy dispatch, topology control,
and emergency load shedding are the three core tasks to ensure the stable operation
of the power grid. However, the traditional model-based control methods for these
three tasks show obvious limitations in dealing with complex dynamic changes [7]. This is because power systems increasingly exhibit nonlinearity, uncertainty, and stochasticity, which make it difficult for physical models to capture actual operating conditions effectively. Moreover, under rapidly changing power demand and energy supply conditions, traditional methods may respond with a lag and fail to achieve optimal control, thus degrading the overall efficiency and reliability of the power grid [8]. For example, [9] optimizes the parameter configuration and siting of a hybrid energy storage system by constructing a simplified frequency response (SFR) model and combining it with an explicit gradient calculation method. In [10], a modified SFR model is constructed and the inertia control and droop control parameters of the wind turbine are optimized to enhance primary frequency regulation performance. In [11], a two-stage Fractional Order Proportional Integral Derivative-Fractional Order Proportional Integral (FOPID-FOPI) controller is proposed for direct power control of DFIG-based wind energy systems. The methods in [9-11] achieve stable grid operation for individual scheduling and direct control tasks, but they still rely on physical modeling. In the face of these limitations,
there is an urgent need for more flexible and efficient control and optimization tools
to achieve stable power system operation [12] and to meet the challenges of the three core tasks of energy dispatch, topology control
and emergency load shedding [13,14].
Among these three tasks, energy dispatching requires precise generation planning to
adapt to fluctuating electricity demand and unstable renewable energy supply; topology
control optimizes the grid structure and dynamically adjusts the power flow to prevent
transmission line overload and improve system security [15]; and emergency load shedding serves as the last line of defense for grid security
by intelligently and selectively disconnecting some of the loads in extreme situations
to prevent cascading failures from triggering large-scale blackouts [16]. These tasks are crucial for power system stability and security, but traditional
methods are difficult to cope with the high dimensionality, dynamic changes and complex
coupling relationships involved. Therefore, there is an urgent need to introduce intelligent
optimization methods to provide more flexible and real-time solutions under uncertain
environments to ensure the safe and efficient operation of power grids under complex
conditions.
Deep reinforcement learning (DRL) [17] has emerged as a key research focus for addressing power system challenges. DRL algorithms
are able to deal with complex dynamic environments in the power system and overcome
the limitations of the traditional physical models in nonlinear and stochastic problems
through a data-driven approach [18,19]. It is worth noting that [18] mainly reviewed the application of DRL in power system optimization and control,
emphasized the advantages of DRL over traditional physical model-based methods in
dealing with complexity and uncertainty, and summarized the research progress of DRL
in various fields, such as smart grids, demand-side management, and power markets.
In contrast, this paper focuses on the application of DRL to power dispatch, topology switching, and emergency load shedding, concentrating on its optimization policies and technological breakthroughs in high-dimensional dynamic environments and analyzing how DRL can enhance the flexibility and real-time decision-making capability of power grids. In addition,
DRL not only learns the scheduling and control policies automatically but also responds
to the system changes in real-time, significantly improving the operational efficiency
and responsiveness of the power system [20-22]. Specifically, [20] introduces a multi-agent DRL-based volt-VAR optimization (VVO) algorithm that optimizes
scheduling and control policies, improving operational efficiency and responsiveness.
[21] presents a data-driven, model-free DRL-based load frequency control (LFC) method,
achieving faster response, stronger adaptability, and enhanced frequency regulation
under renewable energy uncertainties. [22] develops adaptive DRL-based emergency control schemes, reinforcing grid security
and resilience through robust policies for generator dynamic braking and under-voltage
load shedding. Beyond these examples, optimizing energy dispatch, topology control, and emergency load shedding with DRL effectively balances supply-demand fluctuations while flexibly managing grid dynamics to ensure stable system operation [23].
1.1 Main Contributions
The related review papers [18,23] and [24] all provide a systematic overview of DRL applications in power systems, covering
its basic principles, algorithm classification, and research progress in the fields
of power dispatch, demand response, power market and operation control. Compared to
this paper, [18,23] and [24] focus more on the basic theory, algorithm development and overall application prospect
of DRL.
Different from existing related review studies, this paper systematically reviews the current research status of DRL in power systems and its applications, and thoroughly discusses the three key control means for the stable operation of power systems: energy dispatch, topology control, and emergency load shedding. By comprehensively analyzing existing research results and the technological evolution, this paper summarizes the advantages of DRL in power system optimization, reveals the limitations and challenges of current research, and further proposes possible future research directions and improvement strategies to promote the practical application and development of DRL in smart grids. The specific contributions are as follows:
1. Relevant methods of DRL and their applications in power systems are systematically
summarized, covering three control tools for the stable operation of power systems:
energy dispatch, topology control, and emergency load shedding, providing researchers
with a clear overview of the current state of research.
2. Future research directions in the areas of energy dispatch, topology control, and emergency load shedding are explored with a focus on key challenges such as multi-task coordination, renewable energy integration, safety constraints, and Sim2Real transfer. This paper proposes to enhance task optimization through multi-task reinforcement learning and hierarchical reinforcement learning, to improve DRL adaptation to renewable energy uncertainty by combining meta-learning and probabilistic modeling, and to ensure grid security by using constraint-based reinforcement learning. Meanwhile, transfer learning and model-free online DRL are emphasized to bridge the gap between simulation and the actual grid and to promote the secure deployment and application of DRL.
In the remainder of this paper, Section 2 will provide a comprehensive overview of
the fundamentals of DRL and advanced techniques. Section 3 will describe the applications
of DRL to three power system problems: energy dispatch, topology control, and emergency
load shedding. Section 4 will discuss several potential future research directions.
Finally, we conclude the paper in Section 5.
2. Review of Deep Reinforcement Learning
This section establishes the basic formulation of reinforcement learning (RL) problems
and introduces key concepts such as the Q-function and the Bellman equation. These
foundational concepts provide support for understanding subsequent algorithms. Next,
we discuss classical RL algorithms, categorizing them into value-based and policy-based
methods. Finally, we will introduce several advanced RL techniques, including DRL,
Deterministic Policy Gradient, Actor-Critic methods, hierarchical RL, and Safe RL.
2.1 Reinforcement Learning
RL is a branch of machine learning that focuses on how an agent can make sequential
decisions in uncertain environments to maximize cumulative rewards. Mathematically,
the decision-making problem can be modeled as a Markov decision process (MDP), which
consists of a state space $S$, an action space $A$, and a transition probability function
$P(\cdot \mid s,\: a)$, which assigns to each state-action pair $(s,\: a)\in S\times A$ a probability distribution over the state space, along with a reward function $r(s,\: a): S\times A \to \mathbb{R}$.
In the MDP setting, the environment begins from an initial state $s_{0}\in S$. At
each time step $t =\{0,\: 1,\: ...\}$, given the current state $s_{t}\in S$, the agent
selects an action $a_{t}\in A$, and based on the current state-action pair $(s_{t},\:
a_{t})$, receives a corresponding reward $r(s_{t},\: a_{t})$. Subsequently, the next
state $s_{t+1}$ is randomly generated according to the transition probability $P(s_{t+1}|
s_{t},\: a_{t})$. The agent’s policy $\pi(a | s)\in A$ is a mapping from the state
$s$ to a distribution over the action space $A$, which specifies the actions that
should be taken in a given state $s$. The goal of the agent is to find an optimal
policy $\pi^{*}$, though this policy may not be unique. The MDP process is illustrated
in Fig. 1.
Fig. 1. Decision process of Markov decision process.
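To make the interaction loop concrete, the following is a minimal Python sketch of an agent acting in an MDP. It assumes a gym-style environment with reset() and step() methods and treats the policy as a plain function; these interface details are conventions of the sketch, not something prescribed by the formulation above.

```python
# Minimal agent-environment interaction loop for an MDP (illustrative sketch).
def run_episode(env, policy, max_steps: int = 100) -> float:
    """Roll out one episode and return the undiscounted sum of rewards."""
    s = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(s)                    # a_t selected by the policy pi(a | s_t)
        s, r, done, _ = env.step(a)      # environment samples s_{t+1} ~ P(. | s_t, a_t)
        total_reward += r                # accumulate r(s_t, a_t)
        if done:
            break
    return total_reward
```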
2.1.1 Value Function and Optimal Policy
To maximize the long-term cumulative reward after the current time $t$, the return $R_{t}$ over a finite time horizon $T$ can be expressed as:

$$R_{t} = \sum_{k=0}^{T}\gamma^{k}\, r(s_{t+k},\: a_{t+k}),$$

where the discount factor $\gamma\in[0,\: 1]$ is a parameter used to discount future rewards.
To find the optimal policy, some algorithms rely on the value function $V_{\pi}(s)$, which represents the expected return when the agent starts from a given state $s$ and follows its actual policy $\pi$:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s\right].$$
Similarly, the action-value function $Q_{\pi}(s,\: a)$ represents the expected return of taking action $a$ in state $s$ and thereafter following the policy $\pi$:

$$Q_{\pi}(s,\: a) = \mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s,\: a_{t}=a\right].$$
The value function $V$ and action-value function $Q$ can be expressed recursively through the Bellman equations:

$$V_{\pi}(s) = \mathbb{E}_{a\sim\pi,\: s'\sim P}\left[r(s,\: a)+\gamma V_{\pi}(s')\right],\qquad Q_{\pi}(s,\: a) = \mathbb{E}_{s'\sim P}\left[r(s,\: a)+\gamma\,\mathbb{E}_{a'\sim\pi}\left[Q_{\pi}(s',\: a')\right]\right].$$
The optimal policy $\pi^{*}$ is the policy that maximizes the long-term cumulative return, which can be expressed as:

$$\pi^{*} = \arg\max_{\pi} V_{\pi}(s),\quad \forall s\in S.$$
2.2 Deep Reinforcement Learning
The journey from RL to DRL has undergone a long developmental process. In classical
tabular RL such as Q-learning, the state and action spaces are usually small, allowing
the approximate value function to be represented as a Q-value table. In this case,
these methods are often able to find the exact optimal value function and policy [25,26]. However, in many real-world problems, the state and action spaces are large or continuous,
and the system dynamics are highly complex. Therefore, value-based RL struggles to
compute or store a huge Q-value table for all state-action pairs.
To address this problem, researchers developed function approximation methods, using
parameterized function classes such as linear functions or polynomial functions to
approximate the Q function. For policy-based RL, finding an appropriate policy class
to achieve optimal control is also crucial in high-dimensional complex tasks. With
advances in deep learning, the use of artificial neural networks (ANN) for function
approximation or policy parameterization has become increasingly popular in DRL. The
theoretical foundations and historical evolution of deep learning [27], its breakthroughs in representation learning [28], and optimization techniques for training deep models [29] have all contributed to the widespread adoption of ANN in this field. Specifically,
DRL can be implemented in the following ways:
(1) Value-based methods
In temporal difference (TD) learning [30] and Q-learning [31], a Q-network can be used to approximate the Q function. For TD learning, the update rule for the parameters $\omega$ is:

$$\omega \leftarrow \omega + \alpha\left[r_{t}+\gamma Q_{\omega}(s_{t+1},\: a_{t+1})-Q_{\omega}(s_{t},\: a_{t})\right]\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t}),$$

where $\alpha$ is the learning rate and the gradient $\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t})$ can be efficiently computed using backpropagation. In Q-learning, approximating the Q function with nonlinear functions
(e.g., ANN) often results in instability and divergence during training. To address
these issues, researchers developed Deep Q-Networks (DQN) [32], which significantly improved the stability of Q-learning through two key techniques.
▪ Experience replay: Instead of training on consecutive episodes, a widely used technique
is to store each transition experience $(s_{t},\: a_{t},\: r_{t},\: s_{t+1})$ in a database called the replay buffer $D$. At each step, a batch of transitions is randomly
sampled from the replay buffer $D$ for Q-learning updates. This method improves data
efficiency by recycling past experiences and reduces the variance in learning updates.
More importantly, uniformly sampling from the replay buffer breaks the temporal correlation,
enhancing the stability and convergence of Q-learning.
▪ Target network: Another technique is to introduce a target network $Q_{\omega^{-}}(s,\: a)$, which is a clone of the Q-network $Q_{\omega}(s,\: a)$. The parameters $\omega^{-}$ of the target network remain frozen during training and are only updated periodically. Specifically, a batch of transitions $\{(s_{i},\: a_{i},\: r_{i},\: s_{i+1})\mid i=0,\: 1,\: ...\}$ is sampled from the replay buffer $D$ for training, and the Q-network is updated by minimizing the following loss:

$$L(\omega) = \sum_{i}\left[r_{i}+\gamma\max_{a'}Q_{\omega^{-}}(s_{i+1},\: a')-Q_{\omega}(s_{i},\: a_{i})\right]^{2}.$$

This optimization process can be seen as finding an approximate solution to the Bellman optimality equation [33]. The key is to use the target network $Q_{\omega^{-}}$ to compute the maximum action value, rather than using the Q-network $Q_{\omega}$ directly. After a fixed number of updates, the target network parameters $\omega^{-}$ are replaced with the newly learned $\omega$. This technique mitigates instability and prevents short-term oscillations during training.
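To make the two techniques above concrete, the following is a minimal PyTorch sketch of a DQN update that combines a uniform replay buffer with a periodically synchronized target network. The network sizes, learning rate, buffer capacity, and synchronization schedule are illustrative assumptions rather than values taken from the cited works.

```python
# Hedged DQN sketch: replay buffer + target network (illustrative, not from the cited papers).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)                                  # Q(s, .) for each discrete action

q_net, target_net = QNet(4, 2), QNet(4, 2)
target_net.load_state_dict(q_net.state_dict())              # target network starts as a clone
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)                         # stores (s, a, r, s_next, done)
gamma = 0.99

def dqn_update(batch_size: int = 32):
    """One gradient step on a mini-batch sampled uniformly from the replay buffer."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)         # uniform sampling breaks correlation
    s = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a = torch.tensor([t[1] for t in batch], dtype=torch.int64)
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q_w(s_i, a_i)
    with torch.no_grad():                                    # frozen target network
        target = r + gamma * (1 - done) * target_net(s_next).max(1).values
    loss = F.mse_loss(q_sa, target)                          # squared Bellman residual
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every fixed number of updates, the target network is refreshed:
# target_net.load_state_dict(q_net.state_dict())
```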
Moreover, several notable DQN variants can further improve performance, such as Double
DQN [34] and Dueling DQN [35].
Double DQN addresses the overestimation bias in DQN by learning two sets of Q functions: one for selecting actions and the other for evaluating their values. Dueling DQN introduces a novel network architecture that decomposes the estimation of the Q-value into two parts: one part estimates the state value function $V(s)$, and the other estimates the action advantage function $A(s,\: a)$, which depends on both the state and the action. Through this decomposition, the dueling network allows the agent to better assess the importance of different states during learning, even in states where the choice of action has little effect on the outcome. This architecture helps improve learning efficiency, particularly in situations where the Q-values of different actions differ only slightly in certain states.
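As a small illustration of the dueling idea, the sketch below recombines a state-value stream and an advantage stream into Q-values; the layer sizes are assumptions of this sketch, and the mean subtraction follows the commonly used identifiability trick.

```python
# Hedged sketch of a dueling Q-network head (illustrative architecture).
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.value = nn.Linear(64, 1)              # state value stream V(s)
        self.advantage = nn.Linear(64, n_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.features(s)
        v, adv = self.value(h), self.advantage(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + adv - adv.mean(dim=1, keepdim=True)
```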
(2) Policy-based methods
Due to their strong generalization capabilities, artificial neural networks are widely
used to parameterize control policies, especially when state and action spaces are
continuous. The resulting policy network $NN(a | s ;\theta)$ takes the state as input
and outputs the probabilities of selecting actions. In Actor-Critic methods, both
a Q-network $NN(s ,\: a ;\omega)$ and a policy network $NN(a | s ;\theta)$ are typically
used, where the "actor" updates the parameters $\theta$ according to the policy, and
the "critic" updates the parameters $\omega$ according to the Q-values. The gradient
of an ANN can be efficiently computed using backpropagation [29].
When using function approximation, theoretical analyses of value-based and policy-based
RL methods are relatively scarce and are usually limited to linear function approximations.
Furthermore, one major challenge for value-based methods when dealing with large or
continuous action spaces is the difficulty of executing the maximization step. For
instance, when using deep artificial neural networks to approximate the Q function,
finding the optimal action $a$ is not trivial due to the nonlinearity and complexity
of $Q_{\omega}(s,\: a)$.
2.3 DRL Related Algorithms
In this subsection, we will explore several advanced DRL algorithms, including Deep
Deterministic Policy Gradient (DDPG), Actor-Critic (AC), Proximal Policy Optimization
(PPO), Hierarchical Reinforcement Learning (HRL), and Safe Reinforcement Learning
(SRL). These algorithms demonstrate exceptional performance in handling complex environments
and tasks, driving the application and development of DRL. The framework structure
of the DRL algorithm is shown in Fig. 2.
Fig. 2. Classification of Deep Reinforcement Learning Algorithms by Framework: Value-based,
Policy-based, and Actor-Critic-based.
2.3.1 Deterministic Policy Gradient (DPG)
Most reinforcement learning algorithms focus on stochastic policies $a\sim\pi_{\theta}(a
| s)$, but deterministic policies $a =\pi_{\theta}(s)$ [36] are more suitable for many real-world control problems with continuous state and
action spaces. This is because, on the one hand, many existing controllers for physical
systems (such as PID and robust control) are deterministic, making deterministic policies
a better match for practical control architectures, especially in power system applications.
On the other hand, deterministic policies are more sample-efficient, as their policy
gradient only integrates over the state space, while the gradient of stochastic policies
integrates over both the state and action spaces. Similar to stochastic policies,
deterministic policies also have a policy gradient theorem [37], which expresses the gradient as:

$$\nabla_{\theta}J(\pi_{\theta}) = \mathbb{E}_{s\sim\rho^{\pi}}\left[\nabla_{\theta}\pi_{\theta}(s)\,\nabla_{a}Q_{\pi}(s,\: a)\big|_{a=\pi_{\theta}(s)}\right],$$

where $\rho^{\pi}$ denotes the discounted state distribution induced by the policy.
A key issue with deterministic policies is the lack of exploration due to deterministic
action selection. A common approach to address this is to apply exploration noise
to the deterministic policy, such as adding Gaussian noise $\xi$ to the policy $a
=\pi_{\theta}(s)+\xi$.
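The sketch below illustrates this exploration scheme: a deterministic actor outputs the action, Gaussian noise is added before execution, and the result is clipped to actuator limits. The actor architecture, noise scale, and bounds are assumptions of the sketch.

```python
# Deterministic policy with additive Gaussian exploration noise (DDPG-style, illustrative).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

def select_action(state: torch.Tensor, noise_std: float = 0.1,
                  a_min: float = -1.0, a_max: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        a = actor(state)                        # deterministic policy a = pi_theta(s)
    a = a + noise_std * torch.randn_like(a)     # exploration noise xi ~ N(0, sigma^2)
    return a.clamp(a_min, a_max)                # keep the action within physical limits
```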
2.3.2 Actor Critic Methods
Actor critic algorithms [38] combine the advantages of policy gradient and value iteration. In this framework,
the actor and critic networks perform different functions. Specifically, the actor
network is responsible for policy optimization, outputting the action $a\sim\pi_{\theta}(a
| s)$ for a given state by directly generating actions using the parameterized policy
$\pi_{\theta}$. The Critic network estimates the state-action value function $Q_{\pi}(s,\:
a)$ or advantage function $A_{\pi}(s,\: a)$, providing feedback to guide the actor's
optimization. During policy updates, the actor adjusts its parameters based on feedback
from the critic, with the update following the policy gradient theorem, defined as:

$$\nabla_{\theta}J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\, Q_{\pi}(s,\: a)\right],$$

where $\nabla_{\theta}\log\pi_{\theta}(a \mid s)$ represents the policy gradient term, and $Q_{\pi}(s,\: a)$, estimated by the critic, guides the actor's action adjustments.
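A minimal actor update following this gradient is sketched below for a discrete action space: the critic's Q estimate weights the log-probability of the executed action. The network sizes, learning rate, and the omission of the critic's own update are assumptions of the sketch.

```python
# Illustrative actor step of an actor-critic method (critic training omitted).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(s, .) per action
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_step(s: torch.Tensor, a: torch.Tensor):
    """s: (batch, 4) states; a: (batch,) integer actions that were executed."""
    logp = torch.log_softmax(actor(s), dim=1).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # critic provides feedback only
        q = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = -(logp * q).mean()                     # ascend E[log pi_theta(a|s) * Q_pi(s,a)]
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
```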
Despite the success of Actor-Critic methods in many complex tasks, they are prone
to issues like high variance, slow convergence, and local optima. Therefore, various
variants have been developed to improve their performance, including:
Advantaged Actor-Critic (A2C): A2C [39] introduces the advantage function $A(s,\: a)=Q_{\pi_{\theta}}(s,\: a)- V(s)$, where
the Q-function is replaced by the difference between the state-action value and the
state value function $V(s)$, reducing the variance of the policy gradient and improving
stability.
Asynchronous Advantage Actor-Critic (A3C): A3C [40] improves sample efficiency and stability by training multiple agents in parallel. Each agent interacts with its own copy of the environment using a different exploration policy, and the global parameters are updated asynchronously across agents, enhancing convergence speed and performance.
Soft Actor Critic (SAC): SAC [41] operates under the maximum entropy RL framework, using stochastic policies and incorporating
an entropy term $H(\pi_{\theta}(\bullet | s_{t}))$ in the objective function to encourage
exploration and improve policy robustness, while maintaining efficient learning.
2.3.3 Proximal Policy Optimization (PPO)
PPO [42] is a policy gradient-based optimization method that balances stability and sample
efficiency, widely applied in tasks involving both continuous and discrete action
spaces. The core idea of PPO is to limit the magnitude of policy updates, preventing
the policy from collapsing due to large updates and ensuring a smoother training process.
The objective function in PPO is based on clipping, which limits the magnitude of changes between the old and new policies:

$$L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_{t}(\theta)\hat{A}_{t},\: \mathrm{clip}\left(r_{t}(\theta),\: 1-\epsilon,\: 1+\epsilon\right)\hat{A}_{t}\right)\right],$$

where $r_{t}(\theta)=\dfrac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})}$ is the probability ratio between the new and old policies, $\hat{A}_{t}$ is the estimated advantage function, and $\epsilon$ is a hyperparameter controlling the update range.
By using this clipping mechanism, PPO ensures that the magnitude of policy updates
remains within a specified range, preventing the risk of policy collapse.
Despite PPO's strong performance, policy exploration remains a challenge in high-dimensional
and complex tasks. To ensure a broad exploration, entropy regularization can be introduced
to PPO, maintaining the randomness of the policy and preventing premature convergence
to local optima.
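The clipped surrogate objective with an optional entropy bonus can be written compactly, as in the hedged sketch below; tensor shapes and the coefficient values are illustrative assumptions.

```python
# PPO clipped surrogate loss with entropy regularization (illustrative sketch).
import torch

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantages: torch.Tensor, entropy: torch.Tensor,
                  clip_eps: float = 0.2, ent_coef: float = 0.01) -> torch.Tensor:
    """All inputs are 1-D tensors over a batch of collected transitions."""
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()           # maximize clipped surrogate
    return policy_loss - ent_coef * entropy.mean()                # encourage exploration
```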
2.3.4 Hierarchical Reinforcement Learning (HRL)
HRL [43] improves learning efficiency and addresses complex problems by decomposing tasks
into multiple hierarchical sub-tasks. The key idea of HRL is to introduce agents at
different levels, where higher-level agents select abstract sub-tasks or "options,"
while lower-level agents focus on executing these sub-tasks. This hierarchical structure
effectively reduces the decision space, speeds up convergence, and performs well in
long-term decision-making problems.
One significant advantage of HRL is that it allows agents to learn at different levels,
improving the generalization ability of the policy. This structure is particularly
suitable for complex tasks in power systems, such as topology control and emergency
load shedding. The high-level policy determines whether to adjust the topology or
shed load during grid failures, while the low-level policy focuses on specific actions,
such as shutting down transmission lines or disconnecting loads. By task decomposition,
HRL significantly reduces the complexity of learning and enhances the system's response
efficiency and stability.
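The two-level decision structure described above can be outlined as follows; the option set, the placeholder policies, and the environment interface are assumptions of this sketch rather than a design taken from the cited studies.

```python
# Schematic two-level HRL loop: a high-level policy picks an option, a low-level policy executes it.
import random

OPTIONS = ["adjust_topology", "shed_load"]

def high_level_policy(grid_state) -> str:
    # Placeholder: a trained high-level agent would map the grid state to an option.
    return random.choice(OPTIONS)

def low_level_policy(option: str, grid_state) -> dict:
    # Placeholder: a trained low-level agent would choose a primitive action for the option,
    # e.g., which line to switch or which bus loads to disconnect.
    return {"option": option, "action": "noop"}

def hierarchical_step(env, grid_state, option_horizon: int = 5):
    """Run one high-level decision, then let the low-level policy act for a few steps."""
    option = high_level_policy(grid_state)
    total_reward = 0.0
    for _ in range(option_horizon):
        action = low_level_policy(option, grid_state)
        grid_state, reward, done = env.step(action)   # assumed environment interface
        total_reward += reward
        if done:
            break
    return grid_state, total_reward
```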
However, HRL faces challenges such as defining and designing appropriate hierarchies
and sub-tasks. Overly complex hierarchies may destabilize the training process, while
overly simple hierarchies may not fully leverage the advantages of hierarchical policies.
Coordination between high-level and low-level policies is also a challenge, ensuring
effective collaboration between different levels of policies in long-term and short-term
goals. Furthermore, HRL typically requires longer training times, making training
efficiency a bottleneck.
2.3.5 Safe Reinforcement Learning (SRL)
SRL [44] focuses on ensuring system safety during the RL process, particularly in high-risk
fields like power system control, where unsafe decisions must be avoided. In traditional
RL, agents explore and interact with the environment, trying different policies to
maximize cumulative rewards. However, this free exploration can lead to unsafe behaviors,
especially in power systems, where unsafe actions may cause instability, equipment
damage, or serious service disruptions. SRL aims to optimize long-term returns while
constraining agent behavior, ensuring that safety limits are not violated during learning
and in the final policy.
SRL methods often introduce constraints to ensure safety. A common approach is to
embed safety constraints into the optimization objective, forming a constrained RL
problem. The agent must not only maximize rewards but also satisfy specific safety
constraints. These constraints can be enforced through penalty mechanisms, where the
agent is penalized for taking unsafe actions, forcing it to optimize behavior to avoid
future constraint violations.
Another method is to adopt a safe exploration policy, limiting
the agent's action space during exploration to ensure that dangerous behaviors are
not executed. For example, model-based SRL methods build an environment model to predict
potential outcomes under different policies and avoid executing high-risk actions
in advance.
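A minimal version of the penalty mechanism described above is sketched below for a voltage-limit constraint: violations of an assumed safe voltage band are subtracted from the task reward with an assumed penalty weight.

```python
# Penalty-based safety shaping for a voltage constraint (illustrative limits and weight).
from typing import Sequence

V_MIN, V_MAX = 0.95, 1.05       # assumed per-unit voltage limits
PENALTY_WEIGHT = 10.0           # assumed penalty coefficient

def constraint_violation(bus_voltages: Sequence[float]) -> float:
    """Total magnitude by which bus voltages leave the safe band."""
    return sum(max(0.0, V_MIN - v) + max(0.0, v - V_MAX) for v in bus_voltages)

def shaped_reward(task_reward: float, bus_voltages: Sequence[float]) -> float:
    """Training reward: task objective minus a penalty for unsafe operating points."""
    return task_reward - PENALTY_WEIGHT * constraint_violation(bus_voltages)
```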
SRL methods offer significant advantages in many applications. First, they ensure
that agents follow safety constraints during both learning and policy execution, avoiding
dangerous behaviors. This is particularly useful in high-risk fields, where SRL can
effectively reduce system failures and losses by balancing policy optimization and
safety constraints. Moreover, SRL's safe exploration mechanisms restrict unsafe operations
during training, preventing system damage from improper exploration. SRL also improves
the robustness of agents, helping them maintain stable performance in uncertain and
unexpected environments.
However, implementing SRL also poses challenges. Designing appropriate safety constraints
is a key challenge—constraints that are too strict can limit the agent's learning
space, while overly loose constraints may lead to safety risks. SRL also faces computational
pressures in high-dimensional dynamic environments, especially in complex systems
where computing resources and training time become bottlenecks. Furthermore, safety
constraints can limit the agent's exploration ability, affecting the efficiency of
policy optimization and convergence speed. Therefore, balancing exploration and exploitation
while ensuring safety remains a key challenge in SRL implementation.
3. Application in Power System
In recent years, DRL has been widely used in maintaining power system stability. DRL
techniques enable more efficient optimization of the execution of energy dispatch,
topology control, and emergency load shedding, which are key measures for ensuring
power system stability. Compared with traditional control methods, DRL can provide
more accurate control and significantly improve system stability and response efficiency
when dealing with complex grid environments. In the following, we will introduce
the specific applications of DRL in the above tasks respectively. Table 1 summarizes the applications of DRL algorithms in power systems.
3.1 Energy Dispatch
The growth of distributed energy and electric vehicles makes balancing power supply
and demand more critical, while grid scaling increases uncertainty. DRL, as a data-driven
method, offers adaptability by optimizing energy dispatch without relying on precise
models, learning through interaction with the grid. Fig. 3 visually illustrates the energy dispatch problem in a small power system, covering
key elements such as renewable energy sources, generating stations, loads, and energy
storage devices. The core objective of using DRL in this system is to ensure a balance
between supply and demand and to minimize energy losses. Fig. 4 further details the operational flow of the DRL methodology for optimizing the energy
dispatch problem and thereby reducing energy losses.
Fig. 3. Power system energy dispatch of [Fig. 7, 14].
Fig. 4. DRL-based methods for optimal energy dispatch in power systems [Fig. 1, 48].
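As a simple illustration of the kind of objective such DRL agents optimize, the sketch below defines a dispatch reward that penalizes generation cost and any supply-demand imbalance; the cost coefficients and penalty weight are assumptions of the sketch, not values from the cited works.

```python
# Illustrative dispatch reward: negative of generation cost plus an imbalance penalty.
from typing import Sequence

COST_COEF = [0.02, 0.035, 0.05]   # assumed cost coefficients for three dispatchable units
IMBALANCE_PENALTY = 5.0           # assumed weight on the supply-demand mismatch

def dispatch_reward(gen_power: Sequence[float], renewable_power: float,
                    demand: float) -> float:
    """Reward for one dispatch step, to be maximized by the DRL agent."""
    cost = sum(c * p for c, p in zip(COST_COEF, gen_power))
    supply = sum(gen_power) + renewable_power
    imbalance = abs(supply - demand)
    return -(cost + IMBALANCE_PENALTY * imbalance)
```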
Literature [45] models the power scheduling process as a dynamic sequential control problem, and
proposes an optimized coordinated scheduling policy to cope with wind and demand perturbations
by modeling the Markov decision process and combining Monte Carlo methods and Q-learning
RL algorithms in order to reduce the long term operation and maintenance costs. Literature
[46] employs an improved DRL technique of deep deterministic policy gradient (DDPG) to
address the limitations of existing dispatch schemes that rely on forecasting and
modeling by modeling the dynamic dispatch problem as an MDP, thus achieving an adaptive
response to the uncertainty of renewable energy generation and demand fluctuations
in an integrated energy system (IES). Literature [47] proposes a robust and scalable deep Q-network (DQN) based DRL optimization algorithm to solve the problem of balancing economic cost and environmental emissions in renewable-energy-integrated electricity scheduling by decomposing the power scheduling problem into MDPs and training DRL models in a multi-agent simulation environment. Literature [48] proposes a soft actor-critic (SAC) based autonomous control approach to address the challenges posed by large-scale renewable energy integration for active power scheduling in modern power systems by introducing Lagrange multipliers and imitation learning, which significantly improves the consumption rate of renewable energy sources and the robustness of the algorithm. It is worth noting that the methods in [45]-[48] are not compared against traditional control methods, but their performance in grid control is still excellent. To demonstrate the advantages of DRL more directly, the methods in [49] and [14] are compared with traditional approaches. Literature [49] proposes an optimal scheduling method based on the asynchronous advantage actor-critic (A3C) DRL algorithm, which copes with the uncertainty of renewable energy sources and users' energy demand in the IES by constructing the state space, action space, and reward function, thus realizing economic scheduling of the system and the complementarity of multiple energy sources. It is worth noting that [49] is compared with traditional methods and achieves the same performance as mathematical programming methods. Literature [14] adopts a SAC-based DRL method to solve the energy dispatch problem in distributed grids by optimizing a reward function related to energy dispatch to find the optimal policy. Experimental results show that the method achieves results comparable to a traditional model predictive control (MPC) algorithm on a 6-bus power system.
Table 1 Task types and applications of DRL algorithms in power systems (Energy dispatch:
ED / Topology control: TC / Emergency load shedding: ELS).
| Literature | Field | Algorithm | Type | Objective | Improvement |
|---|---|---|---|---|---|
| [45] | ED | Q-learning | Value-based | Cost reduction | Distinguishing from traditional short-term cost optimization, the model provides long-term optimal scheduling. |
| [46] | ED | DDPG | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The method is more adaptable than traditional methods, as it does not need to rely on predictive information or knowledge of the system's uncertainty distribution. |
| [47] | ED | DQN | Value-based | Cost reduction | The model can handle multiple objectives to accommodate the complexity of modern power systems. |
| [48] | ED | SAC | Actor-Critic | Cost reduction | The model enhances the robustness of the system and improves the consumption rate of renewable energy. |
| [49] | ED | A3C | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model outputs dispatch policies in real time, which avoids relying on accurate source-load forecasts and effectively copes with source-load uncertainty and volatility. |
| [14] | ED | SAC | Actor-Critic | Cost reduction | The model enhances the DRL policy and improves its security. |
| [50] | TC | DDDQN | Value-based | Safe operation of the power grid | This literature suggests applying the DRL algorithm to larger, more constrained power systems and incorporating more control variables (e.g., line switching, transformer regulation, generation scheduling). |
| [51] | TC | Cross-Entropy | Policy-based | Safe operation of the power grid | The model further analyzes the topology control behavior of the agents and illustrates the ability to improve the generalization of the model while maintaining the simplicity of the approach. |
| [52] | TC | AC | Actor-Critic | Improve topology control accuracy | The model effectively solves the problem of sparse rewards and high-dimensional state spaces in the grid environment. |
| [53] | TC | SAC | Actor-Critic | Improve topology control accuracy | The model improves the accuracy of the DRL algorithm by introducing an attention mechanism. |
| [54] | TC | SAC | Actor-Critic | Safe operation of the power grid | The model designs a pre-training scheme for the SAC algorithm to improve the robustness and efficiency of the algorithm. |
| [55] | ELS | PARS | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model utilizes the derivative-free nature and parallelism of the proposed algorithm to substantially improve the training efficiency. |
| [56] | ELS | DDDQN | Value-based | Improve emergency load shedding accuracy | The model effectively extracts the topological features of the grid, optimizes the emergency control policy, and improves the stability and economy of the grid under frequent topological changes. |
| [57] | ELS | DDPG | Actor-Critic | Safe operation of the power grid | The model improves the training process and enhances the generalization of the control policy by introducing voltage information into the reward function. |
| [58] | ELS | DQN | Value-based | Improve emergency load shedding accuracy | The model improves the system's adaptability and online operational performance in unseen scenarios through spatio-temporal information modeling and the application of control policies. |
| [59] | ELS | PPO | Policy-based | Improving adaptive responses to uncertainty in the grid | The model enhances the frequency recovery capability of the system through a DRL-optimized control policy and can avoid triggering the system safety constraints. |
| [60] | ELS | DDQN | Value-based | Improve emergency load shedding accuracy | The model improves training efficiency and decision quality through knowledge-enhanced DRL. |
In summary, the application of DRL in power system scheduling significantly improves
the adaptive capability and robustness of the system, enabling it to effectively cope
with fluctuations in renewable energy sources and demand uncertainty. By optimizing
the scheduling policy and reducing the reliance on complex mathematical models, DRL
provides new solutions for achieving cost-effective energy management.
3.2 Topology Control
Topology control is crucial for maintaining power system stability by dynamically
adjusting transmission line connections, optimizing current distribution, and reducing
load on specific lines or nodes. The challenge lies in quickly finding the optimal
solution within the complex topology while meeting real-time regulation and stability
requirements. Fig. 5 illustrates a topology control problem for a small power system consisting of substations,
loads, generators, and energy storage devices. Topology control aims to optimize system
operation by adjusting the configuration of power lines or substations to improve
power supply reliability and efficiency. Fig. 6 further illustrates how DRL operates in the topology control task, where the topology is mainly optimized by dynamically adjusting the substation configuration.
Fig. 5. Power system topology control of [Fig. 1, 50].
Fig. 6. DRL-based methods for topology control problems in power systems [Fig. 2, 52].
Literature [50] uses a DRL method based on dueling double deep Q-learning (DDDQN) and prioritized experience replay to achieve secure power system operation through autonomous topology adjustment, which addresses the difficulty traditional methods face in coping with complex grid control. Literature [51] uses a Cross-Entropy Method RL approach to train artificial intelligence agents that control power flow in the grid through topology switching operations, addressing the stability problem of power grid operation under uncertain generation and demand conditions.
Facing the problem of high-dimensional topological space in power systems, literature
[52] proposes a HRL-based method for grid topology regulation. The method extends the
actor-critic model in DRL to a hierarchical structure, where the upper layer generates
a desired topology configuration scheme based on the current state of the grid, and
the lower layer is responsible for executing specific policies to achieve this goal.
With this hierarchical architecture, the complexity of the high-dimensional state-action
space in topology control is effectively mitigated, which improves the accuracy and
efficiency of the control policy. Literature [53] proposes a DRL method for SAC incorporating an attention mechanism, aiming to manage
the power system by adjusting the topology of the grid. The method improves the robustness
and computational efficiency of the model by assigning different feature weights so
that the attention mechanism allows the neural network to focus on the input features
that are most relevant to the current target task. Literature [54] proposes a DRL-based approach to achieve stable autonomous control of power systems
through autonomous topology optimization control using the SAC algorithm, while introducing
an imitation learning (IL)-based pre-training scheme to cope with the huge action
space in topology switching and the vulnerability of DRL agents in power systems.
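To give a sense of why the topology action space is so large, the sketch below enumerates the busbar assignments available at a single substation: with n connectable elements and two busbars there are 2^n candidate configurations per substation. The substation layout is an assumption of the sketch, not taken from the cited benchmarks.

```python
# Enumerating discrete topology actions for one substation (illustrative layout).
from itertools import product

SUBSTATION_ELEMENTS = {"sub_3": ["line_1", "line_4", "load_2", "gen_1"]}

def enumerate_bus_assignments(substation: str):
    """Yield every assignment of the substation's elements to busbar 1 or 2."""
    elements = SUBSTATION_ELEMENTS[substation]
    for assignment in product((1, 2), repeat=len(elements)):
        yield dict(zip(elements, assignment))

actions = list(enumerate_bus_assignments("sub_3"))   # 2^4 = 16 candidate configurations
```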
In summary, recent studies have demonstrated the great potential of DRL in power system
topology control. Aiming at the complex and huge topology space problem, which is
difficult to cope with by traditional methods, these emerging methods seek the optimal
policy through autonomous decision-making and dynamic adjustment, which not only improves
the stability and reliability of the power system but also enhances its ability to
cope with unexpected situations.
3.3 Emergency Load Shedding
Emergency load shedding is a key measure for maintaining power system stability during
faults, overloads, or unexpected events. It reduces grid stress and protects equipment,
but the challenge lies in making quick decisions, assessing load priorities, coordinating
equipment communication, and managing diverse load characteristics for efficient shedding.
As shown in Fig. 7, the emergency load shedding problem in a medium-sized power system is illustrated,
where Bus 4, Bus 7, and Bus 18 are heavily loaded areas. During an overload event,
shedding loads in these areas helps alleviate system stress. Fig. 8 demonstrates that during an emergency load shedding task, voltage initially drops
due to the fault but gradually recovers as the load is reduced. The curve represents
the voltage recovery standard, indicating that voltage should remain above the curve
throughout the recovery process. To optimize load-shedding decisions, Fig. 9 presents a flowchart of using DRL to search for an effective shedding policy.
In [55], an accelerated DRL algorithm named “PARS” was developed to address the low computational efficiency and poor scalability of existing load shedding methods in power system voltage stability control, and to improve power system stability under uncertainty and rapidly changing operating conditions through efficient and fast adaptive control. Literature [56] proposed a contingency control scheme for undervoltage load shedding (UVLS) based on the GraphSAGE-DDDQN method, aiming to solve the insufficient adaptability and generalization ability of existing UVLS techniques in coping with topology changes in power networks, so as to improve the reliability and economy of the control policy. Literature [57] proposes a load shedding control policy based on DDPG deep reinforcement learning to address the challenge of realizing autonomous voltage control in the event of power system faults, which effectively improves the stable operation of the power system by constructing a network training dataset and establishing a reward function that conforms to the operational characteristics of the power grid. Literature [58] proposes a load shedding policy based on a deep Q-network (DQN-LS) and a convolutional long short-term memory network (ConvLSTM), aiming to improve the stability and voltage restoration capability of large-scale power systems in dynamic load shedding problems through real-time, fast, and accurate load shedding decisions, especially under diverse and uncertain power system fault conditions. Literature [59] proposes a data-driven, DRL-based emergency load curtailment approach that transforms the load curtailment policy into an MDP and optimizes it with a proximal policy optimization (PPO) algorithm, addressing the model complexity and matching risk faced by traditional event-driven load curtailment policies in renewable energy systems, thus improving the system's adaptability and efficiency under multi-fault scenarios. Literature [60] proposes a knowledge-enhanced DDQN DRL approach for intelligent event-driven load shedding (ELS), which overcomes the deficiencies of traditional methods in efficiency and timeliness by building an MDP based on transient stability simulation and incorporating knowledge that removes repetitive and negative actions, thereby improving training efficiency and the quality of decision-making for the effective formulation of load shedding measures.
Fig. 7. Power system emergency load shedding of [Fig. 13, 14].
Fig. 8. Power system emergency load shedding voltage recovery curve of [Fig. 4, 14].
Fig. 9. DRL-based methods for emergency load shedding in power systems [Fig. 3, 61].
In summary, recent studies have demonstrated the potential of multiple DRL-based load
curtailment control methods in dealing with power system contingencies. With innovative
algorithm design and flexible decision-making mechanisms, these methods effectively
enhance the stability and responsiveness of the power system in the face of faults
and uncertainties, while improving the accuracy of load shedding.
4. Challenges and Future Directions
Although DRL has achieved many successes in enhancing power system stability through
three main means, namely energy dispatch, topology control, and emergency load shedding,
it still faces several challenges, including the fact that current research is mainly
focused on single-task optimization and lacks consideration of multi-task coordination,
the issue of uncertainty and diversity integration of renewable energy sources, and
how to achieve a reliable deployment of DRL while safeguarding the security of the
power grid. In addition, the difference between the simulation environment and the
real grid operation (Sim2Real) limits the wide application of DRL in real power systems.
Therefore, in this section, we systematically discuss these key challenges and explore
feasible solutions to further promote the research and development of DRL in power
systems.
4.1 Multi-task Coordination
In current studies, energy dispatch, topology control and emergency load shedding
are usually modeled and optimized separately as individual tasks. However, in the
operation of real power systems, these tasks are highly interdependent and the interactions
may lead to local optimization problems rather than global optimization. Traditional
methods often rely on heuristic search or staged optimization, which lacks a global
perspective and leads to difficulties in achieving efficient synergy between energy
scheduling, topology control, and emergency load shedding. At the same time, most
of the existing DRL studies use single-task learning, ignoring the intrinsic connection
between these three, resulting in policies that are difficult to generalize to multi-task
scenarios. Existing reinforcement learning methods, such as DQN and PPO, are mainly
aimed at single-objective optimization, which makes it difficult to take into account
the coordination and optimization of multiple control means in a complex power system
environment. Therefore, future studies should explore the construction of a unified
multi-task coordination framework, as shown in Fig. 10, in which DRL can dynamically switch between different tasks and perform comprehensive
optimization under a global perspective, thereby enhancing the security and efficiency
of the entire power grid. Multi-task learning (MTL) can be used to develop DRL models
capable of simultaneously optimizing energy dispatch, topology control and emergency
load shedding to leverage information sharing across tasks. Meanwhile, hierarchical
reinforcement learning (HRL) can be used to improve the overall flexibility and adaptability
of decision-making by constructing high-level and low-level policies, which enable
high-level decision-making to intelligently select appropriate control methods, while
low-level policies are responsible for the specific execution of the corresponding
optimization tasks. In addition, multi-objective reinforcement learning (MORL) approaches can adaptively adjust the priorities of the three control means by introducing weighting factors to achieve integrated scheduling optimization, thus avoiding local optima and promoting global stability. In terms of constructing global features,
it can be combined with graph neural network (GNN), self-attention and other methods,
so that the DRL can learn the dynamic changes of the grid structure, and combine them
with the optimization policies, such as energy storage management, to improve the
system responsiveness when dealing with unexpected events. This multi-task coordination
approach not only enhances the flexibility of the grid but also improves the robustness
and reliability of decision-making in the face of uncertainty challenges, thus promoting
the development of intelligent regulation of power systems in the direction of greater
efficiency and security.
Fig. 10. DRL-based multi-task coordination framework.
4.2 Renewable Energy Integration
As renewable energy sources such as wind and solar are increasingly integrated into
modern power systems, managing their variability and uncertainty has become a key
challenge in DRL studies. As shown in Fig. 11, current research has focused on optimizing renewable energy consumption in specific
scenarios, while future research should focus on how to effectively integrate multiple
renewable energy sources and optimize their dynamic allocation in the grid. Different
types of renewable energy sources (e.g., wind, photovoltaic, hydropower, etc.) have
significant differences in output patterns, temporal characteristics, and spatial
distributions, and the standard DRL methods tend to assume that the input characteristics
of the power system are fixed while ignoring the system dynamics brought about by
the changes in the proportion of renewable energy sources, which makes existing methods lack adaptability and generalization when facing the multi-source integration problem. Therefore, future research should consider how to enable DRL to
quickly adapt to different renewable energy environments and improve its ability to
cope with grid volatility. For example, adopting a DRL method based on Meta-Learning
(ML) can enable the agent to learn and adapt itself quickly in different types of
renewable energy environments and improve its ability to cope with changes in system
dynamics. In addition, the introduction of uncertainty modeling techniques, such as Bayesian DRL or probabilistic graphical models (PGMs), can enable DRL to deal with the uncertainty of renewable energy sources more efficiently, thus improving the robustness of scheduling decisions. There have been studies attempting to optimize the co-dispatch of wind and PV systems using DRL in combination with energy storage management, but challenges in generalization ability and adaptability remain. Therefore, future development should focus on
how to improve the stability of DRL under uncertain environments so that it can effectively
integrate different renewable energy resources under dynamic grid conditions, and
ultimately improve the overall stability and operational efficiency of the grid.
Fig. 11. Challenges of DRL in integrating multiple renewable energies.
4.3 Safety Constraints and Sim2Real
The issue of security constraints is crucial in the practical application of DRL in
power systems, but existing DRL methods may explore infeasible or even dangerous decisions
during the training process, such as over-adjusting the topology, which instead increases
the vulnerability of the grid, or adopting extreme load shedding policies, which leads to power flow imbalances or even violations of grid security standards. Therefore,
as shown in Fig. 12, future research needs to explore how to incorporate domain knowledge of the power
system into the DRL decision framework and introduce physical security constraints
during the training process to ensure that the policies learned by the model are always
in line with grid security requirements. For example, constrained DRL or barrier function
methods can be used to enable DRL to adaptively avoid security risks during the learning
process and to ensure that its control policy does not destabilize the power grid.
In addition, the application of DRL in power systems faces the Sim2Real (from simulation
to reality) problem, which means that the difference between the simulation environment
and the actual grid operation may cause the trained model to fail in the real environment.
Existing research mainly relies on the simulation environment for DRL training, but
the deviation between the simulation model and the physical characteristics of the
real grid leads to the fact that the DRL policy may perform well in the simulation
environment but has insufficient generalization ability to cope with the complex operating
conditions of the real grid when deployed in practice. Therefore, future research
should focus on exploring how to narrow this gap, such as using methods such as transfer
learning and imitation learning to make DRL's policy transfer between different environments
more stable or adopting model-free online DRL, which enables the agent to continuously
learn and optimize the control policy directly in the real grid, thus improving its
practical adaptability. In addition, the interpretability problem of DRL remains a
key challenge limiting its application in grid control, as its decision logic is often
hard to understand, making power engineers hesitant to trust its policies, thus affecting
deployment in safety-critical systems. To address this, approaches like attention
mechanisms or causal inference can enhance interpretability, making decision-making
more transparent while providing a visualized basis, thereby improving the trustworthiness
and usefulness of DRLs in power systems.
Fig. 12. Safety learning and Sim2Real gap minimization.
5. Conclusion
Amid global decarbonization efforts, modern power systems are becoming increasingly
complex due to the large-scale integration of renewable energy, posing significant
challenges to system stability and operational efficiency. DRL has emerged as a promising
solution to address these challenges, offering adaptive learning and decision-making
capabilities that surpass traditional optimization methods in high-dimensional and
dynamic environments. This paper provides a systematic overview of DRL applications
in power systems, with a particular focus on its optimization strategies for energy
dispatch, topology control, and emergency load shedding.
Our findings highlight the significant advancements of DRL in optimizing these control
measures, demonstrating its potential to enhance power system stability, flexibility,
and resilience. However, key challenges remain, including multi-task coordination,
renewable energy integration, safety constraints, and Sim2Real transferability. Addressing
these challenges will ensure the practical deployment and effectiveness of DRL-based
solutions in real-world power systems.
As power grids continue to evolve, the insights provided in this paper establish a
foundation for future research, guiding the development of more robust, efficient,
and safe DRL frameworks. Advancing these research directions will not only drive innovation
in power system control but also play a crucial role in supporting the transition
toward a more sustainable and intelligent energy infrastructure.
Acknowledgements
This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in
part by the Institute of Information & communications Technology Planning and Evaluation
(IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).
References
R. Detchon and R. Van Leeuwen, “Policy: Bring sustainable energy to the developing
world,” Nature, vol. 508, no. 7496, pp. 309–311, 2014. DOI:10.1038/508309a

H. Hu, N. Xie, D. Fang and X. Zhang, “The role of renewable energy consumption and
commercial services trade in carbon dioxide reduction: Evidence from 25 developing
countries,” Applied energy, vol. 211, pp. 1229–1244, 2018. DOI:10.1016/j.apenergy.2017.12.019

J. Wu, J. Yan, H. Jia, N. Hatziargyriou, N. Djilali and H. Sun, “Integrated energy
systems,” Applied energy, vol. 167, pp. 155–157, 2016. DOI:10.1016/j.apenergy.2016.02.075

B. Kroposki, “Integrating high levels of variable renewable energy into electric power
systems,” Journal of Modern Power Systems and Clean Energy, vol. 5, no. 6, pp. 831–837,
2017. DOI:10.1007/s40565-017-0339-3

M. L. Tuballa and M. L. Abundo, “A review of the development of smart grid technologies,”
Renewable and Sustainable Energy Reviews, vol. 59, pp. 710–725, 2016. DOI:10.1016/j.rser.2016.01.011

J. Keirstead, M. Jennings and A. Sivakumar, “A review of urban energy system models:
Approaches, challenges and opportunities,” Renewable and Sustainable Energy Reviews,
vol. 16, no. 6, pp. 3847–3866, 2012. DOI:10.1016/j.rser.2012.02.047

M. F. Zia, E. Elbouchikhi and M. Benbouzid, “Microgrids energy management systems:
A critical review on methods, solutions, and prospects,” Applied energy, vol. 222,
pp. 1033–1055, 2018. DOI:10.1016/j.apenergy.2018.04.103

S. Impram, S. V. Nese and B. Oral, “Challenges of renewable energy penetration on
power system flexibility: A survey,” Energy Strategy Reviews, vol. 31, no. 100539,
pp. 1-12, 2020. DOI:10.1016/j.esr.2020.100539

D. Liu, Q. Yang, Y. Chen, X. Chen and J. Wen, “Optimal parameters and placement of
hybrid energy storage systems for frequency stability improvement,” Protection and
Control of Modern Power Systems, vol. 10, no. 2, pp. 40–53, 2025. DOI:10.23919/PCMP.2023.000259

K. Liu, Z. Chen, X. Li and Y. Gao, “Analysis and control parameters optimization of
wind turbines participating in power system primary frequency regulation with the
consideration of secondary frequency drop,” Energies, vol. 18, no. 6, pp. 1–19, 2025.
DOI:10.3390/en18061317

M. Dahane, A. Benali, H. Tedjini, A. Benhammou, M. A. Hartani and H. Rezk, “Optimized double-stage fractional order controllers for DFIG-based wind energy systems: A comparative study,” Results in Engineering, vol. 25, no. 104584, pp. 1-17, 2025. DOI:10.1016/j.rineng.2025.104584

L. Cheng and T. Yu, “A new generation of AI: A review and perspective on machine learning technologies applied to smart energy and electric power systems,” International Journal of Energy Research, vol. 43, no. 6, pp. 1928–1973, 2019. DOI:10.1002/er.4333

M. M. Gajjala and A. Ahmad, “A survey on recent advances in transmission congestion
management,” International Review of Applied Sciences and Engineering, vol. 13, no.
1, pp. 29–41, 2021. DOI:10.1556/1848.2021.00286

H. Zhang, X. Sun, M. H. Lee and J. Moon, “Deep reinforcement learning based active
network management and emergency load-shedding control for power systems,” IEEE Transactions
on Smart Grid, vol. 15, no. 2, pp. 1423-1437, 2023. DOI:10.1109/TSG.2023.3302846

S. M. Mohseni-Bonab, I. Kamwa, A. Rabiee and C. Chung, “Stochastic optimal transmission switching: A novel approach to enhance power grid security margins through vulnerability mitigation under renewables uncertainties,” Applied Energy, vol. 305, no. 117851, pp. 1-14, 2022. DOI:10.1016/j.apenergy.2021.117851

D. Michaelson, H. Mahmood and J. Jiang, “A predictive energy management system using
pre-emptive load shedding for islanded photovoltaic microgrids,” IEEE Transactions
on Industrial Electronics, vol. 64, no. 7, pp. 5440–5448, 2017. DOI:10.1109/TIE.2017.2677317

R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction,” A Bradford Book, 2018. DOI:10.1017/S0263574799271172

D. Cao, W. Hu, J. Zhao, G. Zhang, B. Zhang, Z. Liu, Z. Chen and F. Blaabjerg, “Reinforcement learning and its applications in modern power and energy systems: A review,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029–1042, 2020. DOI:10.35833/MPCE.2020.000552

E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu and J. G. Slootweg, “On-line building energy optimization using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698–3708, 2018. DOI:10.1109/TSG.2018.2834219

Y. Zhang, X. Wang, J. Wang and Y. Zhang, “Deep reinforcement learning based volt-var
optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol.
12, no. 1, pp. 361–371, 2020. DOI:10.1109/TSG.2020.3010130

Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems:
A deep reinforcement learning method with continuous action search,” IEEE Transactions
on Power Systems, vol. 34, no. 2, pp. 1653–1656, 2018. DOI:10.1109/TPWRS.2018.2881359

Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan and Z. Huang, “Adaptive power system emergency
control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol.
11, no. 2, pp. 1171–1182, 2019. DOI:10.1109/TSG.2019.2933191

Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications:
An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225,
2019. DOI:10.17775/CSEEJPES.2019.00920

Q. Li, T. Lin, Q. Yu, H. Du, J. Li and X. Fu, “Review of deep reinforcement learning
and its application in modern renewable power system control,” Energies, vol. 16,
no. 10, pp. 1–23, 2023. DOI:10.3390/en16104143

J. N. Tsitsiklis, “Asynchronous stochastic approximation and Q-learning,” Machine learning, vol. 16, pp. 185–202, 1994. DOI:10.1007/BF00993306

A. Agarwal, S. M. Kakade, J. D. Lee and G. Mahajan, “Optimality and approximation with policy gradient methods in Markov decision processes,” in Conference on Learning Theory. PMLR, vol. 125, pp. 64–66, 2020. https://proceedings.mlr.press/v125/agarwal20a.html

H. Wang and B. Raj, “On the origin of deep learning,” arXiv preprint arXiv:1702.07800,
2017. DOI:10.48550/arXiv.1702.07800

Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015. DOI:10.1038/nature14539

R. Sun, “Optimization for deep learning: theory and algorithms,” arXiv preprint arXiv:1912.08957,
2019. DOI:10.48550/arXiv.1912.08957

J. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function approximation,” Advances in neural information processing systems, vol. 9, pp. 1-7, 1996. https://proceedings.neurips.cc/paper_files/paper/1996/file/e00406144c1e7e35240afed70f34166a-Paper.pdf

C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, pp. 279–292, 1992.
DOI:10.1007/BF00992698

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236

J. Fan, Z. Wang, Y. Xie and Z. Yang, “A theoretical analysis of deep Q-learning,” in Learning for dynamics and control. PMLR, vol. 120, pp. 486–489, 2020. https://proceedings.mlr.press/v120/yang20a.html

H. Van Hasselt, A. Guez and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, pp. 2094-2100, 2016. DOI:10.1609/aaai.v30i1.10295

Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning. PMLR, vol. 48, pp. 1995–2003, 2016. https://proceedings.mlr.press/v48/wangf16.html

D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, “Deterministic
policy gradient algorithms,” in International conference on machine learning. PMLR,
vol. 32, no. 1, pp. 387–395, 2014. https://proceedings.mlr.press/v32/silver14.html

R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, “Policy gradient methods for
reinforcement learning with function approximation,” Advances in neural information
processing systems, vol. 12, pp. 1-7, 1999. https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

T. Degris, M. White and R. S. Sutton, “Off-policy actor-critic,” arXiv preprint arXiv:1205.4839,
2012. DOI:10.48550/arXiv.1205.4839

S. Li, S. Bing and S. Yang, “Distributional advantage actor-critic,” arXiv preprint
arXiv:1806.06914, 2018. DOI:10.48550/arXiv.1806.06914

V. Mnih, “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783,
2016. DOI:10.48550/arXiv.1602.01783

T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu,
A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv
preprint arXiv:1812.05905, 2018. DOI:10.48550/arXiv.1812.05905

J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization
algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347

S. Pateria, B. Subagdja, A.-h. Tan and C. Quek, “Hierarchical reinforcement learning:
A comprehensive survey,” ACM Computing Surveys (CSUR), vol. 54, no. 5, pp. 1–35, 2021.
DOI:10.1145/3453160

S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang and A. Knoll, “A review of safe
reinforcement learning: Methods, theories and applications,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11216–11235, 2024.
DOI:10.1109/TPAMI.2024.3457538

F. Meng, Y. Bai and J. Jin, “An advanced real-time dispatching strategy for a distributed energy system based on the reinforcement learning algorithm,” Renewable Energy, vol. 178, pp. 13–24, 2021. DOI:10.1016/j.renene.2021.06.032

T. Yang, L. Zhao, W. Li and A. Y. Zomaya, “Dynamic energy dispatch strategy for integrated
energy system based on improved deep reinforcement learning,” Energy, vol. 235, no.
121377, pp. 1-15, 2021. DOI:10.1016/j.energy.2021.121377

A. S. Ebrie and Y. J. Kim, “Reinforcement learning-based optimization for power scheduling
in a renewable energy connected grid,” Renewable Energy, vol. 230, no. 120886, pp.
1-27, 2024. DOI:10.1016/j.renene.2024.120886

X. Han, C. Mu, J. Yan and Z. Niu, “An autonomous control technology based on deep
reinforcement learning for optimal active power dispatch,” International Journal of
Electrical Power & Energy Systems, vol. 145, no. 108686, pp. 1-10, 2023. DOI:10.1016/j.ijepes.2022.108686

X. Zhou, J. Wang, X. Wang and S. Chen, “Optimal dispatch of integrated energy system
based on deep reinforcement learning,” Energy Reports, vol. 9, pp. 373–378, 2023.
DOI:10.1016/j.egyr.2023.09.157

I. Damjanović, I. Pavić, M. Puljiz and M. Brcic, “Deep reinforcement learning-based
approach for autonomous power flow control using only topology changes,” Energies,
vol. 15, no. 19, pp. 1-16, 2022. DOI:10.3390/en15196920

M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid
topology reconfiguration using a simple deep reinforcement learning approach,” in
2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879

Z. Yang, Z. Qiu, Y. Wang, C. Yan, X. Yang and G. Deconinck, “Power grid topology regulation
method based on hierarchical reinforcement learning,” in 2024 Second International
Conference on Cyber-Energy Systems and Intelligent Energy (ICCSIE), pp. 1–6, 2024.
DOI:10.1109/ICCSIE61360.2024.10698617

Z. Qiu, Y. Zhao, W. Shi, F. Su and Z. Zhu, “Distribution network topology control
using attention mechanism-based deep reinforcement learning,” in 2022 4th International
Conference on Electrical Engineering and Control Technologies (CEECT), pp. 55–60,
2022. DOI:10.1109/CEECT55960.2022.10030642

X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous
control approach for power system topology optimization,” in 2022 41st Chinese Control
Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073

R. Huang, Y. Chen, T. Yin, X. Li, A. Li, J. Tan, W. Yu, Y. Liu and Q. Huang, “Accelerated
deep reinforcement learning based load shedding for emergency voltage control,” arXiv
preprint arXiv:2006.12667, 2020. DOI:10.48550/arXiv.2006.12667

Y. Pei, J. Yang, J. Wang, P. Xu, T. Zhou and F. Wu, “An emergency control strategy
for undervoltage load shedding of power system: A graph deep reinforcement learning
method,” IET Generation, Transmission & Distribution, vol. 17, no. 9, pp. 2130–2141,
2023. DOI:10.1049/gtd2.12795

J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency
state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems,
vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120

J. Zhang, Y. Luo, B. Wang, C. Lu, J. Si and J. Song, “Deep reinforcement learning
for load shedding against short-term voltage instability in large power systems,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 4249–4260,
2021. DOI:10.1109/TNNLS.2021.3121757

H. Chen, J. Zhuang, G. Zhou, Y. Wang, Z. Sun and Y. Levron, “Emergency load shedding
strategy for high renewable energy penetrated power systems based on deep reinforcement
learning,” Energy Reports, vol. 9, pp. 434–443, 2023. DOI:10.1016/j.egyr.2023.03.027

Z. Hu, Z. Shi, L. Zeng, W. Yao, Y. Tang and J. Wen, “Knowledge-enhanced deep reinforcement
learning for intelligent event-based load shedding,” International Journal of Electrical
Power & Energy Systems, vol. 148, no. 108978, pp. 1-11, 2023. DOI:10.1016/j.ijepes.2023.108978

Y. Zhang, M. Yue and J. Wang, “Adaptive load shedding for grid emergency control via deep reinforcement learning,” in 2021 IEEE Power & Energy Society General Meeting (PESGM), pp. 1-5, 2021. DOI:10.1109/PESGM46819.2021.9638058

Biographies
Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University
of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea,
in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang
University, Seoul, South Korea. His research interests include optimal control, smart
grid, deep reinforcement learning, and their applications.
Chen Wang received the B.S. degree in electronics and computer engineering and the M.S. degree in electronic computer engineering from Chonnam National University, South Korea, in 2020 and 2022, respectively. He is currently pursuing the Ph.D. degree in electrical engineering
at Hanyang University, Seoul, South Korea. His research interests include smart grid,
deep reinforcement learning, and their applications.
Minju Lee received the B.S. degree in climate and energy systems engineering from Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing a graduate degree with the Department of Climate and Energy Systems Engineering. Her research
interests include short-term wind power forecasting and the probabilistic estimation
of transmission congestion for grid integration.
Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook
National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical
engineering from the Ulsan National Institute of Science and Technology, Ulsan, South
Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the
Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul,
South Korea. He is currently an Assistant Professor with the Department of Electrical
Engineering, Incheon National University, Incheon, South Korea. His research interests
include decentralized optimal control, mean field games, deep reinforcement learning,
and their applications.
Jun Moon is currently an Associate Professor in the Department of Electrical Engineering
at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical
and computer engineering, and the M.S. degree in electrical engineering from Hanyang
University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D.
degree in electrical and computer engineering from University of Illinois at Urbana-Champaign,
USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development
(ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and
Computer Engineering, Ulsan National Institute of Science and Technology (UNIST),
South Korea, as an assistant professor. From 2019 to 2020, he was with the School
of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate
professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research
interests include stochastic optimal control and filtering, reinforcement learning,
data-driven control, distributed control, networked control systems, and mean field
games.