Haotian Zhang (장호천)1, Chen Wang (왕천)1, Minju Lee (이민주)2, Myoung Hoon Lee (이명훈)††, Jun Moon (문준)†
               
- 1 Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea
- 2 KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea
                        
 
            
            
            
            
            
            
            
               
                  
Key words: Deep reinforcement learning, energy dispatch, topology control, emergency load shedding
             
            
          
         
            
                  1. Introduction       	
With the large-scale adoption of renewable energy, the global energy structure is undergoing profound changes. Fossil energy sources are gradually being depleted, while advances in clean energy technologies have made renewable sources such as solar and wind progressively more economical than traditional energy sources. Innovative models such as microgrids and decentralized energy systems have facilitated renewable energy penetration, accelerated the energy transition [1], and promoted low-carbon development [2]. Modern energy systems are evolving toward multi-energy complementarity and smart, low-carbon operation, with the deep coupling of electricity, natural gas, hydrogen, and other energy carriers, together with enhanced energy storage and demand-side flexibility [3], driving the reconfiguration of traditional energy systems. However, this transformation also brings great challenges, especially the volatility and instability of renewable energy. With the large-scale integration of variable renewable energy (VRE) sources such as wind and solar, the power system faces greater challenges in balancing supply and demand [4], especially when the traditional grid struggles to adapt to growing electricity demand and the increased penetration of renewable energy, which exacerbates the risk of supply-demand imbalance [5]. In addition, the complexity of urban energy systems places higher demands on energy efficiency and policy assessment; current modeling of urban energy systems still faces challenges in technical design, building design, urban climate, system design, and policy assessment, such as model complexity, data quality, and uncertainty [6].
               
In this context, ensuring the security, stability, and efficiency of the power system has become an urgent problem. Energy dispatch, topology control, and emergency load shedding are the three core tasks for keeping the power grid operating stably. However, traditional model-based control methods for these three tasks show clear limitations when dealing with complex dynamic changes [7]. This is because power systems increasingly exhibit nonlinearities, uncertainties, and stochasticity, which make it difficult for physical models to capture actual operating conditions. Moreover, under rapidly changing power demand and energy supply conditions, traditional methods may respond too slowly to achieve optimal control, which affects the overall efficiency and reliability of the power grid [8]. For example, in [9], the parameter configuration and siting of a hybrid energy storage system are optimized by constructing a simplified frequency response (SFR) model and combining it with an explicit gradient calculation method. In [10], a modified SFR model is constructed and the inertia control and droop control parameters of the wind turbine are optimized to enhance primary frequency regulation performance. In [11], a two-stage fractional-order PID-fractional-order PI (FOPID-FOPI) controller is proposed for direct power control of DFIG-based wind energy systems. All of [9-11] achieve stable grid operation through individual scheduling and direct control, but these methods still rely on physical modeling. In the face of these limitations, there is an urgent need for more flexible and efficient control and optimization tools to achieve stable power system operation [12] and to meet the challenges of the three core tasks of energy dispatch, topology control, and emergency load shedding [13,14].
               
Among these three tasks, energy dispatch requires precise generation planning to adapt to fluctuating electricity demand and an unstable renewable energy supply; topology control optimizes the grid structure and dynamically adjusts power flow to prevent transmission line overloads and improve system security [15]; and emergency load shedding serves as the last line of defense for grid security by intelligently and selectively disconnecting some loads in extreme situations to prevent cascading failures from triggering large-scale blackouts [16]. These tasks are crucial for power system stability and security, but traditional methods struggle to cope with the high dimensionality, dynamic changes, and complex coupling relationships involved. Therefore, there is an urgent need for intelligent optimization methods that provide more flexible, real-time solutions under uncertainty, ensuring the safe and efficient operation of power grids under complex conditions.
               
Deep reinforcement learning (DRL) [17] has emerged as a key research focus for addressing power system challenges. DRL algorithms can deal with complex dynamic environments in the power system and overcome the limitations of traditional physical models in nonlinear and stochastic problems through a data-driven approach [18,19]. It is worth noting that [18] mainly reviewed the application of DRL in power system optimization and control, emphasized the advantages of DRL over traditional physical model-based methods in dealing with complexity and uncertainty, and summarized the research progress of DRL in fields such as smart grids, demand-side management, and power markets. In contrast, this paper focuses on the application of DRL in power dispatch, topology switching, and emergency load shedding, concentrating on its optimization policies and technological breakthroughs in high-dimensional dynamic environments and analyzing how it can enhance the flexibility and real-time decision-making capability of power grids. In addition, DRL not only learns scheduling and control policies automatically but also responds to system changes in real time, significantly improving the operational efficiency and responsiveness of the power system [20-22]. Specifically, [20] introduces a multi-agent DRL-based volt-VAR optimization (VVO) algorithm that optimizes scheduling and control policies, improving operational efficiency and responsiveness. [21] presents a data-driven, model-free DRL-based load frequency control (LFC) method, achieving faster response, stronger adaptability, and enhanced frequency regulation under renewable energy uncertainties. [22] develops adaptive DRL-based emergency control schemes, reinforcing grid security and resilience through robust policies for generator dynamic braking and under-voltage load shedding. Moreover, optimizing energy dispatch, topology control, and emergency load shedding with DRL effectively balances supply-demand fluctuations while flexibly managing grid dynamics to ensure stable system operation [23].
               
               
                     1.1 Main Contributions
The related review papers [18,23] and [24] all provide a systematic overview of DRL applications in power systems, covering its basic principles, algorithm classification, and research progress in the fields of power dispatch, demand response, power markets, and operation control. Compared with this paper, [18,23] and [24] focus more on the basic theory of DRL, its algorithmic development, and its overall application prospects.
                  
Different from existing review studies, this paper systematically surveys the current state of DRL research in power systems and its applications, and thoroughly discusses the three key control means for stable power system operation: energy dispatch, topology control, and emergency load shedding. By comprehensively analyzing existing research results and the technological evolution, this paper summarizes the advantages of DRL in power system optimization, reveals the limitations and challenges of current research, and further proposes possible future research directions and improvement strategies to promote the practical application and development of DRL in smart grids. The specific contributions are as follows:
                  
                  1. Relevant methods of DRL and their applications in power systems are systematically
                     summarized, covering three control tools for the stable operation of power systems:
                     energy dispatch, topology control, and emergency load shedding, providing researchers
                     with a clear overview of the current state of research.
                  
2. Future research directions in the areas of energy dispatch, topology control, and emergency load shedding are explored, with a focus on key challenges such as multi-task coordination, renewable energy integration, safety constraints, and Sim2Real transfer. This paper proposes enhancing task optimization through multi-task reinforcement learning and hierarchical reinforcement learning, improving DRL adaptation to renewable energy uncertainty by combining meta-learning and probabilistic modeling, and ensuring grid security by using constraint-based reinforcement learning. Meanwhile, transfer learning and online model-free DRL are emphasized to bridge the gap between simulation and the actual grid and to promote the secure deployment and application of DRL.
                  
                  In the remainder of this paper, Section 2 will provide a comprehensive overview of
                     the fundamentals of DRL and advanced techniques. Section 3 will describe the applications
                     of DRL to three power system problems: energy dispatch, topology control, and emergency
                     load shedding. Section 4 will discuss several potential future research directions.
                     Finally, we conclude the paper in Section 5.
                  
                
             
            
                  2. Review of Deep Reinforcement Learning 	
               This section establishes the basic formulation of reinforcement learning (RL) problems
                  and introduces key concepts such as the Q-function and the Bellman equation. These
                  foundational concepts provide support for understanding subsequent algorithms. Next,
                  we discuss classical RL algorithms, categorizing them into value-based and policy-based
                  methods. Finally, we will introduce several advanced RL techniques, including DRL,
                  Deterministic Policy Gradient, Actor-Critic methods, hierarchical RL, and Safe RL.
               
               
                     2.1 Reinforcement Learning
                  RL is a branch of machine learning that focuses on how an agent can make sequential
                     decisions in uncertain environments to maximize cumulative rewards. Mathematically,
                     the decision-making problem can be modeled as a Markov decision process (MDP), which
consists of a state space $S$, an action space $A$, a transition probability function $P(\bullet | s,\: a)$ that maps each state-action pair $(s,\: a)\in S\times A$ to a probability distribution over the state space, and a reward function $r(s,\: a): S\times A → R$.
                  
                  In the MDP setting, the environment begins from an initial state $s_{0}\in S$. At
each time step $t\in\{0,\: 1,\: ...\}$, given the current state $s_{t}\in S$, the agent selects an action $a_{t}\in A$ and, based on the current state-action pair $(s_{t},\: a_{t})$, receives a corresponding reward $r(s_{t},\: a_{t})$. Subsequently, the next state $s_{t+1}$ is randomly generated according to the transition probability $P(s_{t+1}| s_{t},\: a_{t})$. The agent's policy $\pi(a | s)$ is a mapping from a state $s$ to a distribution over the action space $A$, which specifies the actions that should be taken in a given state $s$. The goal of the agent is to find an optimal policy $\pi^{*}$, though this policy may not be unique. The MDP process is illustrated
                     in Fig. 1.
                  
                  
                        
                        
Fig. 1. Decision process of Markov decision process.
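For illustration, the following minimal Python sketch steps an agent through the interaction loop above using a Gymnasium-style interface; the environment name and the random policy are placeholders rather than a power-system model.

```python
# A minimal sketch of the agent-environment loop above, using a Gymnasium-style
# interface. "CartPole-v1" stands in for a power-system environment; the random
# policy is purely illustrative.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset()                    # initial state s_0
gamma, discount, ret = 0.99, 1.0, 0.0     # discount factor and running return R_0

for t in range(200):
    action = env.action_space.sample()    # a_t ~ pi(a | s_t)
    state, reward, terminated, truncated, _ = env.step(action)  # s_{t+1}, r(s_t, a_t)
    ret += discount * reward              # accumulate the discounted return
    discount *= gamma
    if terminated or truncated:
        break
```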
                      
                  
                        2.1.1 Value Function and Optimal Policy
                     To maximize long-term cumulative rewards after the current time $t$, the return $R_{t}$
over a finite time horizon $T$ can be expressed as:

$$R_{t}=\sum_{k=0}^{T}\gamma^{k}\, r(s_{t+k},\: a_{t+k})$$

where the discount factor $\gamma\in[0,\: 1]$ is a parameter used to discount future rewards.
                     
To find the optimal policy, some algorithms rely on the value function $V_{\pi}(s)$, which represents the expected return obtained when starting from a given state $s$ and following the agent's policy $\pi$:

$$V_{\pi}(s)=\mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s\right]$$
Similarly, the action-value function $Q$ represents the value of taking action $a$ in state $s$ and thereafter following policy $\pi$:

$$Q_{\pi}(s,\: a)=\mathbb{E}_{\pi}\left[R_{t}\mid s_{t}=s,\: a_{t}=a\right]$$
                     The value function $V$ and action-value function $Q$ can be expressed iteratively
through the Bellman equation:

$$V_{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\left[r(s,\: a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,\: a)}\left[V_{\pi}(s')\right]\right]$$

$$Q_{\pi}(s,\: a)=r(s,\: a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,\: a),\: a'\sim\pi(\cdot\mid s')}\left[Q_{\pi}(s',\: a')\right]$$
                     The optimal policy $\pi^{*}$ is the policy that maximizes the long-term cumulative
return, which can be expressed as:

$$\pi^{*}=\arg\max_{\pi}V_{\pi}(s),\quad\forall s\in S$$
                   
                
               
                     2.2 Deep Reinforcement Learning
                  The journey from RL to DRL has undergone a long developmental process. In classical
                     tabular RL such as Q-learning, the state and action spaces are usually small, allowing
                     the approximate value function to be represented as a Q-value table. In this case,
                     these methods are often able to find the exact optimal value function and policy [25,26]. However, in many real-world problems, the state and action spaces are large or continuous,
                     and the system dynamics are highly complex. Therefore, value-based RL struggles to
                     compute or store a huge Q-value table for all state-action pairs.
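As a point of reference for the tabular setting described above, the sketch below stores Q-values in a small table and applies the standard Q-learning update with an epsilon-greedy behaviour policy; the state/action counts and hyperparameters are illustrative.

```python
# Tabular Q-learning sketch for the small, discrete setting described above.
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))       # the Q-value table
alpha, gamma, epsilon = 0.1, 0.95, 0.1

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise act greedily w.r.t. the table."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def q_learning_update(s, a, r, s_next):
    """Q(s,a) <- Q(s,a) + alpha * (TD target - Q(s,a))."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```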
                  
                  To address this problem, researchers developed function approximation methods, using
                     parameterized function classes such as linear functions or polynomial functions to
                     approximate the Q function. For policy-based RL, finding an appropriate policy class
                     to achieve optimal control is also crucial in high-dimensional complex tasks. With
                     advances in deep learning, the use of artificial neural networks (ANN) for function
                     approximation or policy parameterization has become increasingly popular in DRL. The
                     theoretical foundations and historical evolution of deep learning [27], its breakthroughs in representation learning [28], and optimization techniques for training deep models [29] have all contributed to the widespread adoption of ANN in this field. Specifically,
                     DRL can be implemented in the following ways:
                  
                  
                  (1) Value-based methods
                  
                  
                  
In temporal difference (TD) learning [30] and Q-learning [31], a Q-network can be used to approximate the Q function. For TD learning, the update rule for the parameters $\omega$ is:

$$\omega\leftarrow\omega+\alpha\left[r_{t}+\gamma Q_{\omega}(s_{t+1},\: a_{t+1})-Q_{\omega}(s_{t},\: a_{t})\right]\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t})$$

where $\alpha$ is the learning rate and the gradient $\nabla_{\omega}Q_{\omega}(s_{t},\: a_{t})$ can be efficiently computed using backpropagation. In Q-learning, approximating the Q function with nonlinear function approximators (e.g., ANNs) often results in instability and divergence during training. To address these issues, researchers developed Deep Q-Networks (DQN) [32], which significantly improved the stability of Q-learning through two key techniques.
                  
▪ Experience replay: Instead of training on consecutive episodes, a widely used technique is to store transition experiences $(s_{t},\: a_{t},\: r_{t},\: s_{t+1})$ in a database called the replay buffer $D$. At each step, a batch of transitions is randomly sampled from the replay buffer $D$ for Q-learning updates. This method improves data efficiency by recycling past experiences and reduces the variance in learning updates. More importantly, uniformly sampling from the replay buffer breaks the temporal correlation between samples, enhancing the stability and convergence of Q-learning.
                  
▪ Target network: Another technique is to introduce a target network $Q_{\omega^{-}}(s,\: a)$, which is a clone of the Q-network $Q_{\omega}(s,\: a)$. The parameters $\omega^{-}$ of the target network remain frozen during training and are only updated periodically. Specifically, a batch of transitions $\{(s_{i},\: a_{i},\: r_{i},\: s_{i+1})\}$ is sampled from the replay buffer $D$ for training, and the Q-network is updated by minimizing the following loss:

$$L(\omega)=\mathbb{E}_{(s_{i},\: a_{i},\: r_{i},\: s_{i+1})\sim D}\left[\left(r_{i}+\gamma\max_{a'}Q_{\omega^{-}}(s_{i+1},\: a')-Q_{\omega}(s_{i},\: a_{i})\right)^{2}\right]$$

This optimization process can be seen as finding an approximate solution to the Bellman optimality equation [33]. The key is to use the target network $Q_{\omega^{-}}$ to compute the maximum action value, rather than using the Q-network $Q_{\omega}$ directly. After a fixed number of updates, the target parameters $\omega^{-}$ are replaced with the newly learned $\omega$. This technique mitigates instability and prevents short-term oscillations during training.
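A compact PyTorch sketch of how the two techniques fit together is given below; the network sizes, buffer capacity, and hyperparameters are illustrative assumptions rather than the settings used in the cited works.

```python
# Compact sketch combining experience replay and a target network.
import random
from collections import deque
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())       # clone of the Q-network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=100_000)                 # stores (s, a, r, s', done)
gamma = 0.99

def dqn_update(batch_size=64):
    batch = random.sample(replay_buffer, batch_size)  # uniform sampling breaks correlation
    s, a, r, s_next, done = map(torch.tensor, zip(*batch))
    q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # frozen target network
        target = r.float() + gamma * target_net(s_next.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def sync_target():
    """Called every fixed number of updates to refresh the target network."""
    target_net.load_state_dict(q_net.state_dict())
```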
                  
                  Moreover, several notable DQN variants can further improve performance, such as Double
                     DQN [34] and Dueling DQN [35].
                  
Double DQN addresses the overestimation bias in DQN by decoupling action selection from action evaluation: one Q-network is used to select the action and the other to evaluate its value. Dueling DQN introduces a novel network architecture that decomposes the estimation of the Q-value into two parts: one part estimates the state value function $V(s)$, and the other estimates the state-dependent action advantage function $A(s,\: a)$. Through this decomposition, the dueling network allows the agent to better assess the importance of different states during learning, even in states where the choice of action has little effect on the outcome. This architecture helps improve learning efficiency, particularly when the Q-values of different actions differ only slightly in certain states.
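The dueling decomposition can be written as a small PyTorch module, sketched below with illustrative layer sizes; the mean-advantage subtraction is the usual trick for keeping $V$ and $A$ identifiable.

```python
# Sketch of the dueling architecture: shared features feed a V(s) head and an
# A(s, a) head, which are recombined into Q(s, a).
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value_head = nn.Linear(hidden, 1)              # V(s)
        self.advantage_head = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.features(s)
        v, adv = self.value_head(h), self.advantage_head(h)
        # Subtracting the mean advantage keeps V and A identifiable.
        return v + adv - adv.mean(dim=1, keepdim=True)
```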
                  
                  
                  (2) Policy-based methods
                  
                  
                  
Due to their strong generalization capabilities, artificial neural networks are widely
                     used to parameterize control policies, especially when state and action spaces are
                     continuous. The resulting policy network $NN(a | s ;\theta)$ takes the state as input
                     and outputs the probabilities of selecting actions. In Actor-Critic methods, both
                     a Q-network $NN(s ,\:  a ;\omega)$ and a policy network $NN(a | s ;\theta)$ are typically
used, where the "actor" updates the policy parameters $\theta$ using the policy gradient, and the "critic" updates the parameters $\omega$ to estimate the Q-values. The gradient of an ANN can be efficiently computed using backpropagation [29].
                  
                  When using function approximation, theoretical analyses of value-based and policy-based
                     RL methods are relatively scarce and are usually limited to linear function approximations.
                     Furthermore, one major challenge for value-based methods when dealing with large or
                     continuous action spaces is the difficulty of executing the maximization step. For
                     instance, when using deep artificial neural networks to approximate the Q function,
                     finding the optimal action $a$ is not trivial due to the nonlinearity and complexity
                     of $Q_{\omega}(s,\: a)$.
                  
                
               
                     2.3 DRL Related Algorithms
                  In this subsection, we will explore several advanced DRL algorithms, including Deep
                     Deterministic Policy Gradient (DDPG), Actor-Critic (AC), Proximal Policy Optimization
                     (PPO), Hierarchical Reinforcement Learning (HRL), and Safe Reinforcement Learning
                     (SRL). These algorithms demonstrate exceptional performance in handling complex environments
                     and tasks, driving the application and development of DRL. The framework structure
                     of the DRL algorithm is shown in Fig. 2.
                  
                  
                        
                        
Fig. 2. Classification of Deep Reinforcement Learning Algorithms by Framework: Value-based,
                           Policy-based, and Actor-Critic-based.
                        
                      
                  
                        2.3.1 Deterministic Policy Gradient (DPG)
                     Most reinforcement learning algorithms focus on stochastic policies $a\sim\pi_{\theta}(a
                        | s)$, but deterministic policies $a =\pi_{\theta}(s)$ [36] are more suitable for many real-world control problems with continuous state and
                        action spaces. This is because, on the one hand, many existing controllers for physical
                        systems (such as PID and robust control) are deterministic, making deterministic policies
                        a better match for practical control architectures, especially in power system applications.
                        On the other hand, deterministic policies are more sample-efficient, as their policy
                        gradient only integrates over the state space, while the gradient of stochastic policies
                        integrates over both the state and action spaces. Similar to stochastic policies,
deterministic policies also have a policy gradient theorem [37], which expresses the gradient as:

$$\nabla_{\theta}J(\pi_{\theta})=\mathbb{E}_{s\sim\rho^{\pi}}\left[\nabla_{\theta}\pi_{\theta}(s)\,\nabla_{a}Q_{\pi}(s,\: a)\big|_{a=\pi_{\theta}(s)}\right]$$

where $\rho^{\pi}$ denotes the discounted state distribution induced by the policy.
                     A key issue with deterministic policies is the lack of exploration due to deterministic
                        action selection. A common approach to address this is to apply exploration noise
                        to the deterministic policy, such as adding Gaussian noise $\xi$ to the policy $a
                        =\pi_{\theta}(s)+\xi$.
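A minimal sketch of this exploration scheme is shown below, assuming a small deterministic actor network; the state/action dimensions, noise scale, and action bounds are illustrative.

```python
# Gaussian exploration around a deterministic policy a = pi_theta(s).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2), nn.Tanh())

def act(state: torch.Tensor, noise_std: float = 0.1, a_max: float = 1.0) -> torch.Tensor:
    with torch.no_grad():
        a = actor(state)                        # deterministic action pi_theta(s)
    a = a + noise_std * torch.randn_like(a)     # additive Gaussian exploration noise xi
    return a.clamp(-a_max, a_max)               # keep the action within its bounds
```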
                     
                   
                  
                        2.3.2 Actor Critic Methods
                     Actor critic algorithms [38] combine the advantages of policy gradient and value iteration. In this framework,
                        the actor and critic networks perform different functions. Specifically, the actor
                        network is responsible for policy optimization, outputting the action $a\sim\pi_{\theta}(a
                        | s)$ for a given state by directly generating actions using the parameterized policy
                        $\pi_{\theta}$. The Critic network estimates the state-action value function $Q_{\pi}(s,\:
                        a)$ or advantage function $A_{\pi}(s,\:  a)$, providing feedback to guide the actor's
                        optimization. During policy updates, the actor adjusts its parameters based on feedback
from the critic, with the update following the policy gradient theorem defined as:

$$\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\, Q_{\pi}(s,\: a)\right]$$

where $\nabla_{\theta}\log\pi_{\theta}(a | s)$ is the gradient of the log-policy, and $Q_{\pi}(s,\: a)$, estimated by the critic, guides the actor's action adjustments.
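The sketch below illustrates one actor-critic update in PyTorch, using a learned state-value critic and a TD-error advantage as the feedback signal; the network shapes, learning rates, and the TD-target construction are illustrative assumptions rather than a specific published variant.

```python
# One actor-critic update step (sketch). Assumes batched tensors:
# s, s_next: [B, 4], a: [B] (discrete actions), r, done: [B, 1].
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))   # action logits
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))  # V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next, done):
    v_s = critic(s)
    with torch.no_grad():
        td_target = r + gamma * critic(s_next) * (1.0 - done)
    advantage = (td_target - v_s).squeeze(1).detach()   # critic feedback for the actor

    # Critic: regress V(s) toward the TD target.
    critic_loss = nn.functional.mse_loss(v_s, td_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend grad_theta log pi(a|s) * advantage (via the negative loss).
    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -(log_prob * advantage).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```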
                     
                     Despite the success of Actor-Critic methods in many complex tasks, they are prone
                        to issues like high variance, slow convergence, and local optima. Therefore, various
                        variants have been developed to improve their performance, including:
                     
Advantage Actor-Critic (A2C): A2C [39] introduces the advantage function $A(s,\: a)=Q_{\pi_{\theta}}(s,\:  a)- V(s)$, where the Q-function in the policy gradient is replaced by the difference between the state-action value and the state value function $V(s)$, reducing the variance of the policy gradient and improving stability.
                     
Asynchronous Advantage Actor-Critic (A3C): A3C [40] improves sample efficiency and stability by training multiple agents in parallel. Each agent interacts with its own copy of the environment using a different exploration policy, and the agents update the shared global parameters asynchronously, enhancing convergence speed and performance.
                     
                     Soft Actor Critic (SAC): SAC [41] operates under the maximum entropy RL framework, using stochastic policies and incorporating
                        an entropy term $H(\pi_{\theta}(\bullet | s_{t}))$ in the objective function to encourage
                        exploration and improve policy robustness, while maintaining efficient learning.
                     
                   
                  
                        2.3.3 Proximal Policy Optimization (PPO)
                     PPO [42] is a policy gradient-based optimization method that balances stability and sample
                        efficiency, widely applied in tasks involving both continuous and discrete action
                        spaces. The core idea of PPO is to limit the magnitude of policy updates, preventing
                        the policy from collapsing due to large updates and ensuring a smoother training process.
                        The objective function in PPO is based on clipping, which limits the magnitude of
changes between the old and new policies. The objective function is:

$$L^{CLIP}(\theta)=\mathbb{E}_{t}\left[\min\left(r_{t}(\theta)\hat{A_{t}},\:\mathrm{clip}\left(r_{t}(\theta),\: 1-\epsilon,\: 1+\epsilon\right)\hat{A_{t}}\right)\right]$$
                     where $r_{t}(\theta)=\dfrac{\pi_{\theta}(a | s)}{\pi_{\theta_{old}}(a | s)}$ is the
                        probability ratio between the new and old policies, $\hat{A_{t}}$ is the estimated
                        advantage function, and $\epsilon$ is a hyperparameter controlling the update range.
                        By using this clipping mechanism, PPO ensures that the magnitude of policy updates
                        remains within a specified range, preventing the risk of policy collapse.
                     
                     Despite PPO's strong performance, policy exploration remains a challenge in high-dimensional
                        and complex tasks. To ensure a broad exploration, entropy regularization can be introduced
                        to PPO, maintaining the randomness of the policy and preventing premature convergence
                        to local optima.
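For reference, the clipped surrogate with the optional entropy bonus can be written compactly as below; the clipping range and entropy coefficient are illustrative defaults, not values prescribed by [42].

```python
# Clipped surrogate loss with an optional entropy bonus.
import torch

def ppo_loss(log_prob_new, log_prob_old, advantage, entropy,
             clip_eps: float = 0.2, entropy_coef: float = 0.01) -> torch.Tensor:
    ratio = torch.exp(log_prob_new - log_prob_old)                     # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    surrogate = torch.min(unclipped, clipped).mean()                   # clipped objective
    return -(surrogate + entropy_coef * entropy.mean())                # minimise the negative
```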
                     
                   
                  
                        2.3.4 Hierarchical Reinforcement Learning (HRL)
                     HRL [43] improves learning efficiency and addresses complex problems by decomposing tasks
                        into multiple hierarchical sub-tasks. The key idea of HRL is to introduce agents at
                        different levels, where higher-level agents select abstract sub-tasks or "options,"
                        while lower-level agents focus on executing these sub-tasks. This hierarchical structure
                        effectively reduces the decision space, speeds up convergence, and performs well in
                        long-term decision-making problems.
                     
                     One significant advantage of HRL is that it allows agents to learn at different levels,
                        improving the generalization ability of the policy. This structure is particularly
                        suitable for complex tasks in power systems, such as topology control and emergency
                        load shedding. The high-level policy determines whether to adjust the topology or
                        shed load during grid failures, while the low-level policy focuses on specific actions,
                        such as shutting down transmission lines or disconnecting loads. By task decomposition,
                        HRL significantly reduces the complexity of learning and enhances the system's response
                        efficiency and stability.
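A schematic two-level decision step for such a scheme might look as follows; high_level_policy, low_level_policies, and their methods are hypothetical names used purely for illustration, not an existing API.

```python
# Hypothetical two-level decision step for the hierarchical scheme described above.
def hierarchical_step(state, high_level_policy, low_level_policies):
    # High level: pick an abstract option, e.g. "adjust_topology" or "shed_load".
    option = high_level_policy.select_option(state)
    # Low level: the chosen option's sub-policy outputs a concrete grid action,
    # e.g. switching a specific line or disconnecting a specific load.
    action = low_level_policies[option].select_action(state)
    return option, action
```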
                     
                     However, HRL faces challenges such as defining and designing appropriate hierarchies
                        and sub-tasks. Overly complex hierarchies may destabilize the training process, while
                        overly simple hierarchies may not fully leverage the advantages of hierarchical policies.
                        Coordination between high-level and low-level policies is also a challenge, ensuring
                        effective collaboration between different levels of policies in long-term and short-term
                        goals. Furthermore, HRL typically requires longer training times, making training
                        efficiency a bottleneck.
                     
                   
                  
                        2.3.5 Safe Reinforcement Learning (SRL)
                     SRL [44] focuses on ensuring system safety during the RL process, particularly in high-risk
                        fields like power system control, where unsafe decisions must be avoided. In traditional
                        RL, agents explore and interact with the environment, trying different policies to
                        maximize cumulative rewards. However, this free exploration can lead to unsafe behaviors,
                        especially in power systems, where unsafe actions may cause instability, equipment
                        damage, or serious service disruptions. SRL aims to optimize long-term returns while
                        constraining agent behavior, ensuring that safety limits are not violated during learning
                        and in the final policy.
                     
                     SRL methods often introduce constraints to ensure safety. A common approach is to
                        embed safety constraints into the optimization objective, forming a constrained RL
                        problem. The agent must not only maximize rewards but also satisfy specific safety
                        constraints. These constraints can be enforced through penalty mechanisms, where the
                        agent is penalized for taking unsafe actions, forcing it to optimize behavior to avoid
                        future constraint violations.
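As an illustration of such a penalty mechanism, the reward-shaping sketch below subtracts a weighted measure of line-overload and voltage-limit violations from the task reward; the thresholds and penalty weight are assumptions.

```python
# Reward shaping with safety penalties: constraint violations (line overloads,
# voltage-limit breaches) reduce the reward.
def shaped_reward(base_reward, line_loadings, voltages,
                  load_limit=1.0, v_min=0.95, v_max=1.05, penalty_weight=10.0):
    overload = sum(max(0.0, rho - load_limit) for rho in line_loadings)
    voltage_violation = sum(max(0.0, v_min - v) + max(0.0, v - v_max) for v in voltages)
    return base_reward - penalty_weight * (overload + voltage_violation)
```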
                     
                     
Another method is to adopt a safe exploration policy, limiting the agent's action space during exploration to ensure that dangerous behaviors are not executed. For example, model-based SRL methods build an environment model to predict potential outcomes under different policies and avoid executing high-risk actions in advance.
                     
                     SRL methods offer significant advantages in many applications. First, they ensure
                        that agents follow safety constraints during both learning and policy execution, avoiding
                        dangerous behaviors. This is particularly useful in high-risk fields, where SRL can
                        effectively reduce system failures and losses by balancing policy optimization and
                        safety constraints. Moreover, SRL's safe exploration mechanisms restrict unsafe operations
                        during training, preventing system damage from improper exploration. SRL also improves
                        the robustness of agents, helping them maintain stable performance in uncertain and
                        unexpected environments.
                     
                     However, implementing SRL also poses challenges. Designing appropriate safety constraints
                        is a key challenge—constraints that are too strict can limit the agent's learning
                        space, while overly loose constraints may lead to safety risks. SRL also faces computational
                        pressures in high-dimensional dynamic environments, especially in complex systems
                        where computing resources and training time become bottlenecks. Furthermore, safety
                        constraints can limit the agent's exploration ability, affecting the efficiency of
                        policy optimization and convergence speed. Therefore, balancing exploration and exploitation
                        while ensuring safety remains a key challenge in SRL implementation.
                     
                   
                
             
            
                  3. Application in Power System	
In recent years, DRL has been widely used to maintain power system stability. DRL techniques enable more efficient execution of energy dispatch, topology control, and emergency load shedding, which are key measures for ensuring power system stability. Compared with traditional control methods, DRL can provide more accurate control and significantly improve system stability and response efficiency when dealing with complex grid environments. In the following, we introduce the specific applications of DRL in each of these tasks. Table 1 in the Appendix summarizes the applications of DRL algorithms in power systems.
               
               
                     3.1 Energy Dispatch
                  The growth of distributed energy and electric vehicles makes balancing power supply
                     and demand more critical, while grid scaling increases uncertainty. DRL, as a data-driven
                     method, offers adaptability by optimizing energy dispatch without relying on precise
                     models, learning through interaction with the grid. Fig. 3 visually illustrates the energy dispatch problem in a small power system, covering
                     key elements such as renewable energy sources, generating stations, loads, and energy
                     storage devices. The core objective of using DRL in this system is to ensure a balance
                     between supply and demand and to minimize energy losses. Fig. 4 further details the operational flow of the DRL methodology for optimizing the energy
                     dispatch problem and thereby reducing energy losses.
                  
                  
                        
                        
Fig. 3. Power system energy dispatch of [Fig. 7, 14].
                        
                      
                  
                        
                        
Fig. 4. DRL-based methods for optimal energy dispatch in power systems [Fig. 1, 48].
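To give a flavour of how such an objective can be encoded for a DRL agent, the reward sketch below penalizes supply-demand imbalance and network losses; the weighting coefficients and signal names are illustrative assumptions, not the formulations used in the cited works.

```python
# Illustrative dispatch reward: penalise supply-demand imbalance and network losses.
def dispatch_reward(generation_mw, demand_mw, losses_mw,
                    balance_weight=1.0, loss_weight=0.1):
    imbalance = abs(sum(generation_mw) - demand_mw)    # supply-demand mismatch
    return -(balance_weight * imbalance + loss_weight * losses_mw)
```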
                        
                      
Literature [45] models the power scheduling process as a dynamic sequential control problem and proposes an optimized coordinated scheduling policy to cope with wind and demand perturbations, modeling the problem as a Markov decision process and combining Monte Carlo methods with the Q-learning RL algorithm to reduce long-term operation and maintenance costs. Literature [46] employs an improved DRL technique, deep deterministic policy gradient (DDPG), to address the limitations of existing dispatch schemes that rely on forecasting and modeling; by modeling the dynamic dispatch problem as an MDP, it achieves an adaptive response to the uncertainty of renewable energy generation and demand fluctuations in an integrated energy system (IES). Literature [47] proposes a robust and scalable deep Q-network (DQN) based DRL optimization algorithm to balance economic cost and environmental emissions in renewable-energy-integrated electricity scheduling, decomposing the power scheduling problem into MDPs and training DRL models in a multi-agent simulation environment. Literature [48] proposes a soft actor-critic (SAC) based autonomous control approach to address the challenges posed by large-scale renewable energy integration for active power scheduling in modern power systems; by introducing Lagrange multipliers and imitation learning, it significantly improves the renewable energy consumption (utilization) rate and the robustness of the algorithm. It is worth noting that literature [45]-[48] does not compare against traditional control methods, although their performance in grid control is still excellent. To demonstrate the advantages of DRL more directly, literature [49] and [14] are compared and analyzed against traditional methods. Literature [49] proposes an optimal scheduling method based on the asynchronous advantage actor-critic (A3C) DRL algorithm, which copes with the uncertainty of renewable energy sources and users' energy demand in the IES by constructing the state space, action space, and reward function, thus realizing economic scheduling of the system and the complementarity of multiple energy sources. It is worth noting that literature [49] compares against traditional methods and achieves the same performance as mathematical programming methods. Literature [14] adopts a SAC-based DRL method to solve the energy dispatch problem in distributed grids by optimizing a reward function related to energy dispatch to find the optimal policy. Experimental results show that the method achieves results similar to the traditional model predictive control (MPC) algorithm on a 6-bus power system.
                  
                  
                        
                        
Table 1 Task types and applications of DRL algorithms in power systems (Energy dispatch:
                           ED / Topology control: TC / Emergency load shedding: ELS).
                        
                     
                     
                           
                              
| Literature | Field | Algorithm | Type | Objective | Improvement |
| [45] | ED | Q-learning | Value-based | Cost reduction | In contrast to traditional short-term cost optimization, the model provides long-term optimal scheduling. |
| [46] | ED | DDPG | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The method is more adaptable than traditional methods, as it does not rely on predictive information or knowledge of the system's uncertainty distribution. |
| [47] | ED | DQN | Value-based | Cost reduction | The model can handle multiple objectives to accommodate the complexity of modern power systems. |
| [48] | ED | SAC | Actor-Critic | Cost reduction | The model enhances the robustness of the system and improves the consumption (utilization) rate of renewable energy. |
| [49] | ED | A3C | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model outputs dispatch policies in real time, avoiding reliance on accurate source-load forecasts and effectively coping with source-load uncertainty and volatility. |
| [14] | ED | SAC | Actor-Critic | Cost reduction | The model strengthens the DRL policy and enhances the security of the resulting policy. |
| [50] | TC | DDDQN | Value-based | Safe operation of the power grid | The study suggests applying the DRL algorithm to larger, more constrained power systems and incorporating more control variables (e.g., line switching, transformer regulation, generation scheduling). |
| [51] | TC | Cross-Entropy | Policy-based | Safe operation of the power grid | The model further analyzes the topology control behavior of the agents and shows that generalization can be improved while maintaining the simplicity of the approach. |
| [52] | TC | AC | Actor-Critic | Improve topology control accuracy | The model effectively addresses the problems of sparse rewards and high-dimensional state spaces in the grid environment. |
| [53] | TC | SAC | Actor-Critic | Improve topology control accuracy | The model improves the accuracy of the DRL algorithm by introducing an attention mechanism. |
| [54] | TC | SAC | Actor-Critic | Safe operation of the power grid | The model designs a pre-training scheme for the SAC algorithm to improve the robustness and efficiency of the algorithm. |
| [55] | ELS | PARS | Actor-Critic | Improving adaptive responses to uncertainty in the grid | The model exploits the derivative-free nature and parallelism of the proposed algorithm to substantially improve training efficiency. |
| [56] | ELS | DDDQN | Value-based | Improve emergency load shedding accuracy | The model effectively extracts the topological features of the grid, optimizes the emergency control policy, and improves the stability and economy of the grid under frequent topological changes. |
| [57] | ELS | DDPG | Actor-Critic | Safe operation of the power grid | The model improves the training process and enhances the generalization of the control policy by introducing voltage information into the reward function. |
| [58] | ELS | DQN | Value-based | Improve emergency load shedding accuracy | The model improves the system's adaptability and online operational performance in unseen scenarios through spatio-temporal information modeling and the applied control policies. |
| [59] | ELS | PPO | Policy-based | Improving adaptive responses to uncertainty in the grid | The model enhances the frequency recovery capability of the system through a DRL-optimized control policy and avoids triggering the system safety constraints. |
| [60] | ELS | DDQN | Value-based | Improve emergency load shedding accuracy | The model improves training efficiency and decision quality through knowledge-enhanced DRL. |
                        
                     
                   
                  In summary, the application of DRL in power system scheduling significantly improves
                     the adaptive capability and robustness of the system, enabling it to effectively cope
                     with fluctuations in renewable energy sources and demand uncertainty. By optimizing
                     the scheduling policy and reducing the reliance on complex mathematical models, DRL
                     provides new solutions for achieving cost-effective energy management. 
                  
                
               
                     3.2 Topology Control
                  Topology control is crucial for maintaining power system stability by dynamically
                     adjusting transmission line connections, optimizing current distribution, and  reducing
                     load on specific lines or nodes. The challenge lies in quickly finding the optimal
                     solution within the complex topology while meeting real-time regulation and stability
                     requirements. Fig. 5 illustrates a topology control problem for a small power system consisting of substations,
                     loads, generators, and energy storage devices. Topology control aims to optimize system
                     operation by adjusting the configuration of power lines or substations to improve
power supply reliability and efficiency. Fig. 6 further illustrates how DRL operates in the topology control task, where the topology is mainly optimized by dynamically adjusting the substation configuration.
                  
                  
                        
                        
Fig. 5. Power system topology control of [Fig. 1, 50]. 
                        
                      
                  
                        
                        
Fig. 6. DRL-based methods for topology control problems in power systems   [Fig. 2, 52].
                        
                      
Literature [50] uses a DRL method based on dueling double deep Q-learning (DDDQN) with prioritized experience replay to achieve secure power system operation through autonomous topology adjustment, addressing the difficulty traditional methods face in coping with complex grid control. Literature [51] uses a Cross-Entropy Method RL approach to train intelligent agents to control power flow in the power grid through topology switching operations, addressing the stability problem of power grid operation under uncertain generation and demand conditions.
                     Facing the problem of high-dimensional topological space in power systems, literature
                     [52] proposes a HRL-based method for grid topology regulation. The method extends the
                     actor-critic model in DRL to a hierarchical structure, where the upper layer generates
                     a desired topology configuration scheme based on the current state of the grid, and
                     the lower layer is responsible for executing specific policies to achieve this goal.
                     With this hierarchical architecture, the complexity of the high-dimensional state-action
space in topology control is effectively mitigated, which improves the accuracy and efficiency of the control policy. Literature [53] proposes a SAC-based DRL method incorporating an attention mechanism, aiming to manage the power system by adjusting the grid topology. By assigning different weights to the input features, the attention mechanism allows the neural network to focus on the features most relevant to the current task, improving the robustness and computational efficiency of the model. Literature [54] proposes a DRL-based approach to achieve stable autonomous control of power systems
                     through autonomous topology optimization control using the SAC algorithm, while introducing
                     an imitation learning (IL)-based pre-training scheme to cope with the huge action
                     space in topology switching and the vulnerability of DRL agents in power systems.
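As a schematic illustration of how a value-based agent chooses among an enumerated set of discrete topology actions (line switchings, substation/bus reconfigurations, or "do nothing"), consider the hypothetical sketch below; the feature size, action count, and network shape are assumptions, not the configurations used in [50]-[54].

```python
# Hypothetical greedy selection over an enumerated set of discrete topology actions.
import torch
import torch.nn as nn

n_features, n_topology_actions = 128, 200
q_net = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                      nn.Linear(256, n_topology_actions))

def select_topology_action(observation_vector: torch.Tensor) -> int:
    """Return the index of the topology action with the highest estimated Q-value."""
    with torch.no_grad():
        q_values = q_net(observation_vector)
    return int(q_values.argmax())
```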
                  
                  In summary, recent studies have demonstrated the great potential of DRL in power system
                     topology control. Aiming at the complex and huge topology space problem, which is
                     difficult to cope with by traditional methods, these emerging methods seek the optimal
                     policy through autonomous decision-making and dynamic adjustment, which not only improves
                     the stability and reliability of the power system but also enhances its ability to
                     cope with unexpected situations.
                  
                
               
                     3.3 Emergency Load Shedding
                  Emergency load shedding is a key measure for  maintaining power system stability during
                     faults, overloads, or unexpected events. It reduces grid stress and protects equipment,
                     but the challenge lies in making quick decisions, assessing load priorities, coordinating
                     equipment communication, and managing diverse load characteristics for efficient shedding.
                     As shown in Fig. 7, the emergency load shedding problem in a medium-sized power system is illustrated,
                     where Bus 4, Bus 7, and Bus 18 are heavily loaded areas. During an overload event,
                     shedding loads in these areas helps alleviate system stress. Fig. 8 demonstrates that during an emergency load shedding task, voltage initially drops
                     due to the fault but gradually recovers as the load is reduced. The curve represents
                     the voltage recovery standard, indicating that voltage should remain above the curve
                     throughout the recovery process. To optimize load-shedding decisions, Fig. 9 presents a flowchart of using DRL to search for an effective shedding policy.
                  
In [55], an accelerated DRL algorithm named "PARS" was developed to address the low computational efficiency and poor scalability of existing load shedding methods for power system voltage stability control, improving power system stability under uncertainty and rapidly changing operating conditions through efficient and fast adaptive control. Literature [56] proposed a contingency control scheme for under-voltage load shedding (UVLS) based on the GraphSAGE-DDDQN method, aiming to solve the insufficient adaptability and generalization ability of existing UVLS techniques in coping with topology changes in power networks, so as to improve the reliability and economy of the control policy. Literature [57] proposes a load shedding control policy based on the DDPG deep reinforcement learning algorithm to address the challenge of realizing autonomous voltage control in the event of power system faults; it effectively improves the stable operation of the power system by constructing a network training dataset and establishing a reward function that reflects the operational characteristics of the power grid. Literature [58] proposes a load shedding policy based on a deep Q-network (DQN-LS) and a convolutional long short-term memory network (ConvLSTM), aiming to improve the stability and voltage restoration capability of large-scale power systems in dynamic load shedding problems through real-time, fast, and accurate load shedding decisions, especially under varied and uncertain power system fault conditions. Literature [59] proposes a data-driven, DRL-based emergency load curtailment approach that transforms the load curtailment policy into an MDP and optimizes it using the proximal policy optimization (PPO) algorithm, addressing the model complexity and matching risk faced by traditional event-driven load curtailment policies in renewable energy systems and thus improving the system's adaptability and efficiency under multi-fault scenarios. Literature [60] proposes a knowledge-enhanced DDQN DRL approach for intelligent event-driven load shedding (ELS), which addresses the shortcomings of traditional methods in efficiency and timeliness by building an MDP based on transient stability simulation and incorporating knowledge that removes repetitive and negative actions, thereby improving training efficiency and the quality of decision-making for the effective formulation of load shedding measures.
                  
                  
                        
                        
Fig. 7. Power system emergency load shedding [Fig. 13, 14].
                        
                      
                  
                        
                        
Fig. 8. Power system emergency load shedding voltage recovery curve [Fig. 4, 14].
                        
                      
                  
                        
                        
Fig. 9. DRL-based methods for emergency load shedding in power systems [Fig. 3, 61].
                        
                      
In summary, recent studies have demonstrated the potential of a variety of DRL-based load shedding control methods for dealing with power system contingencies. Through innovative algorithm design and flexible decision-making mechanisms, these methods effectively enhance the stability and responsiveness of the power system in the face of faults and uncertainties, while improving the accuracy of load shedding.
                  
                
             
            
                  4. Challenges and Future Directions	
Although DRL has achieved many successes in enhancing power system stability through the three main control means of energy dispatch, topology control, and emergency load shedding, it still faces several challenges: current research mainly focuses on single-task optimization and rarely considers multi-task coordination; the uncertainty and diversity of renewable energy sources complicate their integration; and DRL must be deployed reliably while safeguarding grid security. In addition, the gap between the simulation environment and real grid operation (Sim2Real) limits the wide application of DRL in real power systems. Therefore, this section systematically discusses these key challenges and explores feasible solutions to further promote the research and development of DRL in power systems.
               
               
                     4.1 Multi-task Coordination
In current studies, energy dispatch, topology control, and emergency load shedding are usually modeled and optimized separately as individual tasks. In the operation of real power systems, however, these tasks are highly interdependent, and their interactions may lead to locally rather than globally optimal decisions. Traditional methods often rely on heuristic search or staged optimization, which lack a global perspective and therefore struggle to achieve efficient synergy among energy dispatch, topology control, and emergency load shedding. At the same time, most existing DRL studies adopt single-task learning and ignore the intrinsic connections among the three tasks, producing policies that are difficult to generalize to multi-task scenarios. Existing reinforcement learning methods such as DQN and PPO are mainly designed for single-objective optimization, which makes it difficult to coordinate multiple control means in a complex power system environment. Future studies should therefore explore a unified multi-task coordination framework, as shown in Fig. 10, in which DRL can dynamically switch between tasks and perform comprehensive optimization from a global perspective, thereby enhancing the security and efficiency of the entire power grid. Multi-task learning (MTL) can be used to develop DRL models that simultaneously optimize energy dispatch, topology control, and emergency load shedding, leveraging information sharing across tasks. Meanwhile, hierarchical reinforcement learning (HRL) can improve the flexibility and adaptability of decision-making by constructing high-level and low-level policies, where the high-level policy intelligently selects the appropriate control means and the low-level policies execute the corresponding optimization tasks. In addition, multi-objective reinforcement learning (MORL) can adaptively adjust the priorities of the three control means by introducing weighting factors, achieving integrated scheduling optimization that avoids local optima and promotes global stability. For constructing global features, graph neural networks (GNNs), self-attention, and related methods can help DRL learn the dynamic changes of the grid structure and can be combined with optimization policies such as energy storage management to improve system responsiveness to unexpected events. Such multi-task coordination not only enhances the flexibility of the grid but also improves the robustness and reliability of decision-making under uncertainty, promoting the development of intelligent power system regulation toward greater efficiency and security.
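As a minimal sketch of the MTL/HRL idea described above (and not a published framework), the following PyTorch snippet uses one shared grid-state encoder with separate heads for dispatch, topology control, and load shedding, plus a simple high-level selector that chooses which control means to apply; the layer sizes, action counts, and input dimension are illustrative assumptions.

    # Minimal multi-task policy sketch: shared encoder + task-specific heads.
    import torch
    import torch.nn as nn

    class MultiTaskGridPolicy(nn.Module):
        def __init__(self, obs_dim=128, n_gen=10, n_topo_actions=50, n_shed_actions=20):
            super().__init__()
            # Shared representation of the grid state, reused by all three tasks (MTL)
            self.encoder = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
            )
            # Task-specific heads
            self.dispatch_head = nn.Linear(256, n_gen)            # continuous generator set-point adjustments
            self.topology_head = nn.Linear(256, n_topo_actions)   # discrete switching actions
            self.shedding_head = nn.Linear(256, n_shed_actions)   # discrete shedding actions
            # Simple high-level selector in the spirit of HRL: which control means to use now
            self.task_selector = nn.Linear(256, 3)

        def forward(self, obs):
            z = self.encoder(obs)
            return {
                "task_logits": self.task_selector(z),
                "dispatch": torch.tanh(self.dispatch_head(z)),
                "topology_logits": self.topology_head(z),
                "shedding_logits": self.shedding_head(z),
            }

    policy = MultiTaskGridPolicy()
    out = policy(torch.randn(1, 128))
    chosen_task = out["task_logits"].argmax(dim=-1)   # high-level choice of control means

In such a design, the per-task losses would be weighted and back-propagated through the shared encoder, so that experience gathered on one control task regularizes the representations used by the others.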
                  
                  
                        
                        
Fig. 10. DRL-based multi-task coordination framework.
                      
                
               
                     4.2 Renewable Energy Integration
As renewable energy sources such as wind and solar are increasingly integrated into modern power systems, managing their variability and uncertainty has become a key challenge for DRL research. As shown in Fig. 11, current research has focused on optimizing renewable energy consumption in specific scenarios, while future research should address how to effectively integrate multiple renewable energy sources and optimize their dynamic allocation in the grid. Different types of renewable energy (e.g., wind, photovoltaic, and hydropower) differ significantly in output patterns, temporal characteristics, and spatial distribution, yet standard DRL methods tend to assume fixed input characteristics of the power system and ignore the change in system dynamics caused by varying shares of renewable generation; as a result, existing methods lack adaptability and generalization when facing the multi-source integration problem. Future research should therefore consider how to enable DRL to quickly adapt to different renewable energy environments and improve its ability to cope with grid volatility. For example, DRL methods based on meta-learning can enable the agent to learn and adapt quickly in different types of renewable energy environments and improve its ability to handle changes in system dynamics. In addition, uncertainty modeling techniques such as Bayesian DRL or probabilistic graphical models (PGMs) can allow DRL to handle the uncertainty of renewable energy sources more effectively, thus improving the robustness of scheduling decisions. Some studies have attempted to optimize the co-dispatch of wind and PV systems using DRL combined with energy storage management, but challenges in generalization and adaptability remain. Future work should therefore focus on improving the stability of DRL under uncertain environments so that it can effectively integrate different renewable energy resources under dynamic grid conditions and ultimately improve the overall stability and operational efficiency of the grid.
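To illustrate the meta-learning direction mentioned above, the sketch below shows a first-order (Reptile-style) meta-update across synthetic "renewable-mix" scenarios: a copy of the meta-parameters is adapted to each scenario for a few gradient steps, and the meta-parameters are then nudged toward the adapted ones. The scenario generator, the regression objective, and all hyperparameters are stand-ins for a real dispatch-policy training loop and are not drawn from the cited works.

    # Minimal first-order meta-learning (Reptile-style) sketch over synthetic scenarios.
    import copy
    import torch
    import torch.nn as nn

    def make_scenario(wind_share):
        """Synthetic task: map a 24-h net-load pattern to a target that depends on the wind share."""
        x = torch.randn(64, 24)
        y = (1.0 - wind_share) * x.mean(dim=1, keepdim=True) + 0.1 * torch.randn(64, 1)
        return x, y

    meta_model = nn.Sequential(nn.Linear(24, 64), nn.ReLU(), nn.Linear(64, 1))
    meta_lr, inner_lr, inner_steps = 0.1, 1e-2, 5

    for meta_iter in range(100):
        wind_share = torch.rand(1).item()          # sample a renewable-mix "environment"
        x, y = make_scenario(wind_share)

        # Inner loop: adapt a copy of the meta-parameters to this scenario
        fast_model = copy.deepcopy(meta_model)
        opt = torch.optim.SGD(fast_model.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            loss = nn.functional.mse_loss(fast_model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        # Outer (Reptile) update: move meta-parameters toward the adapted parameters
        with torch.no_grad():
            for p_meta, p_fast in zip(meta_model.parameters(), fast_model.parameters()):
                p_meta += meta_lr * (p_fast - p_meta)

The same inner/outer structure could wrap a DRL update instead of the toy regression loss, so that a dispatch agent meta-trained across many renewable mixes adapts within a few episodes to an unseen one.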
                  
                  
                        
                        
Fig. 11. Challenges of DRL in integrating multiple renewable energies.
                      
                
               
                     4.3 Safety Constraints and Sim2Real
Security constraints are crucial for the practical application of DRL in power systems, but existing DRL methods may explore infeasible or even dangerous decisions during training, such as over-adjusting the topology, which increases the vulnerability of the grid, or adopting extreme load shedding policies, which leads to power flow imbalance or even violates grid security standards. Therefore, as shown in Fig. 12, future research needs to explore how to incorporate power system domain knowledge into the DRL decision framework and introduce physical security constraints during training, so that the policies learned by the model always comply with grid security requirements. For example, constrained DRL or barrier function methods can enable DRL to adaptively avoid security risks during learning and ensure that its control policy does not destabilize the power grid. In addition, the application of DRL in power systems faces the Sim2Real (simulation-to-reality) problem: the difference between the simulation environment and actual grid operation may cause a trained model to fail in the real environment. Existing research mainly relies on simulation environments for DRL training, but deviations between the simulation model and the physical characteristics of the real grid mean that a DRL policy that performs well in simulation may lack the generalization ability to cope with the complex operating conditions of the real grid once deployed. Future research should therefore explore how to narrow this gap, for example by using transfer learning and imitation learning to make policy transfer between environments more stable, or by adopting model-free online DRL so that the agent can continuously learn and optimize its control policy directly on the real grid, thereby improving practical adaptability. Finally, the interpretability of DRL remains a key challenge limiting its application in grid control: its decision logic is often hard to understand, making power engineers hesitant to trust its policies and thus hindering deployment in safety-critical systems. To address this, approaches such as attention mechanisms and causal inference can enhance interpretability, making decision-making more transparent and providing a visualizable basis for decisions, thereby improving the trustworthiness and usefulness of DRL in power systems.
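The two safety mechanisms discussed above can be illustrated with a short sketch: a projection step that clips a raw DRL action into an assumed operating envelope before it reaches the grid, and a Lagrangian penalty whose multiplier grows while constraint violations persist. The limits, the violation budget, and the learning rate are made-up values for illustration; this is not tied to a specific published method or to a real grid model.

    # Minimal sketch of a safety projection layer and a Lagrangian constraint penalty.
    import numpy as np

    MAX_SHED_FRACTION = 0.2        # never shed more than 20% of any bus load (assumed limit)
    LINE_LIMIT_MW = 100.0          # assumed thermal limit used by the constraint signal
    VIOLATION_BUDGET_MW = 5.0      # tolerated total overload before the penalty tightens

    def project_action(raw_shed_fractions):
        """Clip the agent's proposed per-bus shedding fractions into the safe envelope."""
        return np.clip(raw_shed_fractions, 0.0, MAX_SHED_FRACTION)

    class LagrangianPenalty:
        """Tracks a dual variable lambda and shapes the reward with the constraint cost."""
        def __init__(self, lr=0.01):
            self.lam = 0.0
            self.lr = lr

        def shaped_reward(self, task_reward, line_flows_mw):
            violation = np.clip(np.abs(line_flows_mw) - LINE_LIMIT_MW, 0.0, None).sum()
            # Dual ascent: lambda grows while violations exceed the budget, relaxes otherwise
            self.lam = max(0.0, self.lam + self.lr * (violation - VIOLATION_BUDGET_MW))
            return task_reward - self.lam * violation

    penalty = LagrangianPenalty()
    safe_action = project_action(np.array([0.05, 0.35, 0.10]))   # -> [0.05, 0.2, 0.1]
    r = penalty.shaped_reward(task_reward=1.0, line_flows_mw=np.array([90.0, 120.0]))

In a constrained DRL training loop, the projected action would be what the environment executes, while the shaped reward (or a separate cost critic) steers the policy away from operating points that repeatedly violate the limits.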
                  
                  
                        
                        
Fig. 12. Safety learning and Sim2Real gap minimization.
                      
                
             
            
                  5. Conclusion	
               Amid global decarbonization efforts, modern power systems are becoming increasingly
                  complex due to the large-scale integration of renewable energy, posing significant
                  challenges to system stability and operational efficiency. DRL has emerged as a promising
                  solution to address these challenges, offering adaptive learning and decision-making
                  capabilities that surpass traditional optimization methods in high-dimensional and
                  dynamic environments. This paper provides a systematic overview of DRL applications
                  in power systems, with a particular focus on its optimization strategies for energy
                  dispatch, topology control, and emergency load shedding.
               
               Our findings highlight the significant advancements of DRL in optimizing these control
                  measures, demonstrating its potential to enhance power system stability, flexibility,
                  and resilience. However, key challenges remain, including multi-task coordination,
                  renewable energy integration, safety constraints, and Sim2Real transferability. Addressing
                  these challenges will ensure the practical deployment and effectiveness of DRL-based
                  solutions in real-world power systems.
               
               As power grids continue to evolve, the insights provided in this paper establish a
                  foundation for future research, guiding the development of more robust, efficient,
                  and safe DRL frameworks. Advancing these research directions will not only drive innovation
                  in power system control but also play a crucial role in supporting the transition
                  toward a more sustainable and intelligent energy infrastructure.
               
             
          
         
            
                  Acknowledgements
               
This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in part by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).
                  
                  			
               
             
            
                  
                     References
                  
                     
                        
                        R. Detchon and R. Van Leeuwen, “Policy: Bring sustainable energy to the developing
                           world,” Nature, vol. 508, no. 7496, pp. 309–311, 2014. DOI:10.1038/508309a

 
                     
                        
                        H. Hu, N. Xie, D. Fang and X. Zhang, “The role of renewable energy consumption and
                           commercial services trade in carbon dioxide reduction: Evidence from 25 developing
countries,” Applied Energy, vol. 211, pp. 1229–1244, 2018. DOI:10.1016/j.apenergy.2017.12.019

 
                     
                        
                        J. Wu, J. Yan, H. Jia, N. Hatziargyriou, N. Djilali and H. Sun, “Integrated energy
systems,” Applied Energy, vol. 167, pp. 155–157, 2016. DOI:10.1016/j.apenergy.2016.02.075

 
                     
                        
                        B. Kroposki, “Integrating high levels of variable renewable energy into electric power
                           systems,” Journal of Modern Power Systems and Clean Energy, vol. 5, no. 6, pp. 831–837,
                           2017. DOI:10.1007/s40565-017-0339-3

 
                     
                        
                        M. L. Tuballa and M. L. Abundo, “A review of the development of smart grid technologies,”
                           Renewable and Sustainable Energy Reviews, vol. 59, pp. 710–725, 2016. DOI:10.1016/j.rser.2016.01.011

 
                     
                        
                        J. Keirstead, M. Jennings and A. Sivakumar, “A review of urban energy system models:
                           Approaches, challenges and opportunities,” Renewable and Sustainable Energy Reviews,
                           vol. 16, no. 6, pp. 3847–3866, 2012. DOI:10.1016/j.rser.2012.02.047

 
                     
                        
                        M. F. Zia, E. Elbouchikhi and M. Benbouzid, “Microgrids energy management systems:
A critical review on methods, solutions, and prospects,” Applied Energy, vol. 222,
                           pp. 1033–1055, 2018. DOI:10.1016/j.apenergy.2018.04.103

 
                     
                        
                        S. Impram, S. V. Nese and B. Oral, “Challenges of renewable energy penetration on
                           power system flexibility: A survey,” Energy Strategy Reviews, vol. 31,   no. 100539,
                           pp. 1-12, 2020. DOI:10.1016/j.esr.2020.100539

 
                     
                        
                        D. Liu, Q. Yang, Y. Chen, X. Chen and J. Wen, “Optimal parameters and placement of
                           hybrid energy storage systems for frequency stability improvement,” Protection and
                           Control of Modern Power Systems, vol. 10, no. 2, pp. 40–53, 2025. DOI:10.23919/PCMP.2023.000259

 
                     
                        
                        K. Liu, Z. Chen, X. Li and Y. Gao, “Analysis and control parameters optimization of
                           wind turbines participating in power system primary frequency regulation with the
                           consideration of secondary frequency drop,” Energies, vol. 18, no. 6, pp. 1–19, 2025.
                           DOI:10.3390/en18061317

 
                     
                        
                        M. Dahane, A. Benali, H. Tedjini, A. Benhammou, M. A. Hartani and H. Rezk, “Optimized
                           double-stage fractional order controllers for dfig-based wind energy systems: A comparative
                           study,” Results in Engineering, vol. 25, no. 104584, pp. 1-17, 2025. DOI:10.1016/j.rineng.2025.104584

 
                     
                        
                        L. Cheng and T. Yu, “A new generation of ai: A review and perspective on machine learning
                           technologies applied to smart energy and electric power systems,” International Journal
                           of Energy Research, vol. 43, no. 6, pp. 1928–1973, 2019. DOI:10.1002/er.4333

 
                     
                        
                        M. M. Gajjala and A. Ahmad, “A survey on recent advances in transmission congestion
                           management,” International Review of Applied Sciences and Engineering, vol. 13, no.
                           1, pp. 29–41, 2021. DOI:10.1556/1848.2021.00286

 
                     
                        
                        H. Zhang, X. Sun, M. H. Lee and J. Moon, “Deep reinforcement learning based active
                           network management and emergency load-shedding control for power systems,” IEEE Transactions
                           on Smart Grid, vol. 15, no. 2, pp. 1423-1437, 2023. DOI:10.1109/TSG.2023.3302846

 
                     
                        
                        S. M. Mohseni-Bonab, I. Kamwa, A. Rabiee and C. Chung, “Stochastic optimal transmission
                           switching: A novel approach to enhance power grid security margins through vulnerability
                           mitigation under renewables uncertainties,” Applied Energy, vol. 305, no. 117851,
pp. 1-14, 2022. DOI:10.1016/j.apenergy.2021.117851

 
                     
                        
                        D. Michaelson, H. Mahmood and J. Jiang, “A predictive energy management system using
                           pre-emptive load shedding for islanded photovoltaic microgrids,” IEEE Transactions
                           on Industrial Electronics, vol. 64, no. 7, pp. 5440–5448, 2017. DOI:10.1109/TIE.2017.2677317

 
                     
                        
                        R. S. Sutton, “Reinforcement learning: An introduction,” A Bradford Book, 2018. DOI:10.1017/S0263574799271172

 
                     
                        
                        D. Cao, W. Hu, J. Zhao, G. Zhang, B. Zhang, Z. Liu, Z. Chen and F. Blaabjerg, “Reinforcement
                           learning and its applications in modern power and energy systems: A review,” Journal
of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1029–1042, 2020. DOI:10.35833/MPCE.2020.000552

 
                     
                        
                        E. Mocanu, D. C. Mocanu, P. H. Nguyen, A. Liotta, M. E. Webber, M. Gibescu and J.
                           G. Slootweg, “On-line building energy optimization using deep reinforcement learning,”
IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3698–3708, 2018. DOI:10.1109/TSG.2018.2834219

 
                     
                        
                        Y. Zhang, X. Wang, J. Wang and Y. Zhang, “Deep reinforcement learning based volt-var
                           optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol.
                           12, no. 1, pp. 361–371, 2020. DOI:10.1109/TSG.2020.3010130

 
                     
                        
                        Z. Yan and Y. Xu, “Data-driven load frequency control for stochastic power systems:
                           A deep reinforcement learning method with continuous action search,” IEEE Transactions
                           on Power Systems, vol. 34, no. 2, pp. 1653–1656, 2018. DOI:10.1109/TPWRS.2018.2881359

 
                     
                        
                        Q. Huang, R. Huang, W. Hao, J. Tan, R. Fan and Z. Huang, “Adaptive power system emergency
                           control using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol.
                           11, no. 2, pp. 1171–1182, 2019. DOI:10.1109/TSG.2019.2933191

 
                     
                        
                        Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications:
                           An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225,
                           2019. DOI:10.17775/CSEEJPES.2019.00920

 
                     
                        
                        Q. Li, T. Lin, Q. Yu, H. Du, J. Li and X. Fu, “Review of deep reinforcement learning
                           and its application in modern renewable power system control,” Energies, vol. 16,
                           no. 10, pp. 1–23, 2023. DOI:10.3390/en16104143

 
                     
                        
                        J. N. Tsitsiklis, “Asynchronous stochastic approximation and q-learning,” Machine
                           learning, vol. 16, pp. 185–202, 1994. DOI:10.1007/BF00993306

 
                     
                        
                        A. Agarwal, S. M. Kakade, J. D. Lee and G. Mahajan, “Optimality and approximation
                           with policy gradient methods in markov decision processes,” in Conference on Learning
Theory. PMLR, vol. 125, pp. 64–66, 2020. https://proceedings.mlr.press/v125/agarwal20a.html

 
                     
                        
                        H. Wang and B. Raj, “On the origin of deep learning,” arXiv preprint arXiv:1702.07800,
                           2017. DOI:10.48550/arXiv.1702.07800

 
                     
                        
Y. LeCun, Y. Bengio and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp.
                           436–444, 2015. DOI:10.1038/nature14539

 
                     
                        
                        R. Sun, “Optimization for deep learning: theory and algorithms,” arXiv preprint arXiv:1912.08957,
                           2019. DOI:10.48550/arXiv.1912.08957

 
                     
                        
J. Tsitsiklis and B. Van Roy, “Analysis of temporal-difference learning with function
                           approximation,” Advances in neural information processing systems, vol. 9, pp. 1-7,
                           1996. https://proceedings.neurips.cc/paper_files/paper/1996/file/e00406144c1e7e35240afed70f34166a-Paper.pdf

 
                     
                        
                        C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, pp. 279–292, 1992.
                           DOI:10.1007/BF00992698

 
                     
                        
                        V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
                           M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through
deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236

 
                     
                        
                        J. Fan, Z. Wang, Y. Xie and Z. Yang, “A theoretical analysis of deep q-learning,”
                           in Learning for dynamics and control. PMLR, vol. 120, pp. 486–489, 2020. https://proceedings.mlr.press/v120/yang20a.html

 
                     
                        
                        H. Van Hasselt, A. Guez and D. Silver, “Deep reinforcement learning with double q-learning,”
                           in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1,
                           pp. 2094-2100, 2016. DOI:10.1609/aaai.v30i1.10295

 
                     
                        
                        Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot and N. Freitas, “Dueling network
                           architectures for deep reinforcement learning,” in International conference on machine
learning. PMLR, vol. 48, pp. 1995–2003, 2016. https://proceedings.mlr.press/v48/wangf16.html

 
                     
                        
                        D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra and M. Riedmiller, “Deterministic
                           policy gradient algorithms,” in International conference on machine learning. PMLR,
                           vol. 32, no. 1, pp. 387–395, 2014. https://proceedings.mlr.press/v32/silver14.html

 
                     
                        
                        R. S. Sutton, D. McAllester, S. Singh and Y. Mansour, “Policy gradient methods for
                           reinforcement learning with function approximation,” Advances in neural information
                           processing systems, vol. 12, pp. 1-7, 1999. https://proceedings.neurips.cc/paper_files/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

 
                     
                        
                        T. Degris, M. White and R. S. Sutton, “Off-policy actor-critic,” arXiv preprint arXiv:1205.4839,
                           2012. DOI:10.48550/arXiv.1205.4839

 
                     
                        
                        S. Li, S. Bing and S. Yang, “Distributional advantage actor-critic,” arXiv preprint
                           arXiv:1806.06914, 2018. DOI:10.48550/arXiv.1806.06914

 
                     
                        
                        V. Mnih, “Asynchronous methods for deep reinforcement learning,” arXiv preprint arXiv:1602.01783,
                           2016. DOI:10.48550/arXiv.1602.01783

 
                     
                        
                        T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu,
                           A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv
                           preprint arXiv:1812.05905, 2018. DOI:10.48550/arXiv.1812.05905

 
                     
                        
                        J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization
                           algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347

 
                     
                        
                        S. Pateria, B. Subagdja, A.-h. Tan and C. Quek, “Hierarchical reinforcement learning:
                           A comprehensive survey,” ACM Computing Surveys (CSUR), vol. 54, no. 5, pp. 1–35, 2021.
                           DOI:10.1145/3453160

 
                     
                        
                        S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang and A. Knoll, “A review of safe
                           reinforcement learning: Methods, theories and applications,” IEEE Transactions on
                           Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 11216–11235, 2024.
                           DOI:10.1109/TPAMI.2024.3457538

 
                     
                        
                        F. Meng, Y. Bai and J. Jin, “An advanced real-time dispatching strategy for a distributed
                           energy system based on the reinforcement learning algorithm,” Renewable energy, vol.
                           178, pp. 13–24, 2021. DOI:10.1016/j.renene.2021.06.032

 
                     
                        
                        T. Yang, L. Zhao, W. Li and A. Y. Zomaya, “Dynamic energy dispatch strategy for integrated
                           energy system based on improved deep reinforcement learning,” Energy, vol. 235, no.
                           121377, pp. 1-15, 2021. DOI:10.1016/j.energy.2021.121377

 
                     
                        
                        A. S. Ebrie and Y. J. Kim, “Reinforcement learning-based optimization for power scheduling
                           in a renewable energy connected grid,” Renewable Energy, vol. 230, no. 120886, pp.
                           1-27, 2024. DOI:10.1016/j.renene.2024.120886

 
                     
                        
                        X. Han, C. Mu, J. Yan and Z. Niu, “An autonomous control technology based on deep
                           reinforcement learning for optimal active power dispatch,” International Journal of
                           Electrical Power & Energy Systems, vol. 145, no. 108686, pp. 1-10, 2023. DOI:10.1016/j.ijepes.2022.108686

 
                     
                        
                        X. Zhou, J. Wang, X. Wang and S. Chen, “Optimal dispatch of integrated energy system
                           based on deep reinforcement learning,” Energy Reports, vol. 9, pp. 373–378, 2023.
                           DOI:10.1016/j.egyr.2023.09.157

 
                     
                        
                        I. Damjanović, I. Pavić, M. Puljiz and M. Brcic, “Deep reinforcement learning-based
                           approach for autonomous power flow control using only topology changes,” Energies,
                           vol. 15, no. 19, pp. 1-16, 2022. DOI:10.3390/en15196920

 
                     
                        
                        M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid
                           topology reconfiguration using a simple deep reinforcement learning approach,” in
                           2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879

 
                     
                        
                        Z. Yang, Z. Qiu, Y. Wang, C. Yan, X. Yang and G. Deconinck, “Power grid topology regulation
                           method based on hierarchical reinforcement learning,” in 2024 Second International
                           Conference on Cyber-Energy Systems and Intelligent Energy (ICCSIE), pp. 1–6, 2024.
                           DOI:10.1109/ICCSIE61360.2024.10698617

 
                     
                        
                        Z. Qiu, Y. Zhao, W. Shi, F. Su and Z. Zhu, “Distribution network topology control
                           using attention mechanism-based deep reinforcement learning,” in 2022 4th International
                           Conference on Electrical Engineering and Control Technologies (CEECT), pp. 55–60,
                           2022. DOI:10.1109/CEECT55960.2022.10030642

 
                     
                        
                        X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous
                           control approach for power system topology optimization,” in 2022 41st Chinese Control
                           Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073

 
                     
                        
                        R. Huang, Y. Chen, T. Yin, X. Li, A. Li, J. Tan, W. Yu, Y. Liu and Q. Huang, “Accelerated
                           deep reinforcement learning based load shedding for emergency voltage control,” arXiv
                           preprint arXiv:2006.12667, 2020. DOI:10.48550/arXiv.2006.12667

 
                     
                        
                        Y. Pei, J. Yang, J. Wang, P. Xu, T. Zhou and F. Wu, “An emergency control strategy
                           for undervoltage load shedding of power system: A graph deep reinforcement learning
                           method,” IET Generation, Transmission & Distribution, vol. 17, no. 9, pp. 2130–2141,
                           2023. DOI:10.1049/gtd2.12795

 
                     
                        
                        J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency
                           state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems,
                           vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120

 
                     
                        
                        J. Zhang, Y. Luo, B. Wang, C. Lu, J. Si and J. Song, “Deep reinforcement learning
                           for load shedding against short-term voltage instability in large power systems,”
                           IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 8, pp. 4249–4260,
                           2021. DOI:10.1109/TNNLS.2021.3121757

 
                     
                        
                        H. Chen, J. Zhuang, G. Zhou, Y. Wang, Z. Sun and Y. Levron, “Emergency load shedding
                           strategy for high renewable energy penetrated power systems based on deep reinforcement
                           learning,” Energy Reports, vol. 9, pp. 434–443, 2023. DOI:10.1016/j.egyr.2023.03.027

 
                     
                        
                        Z. Hu, Z. Shi, L. Zeng, W. Yao, Y. Tang and J. Wen, “Knowledge-enhanced deep reinforcement
                           learning for intelligent event-based load shedding,” International Journal of Electrical
                           Power & Energy Systems, vol. 148, no. 108978, pp. 1-11, 2023. DOI:10.1016/j.ijepes.2023.108978

 
                     
                        
                        Y. Zhang, M. Yue and J. Wang, “Adaptive load shedding for grid emergency control via
                           deep reinforcement learning,” in 2021 IEEE Power & Energy Society General Meeting
                           (PESGM). pp. 1-5, 2021. DOI:10.1109/PESGM46819.2021.9638058

 
                   
                
             
Biographies
            
            Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University
               of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea,
               in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang
               University, Seoul, South Korea. His research interests include optimal control, smart
               grid, deep reinforcement learning, and their applications.
            
            
            Chen Wang received the B.S. degree in electronics and computer engineering and the
               M.S. degree in electronic computer engineering from Chonnam National University, South
               Korea, in 2020 and 2022. He is currently pursuing the Ph.D. degree in electrical engineering
               at Hanyang University, Seoul, South Korea. His research interests include smart grid,
               deep reinforcement learning, and their applications.
            
            
            Minju Lee received the B.S. degree in climate and energy systems engineering from
               Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing
               the degree with the Department of Climate and Energy Systems Engineering. Her research
               interests include short-term wind power forecasting and the probabilistic estimation
               of transmission congestion for grid integration.
            
            
            Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook
               National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical
               engineering from the Ulsan National Institute of Science and Technology, Ulsan, South
               Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the
               Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul,
               South Korea. He is currently an Assistant Professor with the Department of Electrical
               Engineering, Incheon National University, Incheon, South Korea. His research interests
               include decentralized optimal control, mean field games, deep reinforcement learning,
               and their applications.
            
            
            Jun Moon is currently an Associate Professor in the Department of Electrical Engineering
               at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical
               and computer engineering, and the M.S. degree in electrical engineering from Hanyang
               University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D.
               degree in electrical and computer engineering from University of Illinois at Urbana-Champaign,
               USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development
               (ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and
               Computer Engineering, Ulsan National Institute of Science and Technology (UNIST),
               South Korea, as an assistant professor. From 2019 to 2020, he was with the School
               of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate
               professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research
               interests include stochastic optimal control and filtering, reinforcement learning,
               data-driven control, distributed control, networked control systems, and mean field
               games.