Chen Wang¹, Haotian Zhang¹, Minju Lee², Myoung Hoon Lee††, and Jun Moon†
               
1. Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea
2. KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea
 
            
            
            
            
            
            
            
               
                  
Key words
Deep reinforcement learning, option-critic framework, topology control, smart grid.
            
          
         
            
                  1. Introduction       	
               As renewable energy sources like wind and solar are increasingly integrated into power
                  systems, ensuring efficient and secure power transmission has become more challenging
                  [1,2]. In this context, traditional model-based control and management methods for power
                  systems are beginning to show their limitations. Recently, with the rise of neural
                  networks (NN), deep reinforcement learning (DRL) control methods have gained significant
                  attention [3]. Several studies have explored the use of DRL models for optimizing and controlling
                  the power grid [4-6]. Specifically, [4] investigated a multi-agent residential smart grid and proposed an optimal policy
                  for minimizing economic costs using parametric model predictive control (MPC) under
                  a deterministic policy gradient (DPG)-based reinforcement learning algorithm. [5] applied a model-based and data-driven DRL algorithm to resolve line overload issues
                  in power systems, while [6] introduced an emergency load shedding method based on a deep deterministic policy
                  gradient (DDPG) algorithm to enhance stable power system operations through autonomous
                  voltage control.
               
               As power systems continue to develop and expand, their size and complexity are showing
                  a steady increase [7]. In this context, conventional model-based automatic control methods encounter increasing
                  challenges in meeting the demands of power grid operations. Traditional control approaches
                  primarily focus on regulating generators and loads. However, as the power network
                  expands, these methods may lack flexibility, especially when dealing with the grid's
                  variability, efficient energy integration, and system security concerns [8-10]. In light of these challenges, topology control methods have gained considerable
                  attention. Compared to other control strategies, topology control offers a more cost-effective
                  approach to managing the power grid. This method reconfigures the grid structure by
                  adjusting the connections of power lines and the distribution of buses, which effectively
                  reduces congestion and
               
               enhances power transmission efficiency. A distinct advantage of topology control is
                  its ability to quickly respond to changes in grid topology, helping to lower the risk
                  of system failures and improve the overall stability and robustness of the power system
                  [11,12]. This method not only enhances the flexibility and sustainability of power systems
                  but also offers opportunities to improve system performance. As a result, topology
                  control plays a crucial role in modernizing and optimizing power systems.
               
               
                     1.1 Related works
Classic control methods in power systems, such as MPC and proportional-integral-derivative (PID) control, depend heavily on a detailed dynamic model of the system, which must be captured by an accurate mathematical model for optimal decision making. Building such models in complex or highly dynamic environments is therefore often a major challenge. In contrast, DRL, as a data-driven approach, can circumvent the need for detailed mathematical models and cope with the complexity and uncertainty of power system environments by exploiting large amounts of data and an iterative trial-and-error search for the optimal control policy.
                     In recent years, the rapid advancement of DRL has led to the widespread application
                     of various baseline DRL algorithms in power system control. These algorithms encompass
                     deep Q-network (DQN) [15], proximal policy optimization (PPO) [14], and soft actor-critic (SAC) [16], all of which have demonstrated outstanding performance across various power system
control scenarios. Specifically, in [17], the authors utilized the DQN and double deep Q-network (DDQN) algorithms for scheduling household appliances to determine the optimal energy scheduling policy. [18] introduced a method for the alternating-current optimal power flow problem based on an advanced PPO algorithm, which helps grid operators rapidly derive accurate control policies that ensure system security and economic efficiency. To address uncertainties such as the intermittency of wind energy and load flexibility, [19] applied the SAC algorithm for energy dispatch optimization.
                  
                  Additionally, [2] introduced ‘Grid2Op’, a power system simulation platform designed to address control
                     challenges in power systems using artificial intelligence. ‘Grid2Op’ is an open-source
                     framework compatible with the OpenAI Gym [20], offering a convenient tool for building and controlling power systems with reinforcement
                     learning algorithms. Specifically, ‘Grid2Op’ integrates seamlessly with DRL algorithms
                     to enable effective power system management by intelligently regulating power line
                     switching states or bus distribution. A key advantage of using the ‘Grid2Op’ platform
                     is that it allows experiments with real power system data, making it possible to train
                     and
                  
                  simulate using any DRL algorithm. As a result, ‘Grid2Op’ has become one of the leading
                     simulation platforms for power systems, offering scenario-based simulations that provide
                     higher realism and credibility for experimental results [2,21].
                  
                  Among the studies on power system topology control using DRL in the “Grid2Op” platform,
                     the authors in [12] employed the cross-entropy method (CEM) reinforcement learning algorithm to manage
                     power flow through topology switching actions, analyzing the variability and types
                     of topologies. In [22], the authors applied the dueling double deep Q-network algorithm combined with a
                     prioritized replay mechanism to control topology changes in power systems, achieving
                     notable results. Additionally, [23] proposed a method that combines imitation learning with the SAC deep reinforcement
                     learning algorithm to enable stable autonomous control in the IEEE 118-Bus power system,
                     demonstrating the method's effectiveness and robustness in topology optimization.
                  
                
               
                     1.2 Main Contributions
                  In this paper, an option-critic based DRL method for topology control policy in power
                     systems is proposed, with the main contributions outlined as follows:
                  
                  1. We integrate the option-critic (OC) algorithm with long short-term memory (LSTM)
                     neural networks to capture time-series features in high-dimensional power system environments.
                     The LSTM networks help model these features, while the option-based DRL algorithm
                     decomposes the large and complex action space into executable options, effectively
                     reducing the dimensionality of the action space in the power system.
                  
                  2. We apply the proposed OC-LSTM algorithm to the topology control policy in the power
                     system and compare it with a baseline DRL algorithm. Experimental results demonstrate
                     that the OC-LSTM method enables stable operation of the IEEE 5-Bus, IEEE 14-Bus, and
                     L2RPN WCCI 2020 power system environments for 60 hours, without requiring any human
                     expert intervention.
                  
                  To evaluate the performance of the OC-LSTM, we conducted simulation experiments on
                     the ‘Grid2Op’ platform, using three power system simulation environments of varying
                     scales. We then compared our algorithm with baseline DRL algorithms, namely dueling
                     double deep Q-network (DDDQN) [13] and proximal policy optimization (PPO) [14], to validate the effectiveness of OC-LSTM. Furthermore, to assess the OC-LSTM's effectiveness
                     in addressing the power system topology control problem, we compared the optimal model
of our algorithm against the Do Nothing action. The results demonstrate that OC-LSTM can maintain stable operation for a longer period than the Do Nothing action, without the need for human expert intervention.
                  
                  The paper is organized as follows: In Section 2, we introduce the OC framework and
                     the LSTM network, followed by the proposed OC with LSTM algorithm (OC-LSTM). Section
                     3 describes the implementation of power system topology control simulation experiments
                     and validates the effectiveness of the proposed methodology through simulation results.
                     Finally, we conclude the paper in Section 4.
                  
                
             
            
                  2. Option-Based Reinforcement Learning with Long Short-Term Memory	
In this section, we first introduce the option-critic (OC) framework in detail, followed by the core constituent units of the long short-term memory (LSTM) neural network. Combining the advantages of both, we propose the OC-LSTM algorithm, which couples a hierarchical policy with temporal feature extraction to improve the stability and performance of power system topology control.
               
               
                     2.1 Deep Reinforcement Learning
Deep reinforcement learning (DRL) can be described by a Markov decision process (MDP) formalized as a 5-tuple $<S,\: A,\: P,\: R,\: \gamma >$, where $S$ denotes the state space, $A$ denotes the action space, and $P$ denotes the state transition probability, i.e., the probability of moving to the next state after taking an action in the current state; $R$ is the reward function representing the feedback provided by the environment at each time step; and $\gamma$ is a discount factor that weighs the importance of future rewards.
                  
At each time step $t$, the policy $\pi_{t}$ determines the action $a_{t}$ chosen by the agent in state $s_{t}$. The agent interacts with the environment through the policy $\pi$. After executing an action, the environment updates the state and returns the corresponding reward $r_{t}$. The main goal of DRL is to learn the optimal policy by maximizing the cumulative reward. To this end, the state-value function $V_{\pi}$ and the state-action value function $Q_{\pi}$ are introduced to evaluate policy quality and predict future rewards.
                  
Specifically, the state-value function $V_{\pi}$ represents the expected cumulative reward obtained by starting from state $s$ and following policy $\pi$:

$V_{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\mid s_{t}=s\right]$        (1)
In contrast, the state-action value function $Q_{\pi}$ extends this concept by considering the expected cumulative reward obtained by taking a specific action $a$ in state $s$ and then following the policy $\pi$:

$Q_{\pi}(s,\: a)=\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\mid s_{t}=s,\: a_{t}=a\right]$        (2)
                  The difference is that $V_{\pi}$ evaluates the value of being in a specific state,
                     while $Q_{\pi}$ evaluates the value of taking an action in a specific state.
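For concreteness, the following minimal Python sketch (ours, not from the paper) estimates $V_{\pi}(s_{0})$ by Monte Carlo rollouts; the policy callable pi and the Gymnasium-style environment interface are illustrative assumptions.

    import numpy as np

    def estimate_value(env, pi, gamma=0.99, episodes=100):
        returns = []
        for _ in range(episodes):
            s, _ = env.reset()                      # Gymnasium-style reset -> (obs, info)
            done, g, discount = False, 0.0, 1.0
            while not done:
                a = pi(s)                           # sample a ~ pi(. | s)
                s, r, terminated, truncated, _ = env.step(a)
                g += discount * r                   # accumulate gamma^t * r_t
                discount *= gamma
                done = terminated or truncated
            returns.append(g)
        return float(np.mean(returns))              # Monte Carlo estimate of V_pi(s0)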
                  
                
               
                     2.2 Option-Critic Framework
The OC framework [27] is a policy-gradient-based approach for simultaneously learning the intra-option policies and the option termination functions. The method allows the agent to employ a hierarchical decision structure in task learning and execution, enabling it to make more accurate decisions over longer time horizons and thus adapt better to complex power system environments. A Markov option $\omega\in\Omega$ can be represented as a tuple $(I_{\omega},\: \pi_{\omega},\: \beta_{\omega})$, where $I_{\omega}$ is the initiation set, $\pi_{\omega}$ is the intra-option policy, and $\beta_{\omega}: S\rightarrow[0,\: 1]$ defines the termination probability of the option at each state.
                  
The OC framework selects an option according to the policy over options $\pi_{\Omega}$ and executes the corresponding intra-option policy $\pi_{\omega}$ until the termination condition is met. Once the current option terminates, the next option is selected and the process continues. The intra-option policy $\pi_{\omega ,\: \theta}$ is parameterized by $\theta$, while the termination function $\beta_{\omega ,\: v}$ is parameterized by $v$. The state-option-action value function $Q_{U}(s_{t},\: \omega ,\: a)$ represents the expected cumulative return when option $\omega$ is selected in state $s_{t}$ and action $a$ is executed. It can be expressed as:

$Q_{U}(s_{t},\: \omega ,\: a)=r(s_{t},\: a)+\gamma\sum_{s_{t+1}}P(s_{t+1}\mid s_{t},\: a)U(\omega ,\: s_{t+1})$        (3)
In (3), the value of the state $s_{t+1}$ reached through the option $\omega$ can be expressed as:

$U(\omega ,\: s_{t+1})=(1-\beta_{\omega ,\: v}(s_{t+1}))Q_{\Omega}(s_{t+1},\: \omega)+\beta_{\omega ,\: v}(s_{t+1})V_{\Omega}(s_{t+1})$        (4)
where $Q_{\Omega}(s_{t},\: \omega)=\sum_{a}\pi_{\omega ,\:\theta}(a\mid s_{t})Q_{U}(s_{t},\: \omega ,\: a)$ is the option-value function and $V_{\Omega}(s_{t})=\sum_{\omega}\pi_{\Omega}(\omega\mid s_{t})Q_{\Omega}(s_{t},\: \omega)$ is the option-level state-value function. The option $\omega$ terminates with probability $\beta_{\omega ,\: v}(s_{t+1})$, leading to the selection of a new option, or continues with probability $1-\beta_{\omega ,\: v}(s_{t+1})$.
                  
The intra-option policies and the termination functions can be learned by applying the policy gradient theorem to maximize the expected discounted return. For the initial condition $(s_{0},\: \omega_{0})$, the gradient of the objective with respect to the intra-option policy parameters $\theta$ is:

$\dfrac{\partial Q_{\Omega}(s_{0},\: \omega_{0})}{\partial\theta}=\mathbb{E}_{s,\:\omega}\left[\sum_{a}\dfrac{\partial\pi_{\omega ,\:\theta}(a\mid s_{t})}{\partial\theta}Q_{U}(s_{t},\: \omega ,\: a)\right]$        (5)
Similarly, the gradient with respect to the termination parameters $v$, with initial condition $(s_{1},\: \omega_{0})$, is:

$\dfrac{\partial U(\omega_{0},\: s_{1})}{\partial v}=-\mathbb{E}_{s,\:\omega}\left[\dfrac{\partial\beta_{\omega ,\: v}(s_{t+1})}{\partial v}A_{\Omega}(s_{t+1},\: \omega)\right]$        (6)
where $A_{\Omega}(s,\: \omega)=Q_{\Omega}(s,\: \omega)-V_{\Omega}(s)$ is the advantage of choosing option $\omega$ in state $s$.
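As an illustration of this execution scheme, the sketch below (our simplified Python pseudocode, with pi_Omega, pi_w, and beta_w as assumed callables) runs one episode under the call-and-return option model: an option is sampled from the policy over options, its intra-option policy acts until the termination function fires, and then a new option is chosen.

    import numpy as np

    def run_episode(env, pi_Omega, pi_w, beta_w, seed=0):
        rng = np.random.default_rng(seed)
        s = env.reset()
        omega = pi_Omega(s)                         # sample an option from pi_Omega(. | s)
        done, total_reward = False, 0.0
        while not done:
            a = pi_w(omega, s)                      # intra-option policy pi_{omega, theta}
            s, r, done, info = env.step(a)
            total_reward += r
            if rng.random() < beta_w(omega, s):     # terminate with prob beta_{omega, v}(s)
                omega = pi_Omega(s)                 # otherwise keep executing the same option
        return total_reward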
                  
                  
                  
                  						
               
 
               
2.3 Long Short-Term Memory
Long short-term memory (LSTM) networks [25] are a variant of recurrent neural networks (RNNs) [26] that effectively capture feature relationships in time-series data while addressing the vanishing- and exploding-gradient problems that RNNs encounter when processing long sequences. One of the main challenges in applying DRL to power system control is finding the optimal control policy within the vast action and state spaces. To overcome this challenge, we introduce LSTM networks, which can capture time-related information relevant to the target task in high-dimensional state spaces and prevent issues such as vanishing gradients during training.
                  
                  In the LSTM structure, the forget gate $f_{t}$, input gate $i_{t}$, input node $g_{t}$,
                     and output gate $o_{t}$ are defined at time $t$, with weights and bias values denoted
                     by $W$ and $b$, respectively. The sigmoid activation function and the tanh activation
                     function are represented by $\sigma$ and $\phi$. The following describes the input-output
                     mapping relationships of each node, with the specific workflow of the LSTM illustrated
                     in Fig. 1.
                  
                  
                        
                        
Fig. 1. LSTM network structure.
                      
First, the forget gate $f_{t}$ uses the sigmoid activation function to determine what information to discard from the input $x_{t}$ and the previous output $h_{t-1}$:

$f_{t}=\sigma(W_{f}[h_{t-1},\: x_{t}]+b_{f})$        (7)
Next, the input gate $i_{t}$ likewise takes the input $x_{t}$ and the intermediate output $h_{t-1}$ from the previous time step and applies the sigmoid activation function to decide what information to store. It is combined with the input node $g_{t}$ to determine what to retain:

$i_{t}=\sigma(W_{i}[h_{t-1},\: x_{t}]+b_{i}),\quad g_{t}=\phi(W_{g}[h_{t-1},\: x_{t}]+b_{g})$        (8)
Subsequently, the internal state $s_{t-1}$ is updated to the new state $s_{t}$ using the input gate $i_{t}$ and the forget gate $f_{t}$:

$s_{t}=f_{t}\odot s_{t-1}+i_{t}\odot g_{t}$        (9)
Finally, the input $x_{t}$ and the intermediate output $h_{t-1}$ serve as inputs to the output gate $o_{t}$, which is processed through a sigmoid activation function, and the final output $h_{t}$ is generated from the internal state $s_{t}$:

$o_{t}=\sigma(W_{o}[h_{t-1},\: x_{t}]+b_{o}),\quad h_{t}=o_{t}\odot\phi(s_{t})$        (10)
Through the cooperation of these gates, the LSTM stores information in its internal state and thus effectively maintains long-range dependencies in the input sequence.
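A minimal NumPy sketch of one LSTM step following the gate equations above is given below; the parameter dictionaries W and b and the concatenated-input form are our assumptions for illustration.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, s_prev, W, b):
        z = np.concatenate([h_prev, x_t])           # joint input [h_{t-1}, x_t]
        f_t = sigmoid(W["f"] @ z + b["f"])          # forget gate
        i_t = sigmoid(W["i"] @ z + b["i"])          # input gate
        g_t = np.tanh(W["g"] @ z + b["g"])          # input node (phi = tanh)
        o_t = sigmoid(W["o"] @ z + b["o"])          # output gate
        s_t = f_t * s_prev + i_t * g_t              # internal state update
        h_t = o_t * np.tanh(s_t)                    # final output
        return h_t, s_t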
                  
                
               
                     2.4 OC-LSTM : Option-Critic with Long Short-Term Memory
                  The OC-LSTM algorithm aims to solve the topology control problem in power systems
                     by combining the LSTM and Option-Critic frameworks, and its algorithm structure is
                     shown in Fig. 2.
                  
                  First, OC-LSTM learns and processes state inputs in the power system through LSTM
                     networks. LSTM can capture a large amount of key state information in the power system
                     that contains temporal features, such as power generation, load profiles, and changes
                     in the grid topology, to help the agent better understand the dynamic features of
                     the system.
                  
                  
                        
                        
Fig. 2. The overall OC-LSTM framework.
                      
Next, the extracted temporal state information is fed into the OC framework, where the policy over options selects an appropriate option based on the current state.
                  
                  Then, the intra-option policy determines the specific action to be executed and uses
                     the termination function to decide whether to continue with the current option or
                     terminate it and select a new option based on the policy over options.
                  
In this process, the critic evaluates the performance of the intra-option policies and the termination functions and updates them via the policy gradients to optimize the behavior of the agent.
                  
                   By combining the ability of LSTM to extract temporal state features with the OC hierarchical
                     policy framework, the OC-LSTM algorithm effectively copes with the challenges posed
                     by the increase in state space and action dimensions in power system topology control
                     problems. LSTM is able to capture the dynamic changes of key state information in
                     the power system, thus providing rich temporal feature information for decision making.
                     Meanwhile, the OC framework introduces a hierarchical decision structure, which enables
                     the agent to select and execute control policies more efficiently in the face of complex
                     action spaces. This combination enhances the stability of the agent in changing environments.
                  
                
             
            
                  3. Topology Control Problem and Simulation	
               Section 3 defines the power system topology control problem based on the Markov decision
                  process and details the simulation environment used in the experiments. Simulation
                  results verify the proposed method's performance and stability in power system topology
                  control.
               
               
                     3.1 Simulation environments
                  We will conduct simulations in three environments: IEEE 5-Bus, IEEE 14-Bus, and L2RPN
                     WCCI 2020, as shown in Fig. 3. These power system networks consist of generators (such as thermal power plants
                     and renewable energy facilities), loads (such as households and factories), and power
                     lines. Generators and loads are connected through substations, and different substations
                     are interconnected by power lines.
                  
Fig. 3(a) shows the structure of the IEEE 5-Bus simulation environment, which includes 8 power lines, 3 loads, 2 generators, and 5 substations, with 20 scenarios. Each scenario uses a 5-minute time interval, allowing a maximum of 288 survival steps per day.
                  
                  
                        
                        
Fig. 3. Power system environments.
                      
                  Fig. 3(b) presents the IEEE 14-Bus simulation environment, which includes 20 power lines, 11
                     loads, 6 generators, and 14 substations. The IEEE 14-Bus simulation environment contains
                     1004 scenarios, each simulating the internal changes of the French power grid over
                     28 consecutive days. Similar to the IEEE 5-Bus environment, this simulation environment
                     allows a maximum of 288 survival steps per day. Each step records information on grid
                     topology and generator output, with details provided in the observation space section.
                     Consecutive steps capture the dynamic changes within the power system.
                  
                  Fig. 3(c) illustrates the L2RPN WCCI 2020 environment, which is based on part of the IEEE 118-Bus
                     network and includes 59 power lines, 37 loads, 22 generators, and 36 substations.
                     The simulation dataset for the L2RPN WCCI 2020 environment covers 240 years of time
                     data, with the same number of survival steps in its simulation setup as the two aforementioned
                     environments.
                  
                  All the above power system simulations are modeled on the “Grid2Op” platform to ensure
                     standardized and consistent experiments [2,28].
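As a minimal sketch of this setup, the three environments can be loaded with Grid2Op as shown below; the dataset identifiers are the public Grid2Op names we assume correspond to the grids in Fig. 3.

    import grid2op

    # Assumed Grid2Op dataset names for the three grids in Fig. 3.
    env_5bus = grid2op.make("rte_case5_example")      # IEEE 5-Bus
    env_14bus = grid2op.make("l2rpn_case14_sandbox")  # IEEE 14-Bus
    env_wcci = grid2op.make("l2rpn_wcci_2020")        # L2RPN WCCI 2020

    obs = env_14bus.reset()                           # initial observation of one scenario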
                  
                
               
3.2 Defining topology control as an MDP
Observation space: We set up a unified observation space for the power system environment, including v_or, v_ex, a_or, a_ex, ρ, and line_status, where v_or and v_ex denote the voltage values on the buses connected to the origin and extremity of each power line, a_or and a_ex denote the corresponding currents, ρ denotes the thermal capacity utilization of each power line, and line_status denotes the connection status of each power line. This design enables the agent to fully capture the critical dynamics of the power flows, providing an accurate decision basis for the optimal topology control problem. The observation space dimensions of the three power system environments are shown in Table 1.
                  
                  
                        
                        
Table 1 Dimensions of the observation and action spaces in the three power system environments.

|                 | Observation Space | Action Space |
| IEEE 5-Bus      | 69                | 9            |
| IEEE 14-Bus     | 177               | 21           |
| L2RPN WCCI 2020 | 531               | 60           |
                   
Action space: We define the action space of the power system environment as change_line_status, which changes the status of a power line (connected/disconnected); this action is of the discrete type. This design is consistent with the goal of this study, namely achieving stable power system operation through topology control of the power lines. The action space consists of $A=(A_{\text{DoNothing}},\: A_{\text{line}})$, where $A_{\text{line}}$ denotes the actions that connect or disconnect individual power lines and $A_{\text{DoNothing}}$ denotes the action that performs no operation. The action space dimensions in the three power system simulation environments are shown in Table 1.
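A sketch of how this discrete action set can be enumerated with Grid2Op's dictionary-style action constructor is given below (illustrative only); one Do Nothing action plus one action per line yields 9, 21, and 60 actions for the three environments, matching Table 1.

    def build_actions(env):
        actions = [env.action_space({})]             # the Do Nothing action
        for line_id in range(env.n_line):            # one change_line_status action per power line
            actions.append(env.action_space({"change_line_status": [line_id]}))
        return actions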
                  
Reward function: We define the reward function for the power system environment (Equation (12)), which consists of two parts: feedback on the agent's behavior and the utilization
                     rate of line capacity. Specifically, if the agent's policy causes the power system
                     environment to terminate prematurely or if illegal actions are taken, the agent will
                     receive a minimum reward of 0. On the other hand, the reward calculation for line
                     capacity utilization is shown in Equation (11), where $\rho_{i}$ denotes the utilization rate of each power line   ($\rho_{i}\in(0,\:
                     1)$) and $line_{i}$ denotes the connection status of each power line. This equation
                     reflects the capacity utilization of the power lines. When the utilization of all
                     lines is close to the maximum capacity, the reward value is close to 0. Conversely,
                     the lower the line utilization, the higher the reward value. This reward function
                     is designed to allow the agent to avoid overloading the power lines, thus maintaining
                     the stability of the power system.
                  
                  
                  
where $x\in(0,\: 1)$, $A^{illegal\; or\; ambiguous}$ represents the set of illegal or ambiguous actions as defined by the Grid2Op simulation platform, and $S^{error}$ represents premature termination of the environment.
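As a heavily hedged illustration of the reward shape described above (our approximation, not the paper's exact Equations (11)-(12)): illegal or ambiguous actions and premature termination receive the minimum reward 0, and otherwise the reward increases as the utilization rates $\rho_{i}$ of the connected lines decrease.

    import numpy as np

    def reward(obs, action_is_illegal, env_failed):
        if action_is_illegal or env_failed:          # A^{illegal or ambiguous} or S^{error}
            return 0.0
        rho = np.clip(obs.rho, 0.0, 1.0)             # per-line utilization rho_i in (0, 1)
        connected = obs.line_status                  # line_i: connection status of each line
        # Assumed form: average spare capacity over connected lines, giving a value in (0, 1).
        return float(np.mean(1.0 - rho[connected]))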
                  
                  To ensure that the agent learns to control the power grid safely and effectively,
                     it operates within a strictly constrained environment to maintain the grid's normal
                     operation. These constraints correspond to real-world operational limitations, and
                     violations will result in the termination of the environment. The constraints include:
                  
                  1. Disconnecting too many power lines, causing an inability to meet load demand.
                  2. Interrupting the connection between generators and the grid.
                  3. Creating grid islands due to topological changes.
                
               
                     3.3 Simulation results
                  Fig. 4 presents the training results across three different power system environments, where
                     we compare the OC-LSTM algorithm with two typical DRL baseline algorithms: PPO and
                     DDDQN. DDDQN is an improved version of the DQN algorithm, which significantly outperforms
                     the original DQN in terms of performance. On the other hand, PPO is a widely used
                     policy optimization algorithm that maintains training stability and sampling efficiency
                     while gradually converging to the optimal policy. Both baseline algorithms are suitable
                     for handling high-dimensional  state spaces and discrete action spaces. To ensure
                     the robustness of the results, each algorithm
                  
                  
                        
                        
Fig. 4. Training results of OC-LSTM (ours) and baseline algorithms in (a) IEEE 5-Bus,
                           (b) IEEE 14-Bus and (c) L2RPN WCCI 2020 environments. 
                        
                      
was trained three times. In the training process, we use the survival steps as the
                     evaluation criterion, as this metric can intuitively reflect the agent's control capability.
                     In the simulation, the maximum operation time of the power system is set to 4 days,
                     with an environment sampling interval of 5 minutes. Therefore, the maximum survival
steps per episode is 1152. A larger number of survival steps indicates that the agent can maintain stable operation of the power system for a longer time through topology control, reflecting stronger topology control performance. Conversely, a poor topology control policy will struggle to maintain stable operation of the power system, resulting in fewer survival steps.
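The evaluation protocol can be summarized by the following sketch (our pseudocode), which counts the survival steps of one episode under the 4-day cap of 1152 steps.

    def evaluate(env, agent, max_steps=1152):
        obs = env.reset()
        steps, done = 0, False
        while not done and steps < max_steps:
            act = agent.act(obs)                     # a topology action or the Do Nothing action
            obs, reward, done, info = env.step(act)  # each Grid2Op step covers 5 minutes
            steps += 1
        return steps                                 # survival steps for this scenario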
                  
                  
                        3.3.1 Training
As shown in Fig. 4(a), the OC-LSTM algorithm and the PPO algorithm achieve similar numbers of survival steps in the IEEE 5-Bus environment. This indicates that both algorithms are effective in maintaining grid stability in topology control tasks for small-scale grids. In contrast, the DDDQN algorithm performs poorly in this setting, possibly because its more limited decision making in topology control tends to fall into local optima.
                     
Fig. 4(b) shows the results in the IEEE 14-Bus environment. In this medium-scale power system, the OC-LSTM algorithm significantly outperforms PPO and DDDQN and maintains a stable number of survival steps, reaching more than 720 steps, equivalent to about 60 hours of operation time. This further validates that the OC-LSTM algorithm can maintain power system stability even as the state and action spaces of the environment grow.
                     
Fig. 4(c) shows the training results in the L2RPN WCCI 2020 environment. Even in this large-scale grid, the OC-LSTM algorithm demonstrates better stability, significantly outperforming PPO and DDDQN, and maintains a large number of survival steps. This demonstrates that the OC-LSTM algorithm can keep the grid operating stably by effectively controlling its topology even in a more complex power system.
                     
                   
                  
                        3.3.2 Testing
                     The comprehensive analysis of the experimental results in small-scale (IEEE 5-Bus),
                        medium-scale (IEEE 14-Bus), and large-scale (L2RPN WCCI 2020) environments shows that
                        the OC-LSTM algorithm still maintains a good performance with the increase of the
                        state space and action space of the power system. Compared with the baseline algorithms
PPO and DDDQN, the OC-LSTM shows superior stability and control capability in different
                        scales of grid topology control tasks. This further demonstrates the effectiveness
                        of introducing a hierarchical policy architecture. Table 2 presents the mean and standard deviation of the survival steps for the agent in three
                        different power system environments after the algorithm converges. Each survival step
                        corresponds to 5 minutes, with a total of 288 survival steps representing one day.
                        These data reflect the agent's ability to maintain stable performance in three different
                        scales of power systems.
                     
                     
                           
                           
Table 2 Training results of the agents in the three power system environments. The data are the mean and standard deviation of survival steps after convergence. Each step represents 5 minutes, with 288 steps in a day.

|                 | OC-LSTM (unit: step) | PPO (unit: step) | DDDQN (unit: step) |
| IEEE 5-Bus      | 894 ± 43             | 803 ± 71         | 224 ± 23           |
| IEEE 14-Bus     | 728 ± 98             | 378 ± 60         | 170 ± 59           |
| L2RPN WCCI 2020 | 825 ± 61             | 77 ± 12          | 265 ± 28           |
                      
To further validate the improved feature extraction capability of the OC algorithm combined with LSTM in power systems, we performed ablation experiments in the L2RPN WCCI 2020 environment. Fig. 5 compares the OC-LSTM algorithm with the standard OC algorithm. The results show that the OC-LSTM algorithm achieves a significantly higher number of survival steps, even though it converges slightly more slowly than the OC algorithm using a linear layer. This result demonstrates the advantage of LSTM in extracting temporal features of the power system, which further enhances the control ability of the agent in complex grid environments.
                     
We randomly selected 5 scenarios for evaluation in the three power system environments of different scales, using the optimal model and comparing the OC-LSTM algorithm with the Do Nothing action. The Do Nothing action takes no action in the power system at any step and thus shows the maximum survival steps of the power system without any intervention. Comparing against the Do Nothing action tests the effectiveness of the topology control learned by the model.
                     
                     
                           
                           
Fig. 5. Ablation experiment: L2RPN WCCI 2020 environment.
                         
                     As shown in Fig. 6(a), in the IEEE 5-bus environment, the OC-LSTM algorithm consistently enables the power
                        system to operate stably for over 60 hours across all scenarios.
                     
Fig. 6(b) presents the evaluation results in the IEEE 14-Bus environment. The OC-LSTM algorithm significantly outperforms the Do Nothing action in all scenarios, helping the system operate for more than 60 hours.

Similarly, in the L2RPN WCCI 2020 environment (as shown in Fig. 6(c)), the OC-LSTM algorithm performs excellently, maintaining stable operation of the large power system for over 60 hours and clearly outperforming the Do Nothing action.
                     
                     
                           
                           
Fig. 6. Testing results of OC-LSTM algorithm.
                         
                   
                
             
            
                  4. Conclusion	
               This research proposes a novel OC-LSTM algorithm for topology control problems in
                  power systems. The algorithm achieves stable control for up to 60 hours in power system
                  environments such as IEEE 5-Bus, IEEE 14-Bus, and L2RPN WCCI 2020, and all operations
are performed without the intervention of human experts. Compared with the baseline deep reinforcement learning (DRL) algorithms, the OC-LSTM algorithm is better suited to power system applications with high-dimensional state and action spaces because it introduces a hierarchical policy and extracts temporal features with the LSTM. The LSTM efficiently captures the dynamics of the power system and extracts key temporal features, which enables the agent to better comprehend the environment and make accurate decisions. This combination not only enhances the performance of OC-LSTM in complex power system environments but also broadens its applicability to real power systems.
               
               Future research can focus on further optimizing the intra-option policy in the OC-LSTM
                  algorithm to enhance its control efficiency in environments where large-scale renewable
                  energy and power storage devices are deployed. This will help to increase the value
                  of OC-LSTM applications in real power systems and provide stronger support for sustainable
                  energy management.
               
               
               Acknowledgements
               
This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in part by the Institute of Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).
               
               
               			
            
 
          
         
            
                  			
               
             
            
                  
                     References
                  
                     
                        
                        Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications:
                           An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225,
                           2019. DOI:10.17775/CSEEJPES.2019.00920

 
                     
                        
                        A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad,
                           I. Guyon, P. Panciatici and C. Romero, “Learning to run a power network challenge:
                           a retrospective analysis,” in NeurIPS 2020 Competition and Demonstration Track. PMLR,
                           pp. 112–132, 2021. https://proceedings.mlr.press/v133/marot21a.html

 
                     
                        
                        D. Ernst, M. Glavic and L. Wehenkel, “Power systems stability control: reinforcement
                           learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435,
                           2004. DOI:10.1109/TPWRS.2003.821457

 
                     
                        
                        W. Cai, H. N. Esfahani, A. B. Kordabad and S. Gros, “Optimal management of the peak
                           power penalty for smart grids using mpc-based reinforcement learning,” in 2021 60th
                           IEEE Conference on Decision and Control (CDC), pp. 6365–6370, 2021. DOI:10.1109/CDC45484.2021.9683333

 
                     
                        
                        M. Kamel, R. Dai, Y. Wang, F. Li and G. Liu, “Data-driven and model-based hybrid reinforcement
                           learning to reduce stress on power systems branches,” CSEE Journal of Power and Energy
                           Systems, vol. 7, no. 3, pp. 433–442, 2021. DOI:10.17775/CSEEJPES.2020.04570

 
                     
                        
                        J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency
                           state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems,
                           vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120

 
                     
                        
                        H. Yousuf, A. Y. Zainal, M. Alshurideh and S. A. Salloum, “Artificial intelligence
                           models in power system analysis,” in Artificial Intelligence for Sustainable Development:
                           Theory, Practice and Future Applications. Springer, vol. 912, pp. 231–242, 2020. DOI:10.1007/978-3-030-51920-9_12

 
                     
                        
                        C. Zhao, U. Topcu, N. Li and S. Low, “Design and stability of load-side primary frequency
                           control in power systems,” IEEE Transactions on Automatic Control, vol. 59, no. 5,
                           pp. 1177–1189, 2014. DOI:10.1109/TAC.2014.2298140

 
                     
                        
                        Y. Zhang, X. Shi, H. Zhang, Y. Cao and V. Terzija, “Review on deep learning applications
                           in frequency analysis and control of modern power system,” International Journal of
                           Electrical Power & Energy Systems, vol. 136, no. 107744, pp. 1–18, 2022. DOI:10.1016/j.ijepes.2021.107744

 
                     
                        
                        A. K. Ozcanli, F. Yaprakdal and M. Baysal, “Deep learning methods and applications
                           for electrical power systems: A comprehensive review,” International Journal of Energy
                           Research, vol. 44, no. 9, pp. 7136–7157, 2020. DOI:10.1002/er.5331

 
                     
                        
                        D. Yoon, S. Hong, B. J. Lee and K. E. Kim, “Winning the l2rpn challenge: Power grid
                           management via semi-markov afterstate actor-critic,” in 9th International Conference
                           on Learning Representations, ICLR 2021, pp. 1–12, 2021. https://openreview.net/forum?id=LmUJqB1Cz8

 
                     
                        
                        M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid
                           topology reconfiguration using a simple deep reinforcement learning approach,” in
                           2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879

 
                     
                        
I. Damjanović, I. Pavić, M. Brčić and R. Jerčić, “High performance computing
                           reinforcement learning framework for power system control,” in 2023 IEEE Power & Energy
                           Society Innovative Smart Grid Technologies Conference (ISGT). IEEE, pp. 1–5, 2023.
                           DOI:10.1109/ISGT51731.2023.10066416

 
                     
                        
                        J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization
                           algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347

 
                     
                        
                        V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
                           M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through
                           deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236

 
                     
                        
                        T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, “Soft actor-critic: Off-policy maximum
                           entropy deep reinforcement learning with a stochastic actor,” in International Conference
on Machine Learning. PMLR, pp. 1861–1870, 2018.

 
                     
                        
                        Y. Liu, D. Zhang and H. B. Gooi, “Optimization strategy based on deep reinforcement
                           learning for home energy management,” CSEE Journal of Power and Energy Systems, vol.
                           6, no. 3, pp. 572–582, 2020. DOI:10.17775/CSEEJPES.2019.02890

 
                     
                        
                        Y. Zhou, B. Zhang, C. Xu, T. Lan, R. Diao, D. Shi, Z. Wang and W. -J. Lee, “A data-driven
                           method for fast ac optimal power flow solutions via deep reinforcement learning,”
                           Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128–1139, 2020.
                           DOI:10.35833/MPCE.2020.000522

 
                     
                        
                        B. Zhang, W. Hu, D. Cao, T. Li, Z. Zhang, Z. Chen and F. Blaabjerg, “Soft actor-critic-based
                           multi-objective optimized energy conversion and management strategy for integrated
                           energy systems with renewable energy,” Energy Conversion and Management, vol. 243,
                           no. 114381, pp. 1–15, 2021. DOI:10.1016/j.enconman.2021.114381

 
                     
                        
                        G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba,
                           “Openai gym,” arXiv preprint arXiv:1606.01540, 2016. DOI:10.48550/arXiv.1606.01540

 
                     
                        
                        M. Lehna, J. Viebahn, A. Marot, S. Tomforde and C. Scholz, “Managing power grids through
                           topology actions: A comparative study between advanced rule-based and reinforcement
                           learning agents,” Energy and AI, vol. 14, no. 100276, pp. 1–11, 2023. DOI:10.1016/j.egyai.2023.100276

 
                     
                        
I. Damjanović, I. Pavić, M. Puljiz and M. Brčić, “Deep reinforcement learning-based
                           approach for autonomous power flow control using only topology changes,” Energies,
                           vol. 15, no. 19, pp. 1–16, 2022. DOI:10.3390/en15196920

 
                     
                        
                        X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous
                           control approach for power system topology optimization,” in 2022 41st Chinese Control
                           Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073

 
                     
                        
                        V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller,
                           “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602,
                           2013.

 
                     
                        
K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber, “LSTM:
                           A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems,
                           vol. 28, no. 10, pp. 2222–2232, 2016. DOI:10.1109/TNNLS.2016.2582924

 
                     
                        
                        Z. C. Lipton, J. Berkowitz and C. Elkan, “A critical review of recurrent neural networks
                           for sequence learning,” arXiv preprint arXiv:1506.00019, 2015. DOI:10.48550/arXiv.1506.00019

 
                     
                        
                        P. -L. Bacon, J. Harb and D. Precup, “The option-critic architecture,” in Proceedings
                           of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, pp. 1726–1734,
                           2017. DOI:10.1609/aaai.v31i1.10916

 
                     
                        
B. Donnot, “Grid2Op - A testbed platform to model sequential decision making in power systems,” 2020. https://GitHub.com/rte-france/grid2op

 
                   
                
             
About the Authors
            
            Chen Wang received the B.S. degree in electronics and computer engineering and the
               M.S. degree in electronic computer engineering from Chonnam National University, South
               Korea, in 2020 and 2022. He is currently pursuing the Ph.D. degree in electrical engineering
               at Hanyang University, Seoul, South Korea. His research interests include smart grid,
               deep reinforcement learning, and their applications.
            
            
            Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University
               of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea,
               in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang
               University, Seoul, South Korea. His research interests include optimal control, smart
               grid, deep reinforcement learning, and their applications.
            
            
            Minju Lee received the B.S. degree in climate and energy systems engineering from
               Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing
               the degree with the Department of Climate and Energy Systems Engineering. Her research
               interests include short-term wind power forecasting and the probabilistic estimation
               of transmission congestion for grid integration.
            
            
            Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook
               National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical
               engineering from the Ulsan National Institute of Science and Technology, Ulsan, South
               Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the
               Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul,
               South Korea. He is currently an Assistant Professor with the Department of Electrical
               Engineering, Incheon National University, Incheon, South Korea. His research interests
               include decentralized optimal control, mean field games, deep reinforcement learning,
               and their applications.
            
            
            Jun Moon is currently an Associate Professor in the Department of Electrical Engineering
               at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical
               and computer engineering, and the M.S. degree in electrical engineering from Hanyang
               University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D.
               degree in electrical and computer engineering from University of Illinois at Urbana-Champaign,
               USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development
               (ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and
               Computer Engineering, Ulsan National Institute of Science and Technology (UNIST),
               South Korea, as an assistant professor. From 2019 to 2020, he was with the School
               of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate
               professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research
               interests include stochastic optimal control and filtering, reinforcement learning,
               data-driven control, distributed control, networked control systems, and mean field
               games.