
  1. (Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea.)
  2. (KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea.)



Keywords: Deep reinforcement learning, option-critic framework, topology control, smart grid.

1. Introduction

As renewable energy sources like wind and solar are increasingly integrated into power systems, ensuring efficient and secure power transmission has become more challenging [1,2]. In this context, traditional model-based control and management methods for power systems are beginning to show their limitations. Recently, with the rise of neural networks (NN), deep reinforcement learning (DRL) control methods have gained significant attention [3]. Several studies have explored the use of DRL models for optimizing and controlling the power grid [4-6]. Specifically, [4] investigated a multi-agent residential smart grid and proposed an optimal policy for minimizing economic costs using parametric model predictive control (MPC) under a deterministic policy gradient (DPG)-based reinforcement learning algorithm. [5] applied a model-based and data-driven DRL algorithm to resolve line overload issues in power systems, while [6] introduced an emergency load shedding method based on a deep deterministic policy gradient (DDPG) algorithm to enhance stable power system operations through autonomous voltage control.

As power systems continue to develop and expand, their size and complexity increase steadily [7]. In this context, conventional model-based automatic control methods encounter increasing challenges in meeting the demands of power grid operations. Traditional control approaches primarily focus on regulating generators and loads. However, as the power network expands, these methods may lack flexibility, especially when dealing with the grid's variability, efficient energy integration, and system security concerns [8-10]. In light of these challenges, topology control methods have gained considerable attention. Compared to other control strategies, topology control offers a more cost-effective approach to managing the power grid. This method reconfigures the grid structure by adjusting the connections of power lines and the distribution of buses, which effectively reduces congestion and enhances power transmission efficiency. A distinct advantage of topology control is its ability to quickly respond to changes in grid topology, helping to lower the risk of system failures and improve the overall stability and robustness of the power system [11,12]. This method not only enhances the flexibility and sustainability of power systems but also offers opportunities to improve system performance. As a result, topology control plays a crucial role in modernizing and optimizing power systems.

1.1 Related works

Classic control methods in power systems, such as MPC and proportional integral derivative (PID) control, depend heavily on a detailed dynamic model of the system, which must be captured by an accurate mathematical model for optimal decision making. Building such models in complex or highly dynamic environments is therefore often a major challenge. In contrast, DRL, as a data-driven approach, can circumvent the need for detailed mathematical models and cope with the complexity and uncertainty of power system environments by exploiting large amounts of data and an iterative trial-and-error search for the optimal control policy. In recent years, the rapid advancement of DRL has led to the widespread application of various baseline DRL algorithms in power system control. These algorithms include deep Q-network (DQN) [15], proximal policy optimization (PPO) [14], and soft actor-critic (SAC) [16], all of which have demonstrated outstanding performance across various power system control scenarios. Specifically, in [17], the authors utilized the DQN and double deep Q-network (DDQN) algorithms for scheduling household appliances to determine the optimal energy scheduling policy. [18] introduced a novel method for addressing the alternating current optimal power flow problem by employing an advanced PPO algorithm, which aids grid operators in rapidly developing accurate control policies to ensure system security and economic efficiency. To address uncertainties such as the intermittency of wind energy and load flexibility, [19] applies the SAC algorithm for energy dispatch optimization.

Additionally, [2] introduced ‘Grid2Op’, a power system simulation platform designed to address control challenges in power systems using artificial intelligence. ‘Grid2Op’ is an open-source framework compatible with the OpenAI Gym [20], offering a convenient tool for building and controlling power systems with reinforcement learning algorithms. Specifically, ‘Grid2Op’ integrates seamlessly with DRL algorithms to enable effective power system management by intelligently regulating power line switching states or bus distribution. A key advantage of using the ‘Grid2Op’ platform is that it allows experiments with real power system data, making it possible to train and simulate using any DRL algorithm. As a result, ‘Grid2Op’ has become one of the leading simulation platforms for power systems, offering scenario-based simulations that provide higher realism and credibility for experimental results [2,21].

Among the studies on power system topology control using DRL in the “Grid2Op” platform, the authors in [12] employed the cross-entropy method (CEM) reinforcement learning algorithm to manage power flow through topology switching actions, analyzing the variability and types of topologies. In [22], the authors applied the dueling double deep Q-network algorithm combined with a prioritized replay mechanism to control topology changes in power systems, achieving notable results. Additionally, [23] proposed a method that combines imitation learning with the SAC deep reinforcement learning algorithm to enable stable autonomous control in the IEEE 118-Bus power system, demonstrating the method's effectiveness and robustness in topology optimization.

1.2 Main Contributions

In this paper, an option-critic based DRL method for topology control policy in power systems is proposed, with the main contributions outlined as follows:

1. We integrate the option-critic (OC) algorithm with long short-term memory (LSTM) neural networks to capture time-series features in high-dimensional power system environments. The LSTM networks help model these features, while the option-based DRL algorithm decomposes the large and complex action space into executable options, effectively reducing the dimensionality of the action space in the power system.

2. We apply the proposed OC-LSTM algorithm to the topology control policy in the power system and compare it with a baseline DRL algorithm. Experimental results demonstrate that the OC-LSTM method enables stable operation of the IEEE 5-Bus, IEEE 14-Bus, and L2RPN WCCI 2020 power system environments for 60 hours, without requiring any human expert intervention.

To evaluate the performance of the OC-LSTM, we conducted simulation experiments on the ‘Grid2Op’ platform, using three power system simulation environments of varying scales. We then compared our algorithm with baseline DRL algorithms, namely dueling double deep Q-network (DDDQN) [13] and proximal policy optimization (PPO) [14], to validate the effectiveness of OC-LSTM. Furthermore, to assess the OC-LSTM's effectiveness in addressing the power system topology control problem, we compared the best-performing model of our algorithm against the Do-Nothing action. The results demonstrate that OC-LSTM can maintain stable operation for a longer period than the Do-Nothing action, without the need for human expert intervention.

The paper is organized as follows: In Section 2, we introduce the OC framework and the LSTM network, followed by the proposed OC with LSTM algorithm (OC-LSTM). Section 3 describes the implementation of power system topology control simulation experiments and validates the effectiveness of the proposed methodology through simulation results. Finally, we conclude the paper in Section 4.

2. Option-Based Reinforcement Learning with Long Short-Term Memory

In this section, we introduce the Option-Critic (OC) framework in detail, followed by the core constituent units of the Long Short-Term Memory (LSTM) neural network. Building on the advantages of both, we propose the OC-LSTM algorithm, which combines a hierarchical policy with temporal feature extraction to effectively improve the stability and performance of power system topology control.

2.1 Deep Reinforcement Learning

Deep Reinforcement Learning (DRL) can be described as a Markov Decision Process (MDP) formalized as a 5-tuple $\langle S,\: A,\: P,\: R,\: \gamma\rangle$, where $S$ denotes the state space, $A$ denotes the action space, and $P$ denotes the state transition probability, defined as the probability of moving to the next state after taking an action in the current state; $R$ is a reward function representing the feedback provided by the environment at each time step; $\gamma$ is a discount factor that measures the importance of future rewards.

At each time step $t$, the policy function $\pi_{t}$ determines the action $a_{t}$ chosen by the agent in state $s_{t}$. The agent interacts with the environment through the policy $\pi$. After executing an action, the environment updates the state and provides the corresponding reward feedback $r_{t}$. The main goal of DRL is to learn the optimal policy function by maximizing the cumulative reward. For this, the state-value function $V_{\pi}$ and the state-action-value function $Q_{\pi}$ are introduced to evaluate the policy quality and predict future rewards.

Specifically, the state-value function $V_{\pi}$ represents the expected cumulative reward starting from state $s$ and following policy $\pi$ :

(1)
$V_{\pi}(s)= E_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}r_{t}| s_{0}=s].$

In contrast, the state-action-value function $Q_{\pi}$ extends this concept by considering the expected cumulative reward from taking a specific action $a$ at state $s$ and then following the policy $\pi$.

(2)
$Q_{\pi}(s,\: a)= E_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}r_{t}| s_{0}=s,\: a_{0}=a].$

The difference is that $V_{\pi}$ evaluates the value of being in a specific state, while $Q_{\pi}$ evaluates the value of taking an action in a specific state.
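
As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch (not part of the original work; the reward traces are hypothetical) estimates a value function by Monte-Carlo averaging of discounted returns:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode, as in Eqs. (1)-(2)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Hypothetical reward traces of episodes that all start in the same state s0
# under a fixed policy pi; averaging their returns estimates V_pi(s0).
episodes = [[1.0, 0.5, 0.0, 0.2], [0.8, 0.2, 0.1, 0.4], [0.9, 0.3, 0.0, 0.1]]
v_s0 = np.mean([discounted_return(ep) for ep in episodes])
print(f"Monte-Carlo estimate of V_pi(s0): {v_s0:.3f}")
```

Conditioning the averaged episodes on a fixed first action $a_{0}=a$ yields an estimate of $Q_{\pi}(s,\: a)$ in the same way.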

2.2 Option-Critic Framework

The OC framework [27] is a policy gradient-based approach for simultaneously learning the intra-option policies and the option termination functions. The method allows the agent to employ a hierarchical decision structure in task learning and execution, enabling it to make more accurate decisions over longer time horizons and thus adapt better to complex power system environments. A Markov option $\omega\in\Omega$ can be represented as a tuple $(I_{\omega},\: \pi_{\omega},\: \beta_{\omega})$, where $I_{\omega}$ is the initiation set, $\pi_{\omega}$ is the intra-option policy, and $\beta_{\omega}:S\rightarrow[0,\: 1]$ defines the termination probability of the option at each state.

The OC framework selects an option according to the policy over options $\pi_{\Omega}$ and executes the corresponding intra-option policy $\pi_{\omega}$ until the termination condition is met. Once the current option terminates, the system selects the next option and continues the process. The intra-option policy $\pi_{\omega,\:\theta}$ is parameterized by $\theta$, while the termination function $\beta_{\omega,\: v}$ is parameterized by $v$. The state-option-action value function $Q_{U}(s_{t},\:\omega,\: a)$ represents the expected cumulative return when option $\omega$ is selected in state $s_{t}$ and action $a$ is executed. This function can be expressed as the following equation:

(3)
$Q_{U}(s_{t},\:\omega,\: a)=r(s_{t},\: a)+\gamma\sum_{s_{t+1}}P(s_{t+1}| s_{t},\: a)U(s_{t+1},\:\omega).$

In (3), note that the value of the state $s_{t+1}$ reached through the option $\omega$ can be expressed as follows:

(4)
\begin{align*} U(s_{t+1},\:\omega)=(1-\beta_{\omega,\: v}(s_{t+1}))Q_{\Omega}(s_{t+1},\:\omega)\\ +\beta_{\omega,\: v}(s_{t+1})V_{\Omega}(s_{t+1}),\: \end{align*}

where $Q_{\Omega}(s_{t},\:\omega)=\sum_{a}\pi_{\omega,\:\theta}(a | s_{t})Q_{U}(s_{t},\:\omega,\: a)$ represents the option value function, and $V_{\Omega}(s_{t})=\sum_{\omega}\pi_{\Omega}(\omega | s_{t})Q_{\Omega}(s_{t},\:\omega)$ represents the option-level state value function. In this case, the option $\omega$ terminates with probability $\beta_{\omega,\: v}(s_{t+1})$, leading to the selection of a new option, or continues with probability $1-\beta_{\omega,\: v}(s_{t+1})$.
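
For concreteness, the sketch below (illustrative only; all numerical values are hypothetical) evaluates Eq. (4) in a small tabular setting, with $V_{\Omega}$ obtained from $Q_{\Omega}$ and the policy over options:

```python
import numpy as np

def option_utility(q_omega, v_omega, beta, s_next, w):
    """U(s', w) from Eq. (4): keep option w with probability 1 - beta_w(s'),
    otherwise terminate and fall back to the option-level value V_Omega(s')."""
    b = beta[s_next, w]
    return (1.0 - b) * q_omega[s_next, w] + b * v_omega[s_next]

# Toy problem with 2 states and 2 options (hypothetical numbers).
q_omega = np.array([[1.0, 0.4],
                    [0.7, 0.9]])              # Q_Omega(s, w)
pi_over_options = np.array([[0.6, 0.4],
                            [0.3, 0.7]])      # pi_Omega(w | s)
v_omega = (pi_over_options * q_omega).sum(axis=1)   # V_Omega(s)
beta = np.array([[0.2, 0.5],
                 [0.8, 0.1]])                 # beta_{w,v}(s)

print(option_utility(q_omega, v_omega, beta, s_next=1, w=0))
```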

The intra-option policies and option termination functions can be learned by using the policy gradient theorem to maximize the expected discounted return. For the initial condition $(s_{0},\:\omega_{0})$, the gradient of the objective function with respect to the intra-option policy parameters $\theta$ can be expressed as:

(5)
\begin{align*} \dfrac{\partial Q_{\Omega}(s_{0},\:\omega_{0})}{\partial\theta}=\sum_{s_{t},\:\omega}\mu_{\Omega}(s_{t},\:\omega | s_{0},\:\omega_{0})\\\times\sum_{a}\dfrac{\partial\pi_{\omega,\:\theta}(a | s_{t})}{\partial\theta}Q_{U}(s_{t},\:\omega,\: a). \end{align*}

Similarly, the gradient with respect to option termination parameters $v$ with initial condition $(s_{1},\: \omega_{0})$ is:

(6)
\begin{align*} \dfrac{\partial U(\omega_{0},\: s_{1})}{\partial v}= -\sum_{s_{t+1},\:\omega}\mu_{\Omega}(s_{t+1},\:\omega | s_{1},\:\omega_{0})\\\times\dfrac{\partial\beta_{\omega,\: v}(s_{t+1})}{\partial v}A_{\Omega}(s_{t+1},\:\omega),\: \end{align*}
where $A_{\Omega}(s,\:\omega)=Q_{\Omega}(s,\:\omega)-V_{\Omega}(s)$ is the advantage of choosing option $\omega$ in state $s$.
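
The gradients in Eqs. (5) and (6) are typically applied as sampled actor updates. The PyTorch-style sketch below is a hedged illustration with toy values (the parametric heads, option index, and critic estimates are placeholders rather than the paper's implementation): it ascends $Q_{U}$ along the intra-option log-policy and lowers the termination probability when the option's advantage is positive.

```python
import torch
import torch.nn.functional as F

n_actions, n_options = 4, 2
policy_logits = torch.zeros(n_options, n_actions, requires_grad=True)  # theta
term_logits = torch.zeros(n_options, requires_grad=True)               # v

w, a = 1, 2                     # sampled option and executed action
q_u = torch.tensor(0.8)         # critic estimate Q_U(s_t, w, a)
advantage = torch.tensor(0.3)   # A_Omega(s_{t+1}, w) = Q_Omega - V_Omega

log_pi = F.log_softmax(policy_logits[w], dim=-1)[a]   # log pi_{w,theta}(a | s_t)
beta_next = torch.sigmoid(term_logits[w])             # beta_{w,v}(s_{t+1})

# Eq. (5): policy-gradient ascent on Q_U; Eq. (6): reduce beta when the option
# still has a positive advantage (both expressed as a single loss to minimize).
loss = -(log_pi * q_u.detach()) + beta_next * advantage.detach()
loss.backward()
print(policy_logits.grad[w], term_logits.grad[w])
```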

2.3 Long Short-Term Memory

Long Short-Term Memory (LSTM) networks [25] are a variant of Recurrent Neural Networks (RNN) [26] that effectively capture feature relationships in time series data while addressing the vanishing and exploding gradient issues that RNNs encounter when processing long sequences. One of the main challenges in applying DRL to power system control is finding the optimal control strategy within the vast action and state spaces. To overcome this challenge, we introduce LSTM networks, which can capture time-related information relevant to the target task in high-dimensional state spaces and prevent issues such as vanishing gradients during training.

In the LSTM structure, the forget gate $f_{t}$, input gate $i_{t}$, input node $g_{t}$, and output gate $o_{t}$ are defined at time $t$, with weights and bias values denoted by $W$ and $b$, respectively. The sigmoid activation function and the tanh activation function are represented by $\sigma$ and $\phi$. The following describes the input-output mapping relationships of each node, with the specific workflow of the LSTM illustrated in Fig. 1.

Fig. 1. LSTM network structure.


First, the forget gate $f_{t}$ uses the sigmoid activation function to determine the information to discard from the input $x_{t}$ and the previous output $h_{t-1}$, represented as:

(7)
$f_{t}=\sigma(W_{f}[x_{t},\: h_{t-1}]+b_{f}).$

Next, the input gate $i_{t}$ similarly takes the input information $x_{t}$ and the intermediate output $h_{t-1}$ from the previous time step, applying the sigmoid activation function to decide what information to store. This is combined with the input node $g_{t}$ to determine what to retain, expressed as:

(8)
$i_{t}=\sigma(W_{i}[x_{t},\: h_{t-1}]+b_{i}),\:$ $g_{t}=\phi(W_{g}[x_{t},\: h_{t-1}]+b_{g}).$

Subsequently, the internal state $s_{t-1}$ is updated to the new state $s_{t}$ using the input gate $i_{t}$ and the forget gate $f_{t}$, as shown in the following equation:

(9)
$s_{t}=f_{t}s_{t-1}+g_{t}i_{t}.$

Finally, the input information $x_{t}$ and the intermediate output $h_{t-1}$ serve as inputs to the output gate $o_{t}$, which is processed through a sigmoid activation function. The final output $h_{t}$ is generated based on the internal state $s_{t}$, represented by:

(10)
$o_{t}=\sigma(W_{o}[x_{t},\: h_{t-1}]+b_{o}),\:$ $h_{t}=\phi(s_{t})o_{t}.$

The LSTM model effectively maintains the dependencies between information by storing node information in the internal state through the cooperation of internal nodes.
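
The gate equations (7)-(10) amount to a single cell update per time step. The NumPy sketch below (illustrative; the weights are randomly initialized rather than trained) implements one LSTM step exactly as written above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, b):
    """One LSTM cell step following Eqs. (7)-(10); W and b hold the gate
    parameters keyed by 'f', 'i', 'g', 'o'."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W['f'] @ z + b['f'])          # forget gate, Eq. (7)
    i_t = sigmoid(W['i'] @ z + b['i'])          # input gate, Eq. (8)
    g_t = np.tanh(W['g'] @ z + b['g'])          # input node, Eq. (8)
    s_t = f_t * s_prev + i_t * g_t              # internal state, Eq. (9)
    o_t = sigmoid(W['o'] @ z + b['o'])          # output gate, Eq. (10)
    h_t = np.tanh(s_t) * o_t                    # hidden output, Eq. (10)
    return h_t, s_t

n_in, n_hidden = 6, 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_hidden, n_in + n_hidden)) * 0.1 for k in 'figo'}
b = {k: np.zeros(n_hidden) for k in 'figo'}
h, s = np.zeros(n_hidden), np.zeros(n_hidden)
h, s = lstm_step(rng.standard_normal(n_in), h, s, W, b)
print(h)
```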

2.4 OC-LSTM: Option-Critic with Long Short-Term Memory

The OC-LSTM algorithm aims to solve the topology control problem in power systems by combining the LSTM and Option-Critic frameworks, and its algorithm structure is shown in Fig. 2.

First, OC-LSTM learns and processes the state inputs of the power system through LSTM networks. The LSTM captures key state information with temporal structure, such as generation output, load profiles, and changes in the grid topology, helping the agent better understand the dynamic features of the system.

Fig. 2. The overall OC-LSTM framework.


Next, the extracted temporal state information is fed into the OC framework. In this framework, the policy over options selects an appropriate option based on the current state.

Then, the intra-option policy determines the specific action to be executed, and the termination function decides whether to continue with the current option or to terminate it and select a new option according to the policy over options.

In this process, the critic evaluates the performance of the intra-option policies and the termination functions and updates both via the policy gradient to optimize the agent's behavior.

By combining the ability of LSTM to extract temporal state features with the OC hierarchical policy framework, the OC-LSTM algorithm effectively copes with the challenges posed by the increase in state space and action dimensions in power system topology control problems. LSTM is able to capture the dynamic changes of key state information in the power system, thus providing rich temporal feature information for decision making. Meanwhile, the OC framework introduces a hierarchical decision structure, which enables the agent to select and execute control policies more efficiently in the face of complex action spaces. This combination enhances the stability of the agent in changing environments.
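
To make the architecture in Fig. 2 concrete, the following schematic PyTorch sketch shows one possible OC-LSTM network layout; the hidden size, the number of options, and the head structure are assumptions for illustration and are not the exact hyperparameters used in this work.

```python
import torch
import torch.nn as nn

class OCLSTMNet(nn.Module):
    """Schematic OC-LSTM network: an LSTM encoder for temporal grid features
    feeding a policy-over-options head, per-option intra-option policy heads,
    and per-option termination heads. Sizes are illustrative only."""
    def __init__(self, obs_dim, n_actions, n_options, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.q_option = nn.Linear(hidden, n_options)              # Q_Omega(s, w)
        self.intra_pi = nn.Linear(hidden, n_options * n_actions)  # pi_{w,theta}(a|s)
        self.termination = nn.Linear(hidden, n_options)           # beta_{w,v}(s)
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, obs_seq, hidden_state=None):
        feat, hidden_state = self.lstm(obs_seq, hidden_state)
        feat = feat[:, -1]                                  # last time-step feature
        q_w = self.q_option(feat)
        pi_logits = self.intra_pi(feat).view(-1, self.n_options, self.n_actions)
        beta = torch.sigmoid(self.termination(feat))
        return q_w, pi_logits, beta, hidden_state

# E.g. IEEE 14-Bus dimensions from Table 1: 177-dim observation, 21 actions.
net = OCLSTMNet(obs_dim=177, n_actions=21, n_options=4)
q_w, pi_logits, beta, _ = net(torch.randn(1, 8, 177))   # one 8-step sequence
print(q_w.shape, pi_logits.shape, beta.shape)
```

In this sketch the policy over options can act (e.g. greedily) on q_w, while pi_logits[:, w] parameterizes the intra-option policy of option w and beta[:, w] its termination probability.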

3. Topology Control Problem and Simulation

Section 3 defines the power system topology control problem based on the Markov decision process and details the simulation environment used in the experiments. Simulation results verify the proposed method's performance and stability in power system topology control.

3.1 Simulation environments

We will conduct simulations in three environments: IEEE 5-Bus, IEEE 14-Bus, and L2RPN WCCI 2020, as shown in Fig. 3. These power system networks consist of generators (such as thermal power plants and renewable energy facilities), loads (such as households and factories), and power lines. Generators and loads are connected through substations, and different substations are interconnected by power lines.

Fig. 3(a) shows the structure of the IEEE 5-Bus simulation environment, which includes 8 power lines, 3 loads, 2 generators, and 5 substations, with 20 scenarios. Each scenario has a time interval of 5 minutes, allowing a maximum of 288 survival steps per day.

Fig. 3. Power system environments.


Fig. 3(b) presents the IEEE 14-Bus simulation environment, which includes 20 power lines, 11 loads, 6 generators, and 14 substations. The IEEE 14-Bus simulation environment contains 1004 scenarios, each simulating the internal changes of the French power grid over 28 consecutive days. Similar to the IEEE 5-Bus environment, this simulation environment allows a maximum of 288 survival steps per day. Each step records information on grid topology and generator output, with details provided in the observation space section. Consecutive steps capture the dynamic changes within the power system.

Fig. 3(c) illustrates the L2RPN WCCI 2020 environment, which is based on part of the IEEE 118-Bus network and includes 59 power lines, 37 loads, 22 generators, and 36 substations. The simulation dataset for the L2RPN WCCI 2020 environment covers 240 years of time data, with the same number of survival steps in its simulation setup as the two aforementioned environments.

All the above power system simulations are modeled on the “Grid2Op” platform to ensure standardized and consistent experiments [2,28].
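
For reference, the three test systems approximately correspond to publicly available Grid2Op datasets; the environment names in the sketch below are assumptions based on standard Grid2Op releases and may differ from the exact datasets used in the experiments.

```python
import grid2op

# Assumed Grid2Op dataset names for the three test systems (may differ from
# the exact chronics used in the experiments reported here).
env_names = {
    "IEEE 5-Bus": "rte_case5_example",
    "IEEE 14-Bus": "l2rpn_case14_sandbox",
    "L2RPN WCCI 2020": "l2rpn_wcci_2020",
}

env = grid2op.make(env_names["IEEE 14-Bus"])
obs = env.reset()
# Expected element counts for the IEEE 14-Bus system:
# 20 lines, 11 loads, 6 generators, 14 substations.
print(env.n_line, env.n_load, env.n_gen, env.n_sub)
```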

3.2 Define topology control with MDP

Observation space : We set up a unified observation space for the power system environment, including v_or, v_ex, a_or, a_ex, ρ, and line_status, where v_or and v_ex denote the voltage values on the buses connected to the start and end points of each power line, a_or and a_ex denote the currents on the buses connected to the start and end points of each power line, ρ denotes the thermal capacity utilization of each power line, and line_status denotes the connection status of each power line. This design enables the agent to fully capture the critical power flow dynamics of the system, thus providing an accurate decision basis for the optimal topology control problem. The observation space dimensions of the three power system environments are shown in Table 1.

Table 1 Dimensions of the observation and action spaces in the three power system environments.

Environment          Observation Space    Action Space
IEEE 5-Bus           69                   9
IEEE 14-Bus          177                  21
L2RPN WCCI 2020      531                  60

Action space : We define the action space of the power system environment as change_line_status, which changes the status of a power line (connected/disconnected); this is a discrete action type. This design is consistent with the goal of this study, which is to achieve stable power system operation through topology control of power lines. The action space consists of $A=(A_{\mathrm{DoNothing}},\: A_{\mathrm{line}})$, where $A_{\mathrm{line}}$ denotes the set of switching actions for the power lines that can be connected or disconnected, and $A_{\mathrm{DoNothing}}$ denotes the action that performs no operation. The action space dimensions in the three power system simulation environments are shown in Table 1.
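
The observation vector and the change_line_status action described above can be assembled from standard Grid2Op observation attributes, as in the hedged sketch below (the feature ordering is illustrative, so the resulting vector length need not match Table 1 exactly):

```python
import numpy as np
import grid2op

env = grid2op.make("l2rpn_case14_sandbox")   # assumed IEEE 14-Bus dataset
obs = env.reset()

# Per-line features from the observation space: voltages and currents at both
# extremities, loading rate rho, and connection status.
state = np.concatenate([obs.v_or, obs.v_ex, obs.a_or, obs.a_ex,
                        obs.rho, obs.line_status.astype(float)])
print(state.shape)

# Discrete line-switching action space: toggle one line, or do nothing.
toggle_line_3 = env.action_space({"change_line_status": [3]})
do_nothing = env.action_space({})
obs, reward, done, info = env.step(toggle_line_3)
```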

Reward function : We defined the reward function for the power system environment (as shown in Equation (12)), which consists of two parts: feedback on the agent's behavior and the utilization rate of line capacity. Specifically, if the agent's policy causes the power system environment to terminate prematurely or if illegal actions are taken, the agent will receive a minimum reward of 0. On the other hand, the reward calculation for line capacity utilization is shown in Equation (11), where $\rho_{i}$ denotes the utilization rate of each power line ($\rho_{i}\in(0,\: 1)$) and $line_{i}$ denotes the connection status of each power line. This equation reflects the capacity utilization of the power lines. When the utilization of all lines is close to the maximum capacity, the reward value is close to 0. Conversely, the lower the line utilization, the higher the reward value. This reward function is designed to allow the agent to avoid overloading the power lines, thus maintaining the stability of the power system.

(11)
$x =\dfrac{\sum_{i=1}^{N}\mathrm{line}_{i}\cdot(1-\rho_{i})}{\sum_{i=1}^{N}\mathrm{line}_{i}},\:$
(12)
$r_{t}=\begin{cases} x,\: &\text{if } s_{t}\notin S^{\mathrm{error}} \text{ and } a_{t}\notin A^{\mathrm{illegal\; or\; ambiguous}},\: \\ 0,\: &\text{if } s_{t}\in S^{\mathrm{error}} \text{ or } a_{t}\in A^{\mathrm{illegal\; or\; ambiguous}},\:\end{cases}$

where $x\in(0,\: 1)$, $A^{\mathrm{illegal\; or\; ambiguous}}$ represents the set of illegal or ambiguous actions as defined by the Grid2Op simulation platform, and $S^{\mathrm{error}}$ represents the premature termination of the environment.
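
A minimal sketch of the reward in Eqs. (11) and (12), assuming the per-line loading rates and connection statuses are given as NumPy arrays (the numbers below are hypothetical):

```python
import numpy as np

def topology_reward(rho, line_status, terminated=False, illegal=False):
    """Eqs. (11)-(12): 0 on premature termination or an illegal/ambiguous
    action, otherwise the average unused capacity over connected lines."""
    if terminated or illegal:
        return 0.0
    line = line_status.astype(float)
    return float(np.sum(line * (1.0 - rho)) / np.sum(line))

# Hypothetical 4-line system where line 2 is disconnected; lightly loaded
# connected lines push the reward toward 1, heavy loading pushes it toward 0.
rho = np.array([0.35, 0.80, 0.00, 0.55])
status = np.array([True, True, False, True])
print(topology_reward(rho, status))
```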

To ensure that the agent learns to control the power grid safely and effectively, it operates within a strictly constrained environment to maintain the grid's normal operation. These constraints correspond to real-world operational limitations, and violations will result in the termination of the environment. The constraints include:

1. Disconnecting too many power lines, causing an inability to meet load demand.

2. Interrupting the connection between generators and the grid.

3. Creating grid islands due to topological changes.

3.3 Simulation results

Fig. 4 presents the training results across three different power system environments, where we compare the OC-LSTM algorithm with two typical DRL baseline algorithms: PPO and DDDQN. DDDQN is an improved version of the DQN algorithm, which significantly outperforms the original DQN in terms of performance. On the other hand, PPO is a widely used policy optimization algorithm that maintains training stability and sampling efficiency while gradually converging to the optimal policy. Both baseline algorithms are suitable for handling high-dimensional state spaces and discrete action spaces.

Fig. 4. Training results of OC-LSTM (ours) and baseline algorithms in (a) IEEE 5-Bus, (b) IEEE 14-Bus and (c) L2RPN WCCI 2020 environments.


To ensure the robustness of the results, each algorithm was trained three times. In the training process, we use the number of survival steps as the evaluation criterion, as this metric intuitively reflects the agent's control capability. In the simulation, the maximum operation time of the power system is set to 4 days, with an environment sampling interval of 5 minutes; therefore, the maximum number of survival steps per episode is 1152. A larger number of survival steps indicates that the agent can maintain stable operation of the power system for a longer time through topology control, reflecting stronger topology control performance. Conversely, poor topology control policies will struggle to maintain stable operation of the power system, resulting in a lower number of survival steps.

3.3.1 Training

As shown in Fig. 4(a), the OC-LSTM algorithm and the PPO algorithm achieve similar numbers of survival steps in the IEEE 5-Bus environment. This indicates that both algorithms are effective in maintaining grid stability in topology control tasks for small-scale grids. In contrast, the DDDQN algorithm performs poorly in this setting, possibly because its more limited decision making in topology control tends to fall into local optima.

Fig. 4(b) shows the results in the IEEE 14-Bus environment. In this medium-scale power system, the OC-LSTM algorithm significantly outperforms PPO and DDDQN and maintains a stable number of survival steps, reaching more than 720 steps, which is equivalent to about 60 hours of operation time. This further validates that the OC-LSTM algorithm can maintain power system stability even as the state and action spaces of the environment increase.

Fig. 4(c) shows the training results in the L2RPN WCCI 2020 environment. Even in this large-scale power grid environment, the OC-LSTM algorithm still demonstrates better stability, significantly outperforming PPO and DDDQN, and maintains a large number of survival steps. This demonstrates that the OC-LSTM algorithm can keep the grid operating stably by effectively controlling its topology even in a more complex power system.

3.3.2 Testing

The comprehensive analysis of the experimental results in the small-scale (IEEE 5-Bus), medium-scale (IEEE 14-Bus), and large-scale (L2RPN WCCI 2020) environments shows that the OC-LSTM algorithm maintains good performance as the state and action spaces of the power system grow. Compared with the baseline algorithms PPO and DDDQN, the OC-LSTM shows superior stability and control capability in topology control tasks across grids of different scales. This further demonstrates the effectiveness of introducing a hierarchical policy architecture. Table 2 presents the mean and standard deviation of the survival steps for the agent in the three power system environments after the algorithms converge. Each survival step corresponds to 5 minutes, with 288 survival steps representing one day. These data reflect the agent's ability to maintain stable performance in power systems of three different scales.

Table 2 The training results of agents in three power system environments. The data includes mean and standard deviation, representing the training results of the agents in different environments. Each step represents 5 minutes, with 288 steps in a day.

Environment          OC-LSTM (steps)    PPO (steps)    DDDQN (steps)
IEEE 5-Bus           894 ± 43           803 ± 71       224 ± 23
IEEE 14-Bus          728 ± 98           378 ± 60       170 ± 59
L2RPN WCCI 2020      825 ± 61           77 ± 12        265 ± 28

To further validate the improved feature extraction capability of the OC algorithm combined with LSTM in power systems, we performed ablation experiments in the L2RPN WCCI 2020 environment. Fig. 5 compares the performance of the OC-LSTM algorithm and the standard OC algorithm. The results show that the OC-LSTM algorithm achieves a significantly higher number of survival steps, even though its convergence is slightly slower than that of the OC algorithm using a linear layer. This result demonstrates the advantage of LSTM in extracting temporal features of the power system, which further enhances the control ability of the agent in complex grid environments.

We randomly selected 5 scenarios for evaluation in the three power system environments of different scales, using the best-performing model and comparing the OC-LSTM algorithm with the Do-Nothing action. The Do-Nothing action represents the scenario where no action is taken in the power system throughout, showing the maximum survival steps of the power system without any intervention. Comparing against the Do-Nothing action tests the effectiveness of the topology control learned by the model.

Fig. 5. Ablation experiment: L2RPN WCCI 2020 environment.


As shown in Fig. 6(a), in the IEEE 5-Bus environment, the OC-LSTM algorithm consistently enables the power system to operate stably for over 60 hours across all scenarios.

Fig. 6(b) presents the evaluation results in the IEEE 14-Bus environment. The OC-LSTM algorithm significantly outperforms the Do-Nothing action in all scenarios, helping the system operate for more than 60 hours.

Similarly, in the L2RPN WCCI 2020 environment (as shown in Fig. 6(c)), the OC-LSTM algorithm performs excellently, maintaining stable operation of the large power system for over 60 hours and clearly outperforming the Do-Nothing action.

Fig. 6. Testing results of OC-LSTM algorithm.


4. Conclusion

This research proposes a novel OC-LSTM algorithm for topology control problems in power systems. The algorithm achieves stable control for over 60 hours in power system environments such as IEEE 5-Bus, IEEE 14-Bus, and L2RPN WCCI 2020, and all operations are performed without the intervention of human experts. Compared with the baseline DRL algorithms, the OC-LSTM algorithm is better suited to power system applications with high-dimensional state and action spaces by introducing a hierarchical policy and extracting temporal features with the LSTM. The LSTM efficiently captures the dynamics of the power system and extracts the key temporal features, which enables the agent to better comprehend the environmental information and make accurate decisions. This combination not only enhances the performance of OC-LSTM in complex power system environments but also significantly improves its applicability to real power systems.

Future research can focus on further optimizing the intra-option policy in the OC-LSTM algorithm to enhance its control efficiency in environments where large-scale renewable energy and power storage devices are deployed. This will help to increase the value of OC-LSTM applications in real power systems and provide stronger support for sustainable energy management.


Acknowledgements

This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in part by the Institute of Information & communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).

References

1 
Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications: An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225, 2019. DOI:10.17775/CSEEJPES.2019.00920DOI
2 
A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici and C. Romero, “Learning to run a power network challenge: a retrospective analysis,” in NeurIPS 2020 Competition and Demonstration Track. PMLR, pp. 112–132, 2021. https://proceedings.mlr.press/v133/marot21a.htmlURL
3 
D. Ernst, M. Glavic and L. Wehenkel, “Power systems stability control: reinforcement learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435, 2004. DOI:10.1109/TPWRS.2003.821457DOI
4 
W. Cai, H. N. Esfahani, A. B. Kordabad and S. Gros, “Optimal management of the peak power penalty for smart grids using mpc-based reinforcement learning,” in 2021 60th IEEE Conference on Decision and Control (CDC), pp. 6365–6370, 2021. DOI:10.1109/CDC45484.2021.9683333DOI
5 
M. Kamel, R. Dai, Y. Wang, F. Li and G. Liu, “Data-driven and model-based hybrid reinforcement learning to reduce stress on power systems branches,” CSEE Journal of Power and Energy Systems, vol. 7, no. 3, pp. 433–442, 2021. DOI:10.17775/CSEEJPES.2020.04570DOI
6 
J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems, vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120DOI
7 
H. Yousuf, A. Y. Zainal, M. Alshurideh and S. A. Salloum, “Artificial intelligence models in power system analysis,” in Artificial Intelligence for Sustainable Development: Theory, Practice and Future Applications. Springer, vol. 912, pp. 231–242, 2020. DOI:10.1007/978-3-030-51920-9_12DOI
8 
C. Zhao, U. Topcu, N. Li and S. Low, “Design and stability of load-side primary frequency control in power systems,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1177–1189, 2014. DOI:10.1109/TAC.2014.2298140DOI
9 
Y. Zhang, X. Shi, H. Zhang, Y. Cao and V. Terzija, “Review on deep learning applications in frequency analysis and control of modern power system,” International Journal of Electrical Power & Energy Systems, vol. 136, no. 107744, pp. 1–18, 2022. DOI:10.1016/j.ijepes.2021.107744DOI
10 
A. K. Ozcanli, F. Yaprakdal and M. Baysal, “Deep learning methods and applications for electrical power systems: A comprehensive review,” International Journal of Energy Research, vol. 44, no. 9, pp. 7136–7157, 2020. DOI:10.1002/er.5331DOI
11 
D. Yoon, S. Hong, B. J. Lee and K. E. Kim, “Winning the l2rpn challenge: Power grid management via semi-markov afterstate actor-critic,” in 9th International Conference on Learning Representations, ICLR 2021, pp. 1–12, 2021. https://openreview.net/forum?id=LmUJqB1Cz8URL
12 
M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid topology reconfiguration using a simple deep reinforcement learning approach,” in 2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879DOI
13 
I. Damjanović, I. Pavić, M. Brčić and R. Jerčić, “High performance computing reinforcement learning framework for power system control,” in 2023 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT). IEEE, pp. 1–5, 2023. DOI:10.1109/ISGT51731.2023.10066416DOI
14 
J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347DOI
15 
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236DOI
16 
T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, pp. 1861–1870, 2018.URL
17 
Y. Liu, D. Zhang and H. B. Gooi, “Optimization strategy based on deep reinforcement learning for home energy management,” CSEE Journal of Power and Energy Systems, vol. 6, no. 3, pp. 572–582, 2020. DOI:10.17775/CSEEJPES.2019.02890DOI
18 
Y. Zhou, B. Zhang, C. Xu, T. Lan, R. Diao, D. Shi, Z. Wang and W. -J. Lee, “A data-driven method for fast ac optimal power flow solutions via deep reinforcement learning,” Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128–1139, 2020. DOI:10.35833/MPCE.2020.000522DOI
19 
B. Zhang, W. Hu, D. Cao, T. Li, Z. Zhang, Z. Chen and F. Blaabjerg, “Soft actor-critic-based multi-objective optimized energy conversion and management strategy for integrated energy systems with renewable energy,” Energy Conversion and Management, vol. 243, no. 114381, pp. 1–15, 2021. DOI:10.1016/j.enconman.2021.114381DOI
20 
G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016. DOI:10.48550/arXiv.1606.01540DOI
21 
M. Lehna, J. Viebahn, A. Marot, S. Tomforde and C. Scholz, “Managing power grids through topology actions: A comparative study between advanced rule-based and reinforcement learning agents,” Energy and AI, vol. 14, no. 100276, pp. 1–11, 2023. DOI:10.1016/j.egyai.2023.100276DOI
22 
I. Damjanović, I. Pavić, M. Puljiz and M. Brčić, “Deep reinforcement learning-based approach for autonomous power flow control using only topology changes,” Energies, vol. 15, no. 19, pp. 1–16, 2022. DOI:10.3390/en15196920DOI
23 
X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous control approach for power system topology optimization,” in 2022 41st Chinese Control Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073DOI
24 
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.URL
25 
K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, 2016. DOI:10.1109/TNNLS.2016.2582924DOI
26 
Z. C. Lipton, J. Berkowitz and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015. DOI:10.48550/arXiv.1506.00019DOI
27 
P. -L. Bacon, J. Harb and D. Precup, “The option-critic architecture,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, pp. 1726–1734, 2017. DOI:10.1609/aaai.v31i1.10916DOI
28 
B. Donnot, “Grid2op- A testbed platform to model sequential decision making in power systems,” 2020. https://GitHub.com/rte-france/grid2opURL

About the Authors

왕천(Chen Wang)

Chen Wang received the B.S. degree in electronics and computer engineering and the M.S. degree in electronic computer engineering from Chonnam National University, South Korea, in 2020 and 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang University, Seoul, South Korea. His research interests include smart grid, deep reinforcement learning, and their applications.

장호천(Haotian Zhang)

Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea, in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang University, Seoul, South Korea. His research interests include optimal control, smart grid, deep reinforcement learning, and their applications.

이민주(Minju Lee)

Minju Lee received the B.S. degree in climate and energy systems engineering from Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing the degree with the Department of Climate and Energy Systems Engineering. Her research interests include short-term wind power forecasting and the probabilistic estimation of transmission congestion for grid integration.

이명훈(Myoung Hoon Lee)

Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical engineering from the Ulsan National Institute of Science and Technology, Ulsan, South Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul, South Korea. He is currently an Assistant Professor with the Department of Electrical Engineering, Incheon National University, Incheon, South Korea. His research interests include decentralized optimal control, mean field games, deep reinforcement learning, and their applications.

문준(Jun Moon)

Jun Moon is currently an Associate Professor in the Department of Electrical Engineering at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical and computer engineering, and the M.S. degree in electrical engineering from Hanyang University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D. degree in electrical and computer engineering from University of Illinois at Urbana-Champaign, USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development (ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology (UNIST), South Korea, as an assistant professor. From 2019 to 2020, he was with the School of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research interests include stochastic optimal control and filtering, reinforcement learning, data-driven control, distributed control, networked control systems, and mean field games.