Chen Wang¹, Haotian Zhang¹, Minju Lee², Myoung Hoon Lee††, Jun Moon†
Department of Electrical Engineering, Hanyang University, Seoul 04763, Republic of Korea
KEPCO Research Institute, Daejeon Metropolitan City 34056, Republic of Korea
Key words
Deep reinforcement learning, option-critic framework, topology control, smart grid.
1. Introduction
As renewable energy sources like wind and solar are increasingly integrated into power
systems, ensuring efficient and secure power transmission has become more challenging
[1,2]. In this context, traditional model-based control and management methods for power
systems are beginning to show their limitations. Recently, with the rise of neural
networks (NN), deep reinforcement learning (DRL) control methods have gained significant
attention [3]. Several studies have explored the use of DRL models for optimizing and controlling
the power grid [4-6]. Specifically, [4] investigated a multi-agent residential smart grid and proposed an optimal policy
for minimizing economic costs using parametric model predictive control (MPC) under
a deterministic policy gradient (DPG)-based reinforcement learning algorithm. [5] applied a model-based and data-driven DRL algorithm to resolve line overload issues
in power systems, while [6] introduced an emergency load shedding method based on a deep deterministic policy
gradient (DDPG) algorithm to enhance stable power system operations through autonomous
voltage control.
As power systems continue to develop and expand, their size and complexity steadily increase [7]. In this context, conventional model-based automatic control methods face growing challenges in meeting the demands of power grid operations. Traditional control approaches
primarily focus on regulating generators and loads. However, as the power network
expands, these methods may lack flexibility, especially when dealing with the grid's
variability, efficient energy integration, and system security concerns [8-10]. In light of these challenges, topology control methods have gained considerable
attention. Compared to other control strategies, topology control offers a more cost-effective
approach to managing the power grid. This method reconfigures the grid structure by
adjusting the connections of power lines and the distribution of buses, which effectively
reduces congestion and
enhances power transmission efficiency. A distinct advantage of topology control is
its ability to quickly respond to changes in grid topology, helping to lower the risk
of system failures and improve the overall stability and robustness of the power system
[11,12]. This method not only enhances the flexibility and sustainability of power systems
but also offers opportunities to improve system performance. As a result, topology
control plays a crucial role in modernizing and optimizing power systems.
1.1 Related works
Classic control methods in power systems, such as MPC and proportional-integral-derivative (PID) control, depend heavily on a detailed dynamic model of the system, which must be represented by an accurate mathematical model for optimal decision making. Building such models in complex or highly dynamic environments is therefore often a major challenge. In contrast, DRL, as a data-driven approach, can effectively circumvent the need for detailed mathematical models and cope with the complexity and uncertainty of power system environments by using large amounts of data and iterative trial-and-error to find the optimal control policy.
In recent years, the rapid advancement of DRL has led to the widespread application
of various baseline DRL algorithms in power system control. These algorithms encompass
deep Q-network (DQN) [15], proximal policy optimization (PPO) [14], and soft actor-critic (SAC) [16], all of which have demonstrated outstanding performance across various power system
control scenarios. Specifically, in [17], the authors utilized the DQN and double deep Q-network (DDQN) algorithms for scheduling
household appliances to determine the optimal energy scheduling policy. [18] introduces a novel method for addressing the alternating current optimal power flow
problem by employing an advanced PPO algorithm, which aids grid operators in rapidly
developing accurate control policies to ensure system security and economic efficiency.
To address uncertainties such as the intermittency of wind energy and load flexibility,
[19] applies the SAC algorithm for energy dispatch optimization.
Additionally, [2] introduced ‘Grid2Op’, a power system simulation platform designed to address control
challenges in power systems using artificial intelligence. ‘Grid2Op’ is an open-source
framework compatible with the OpenAI Gym [20], offering a convenient tool for building and controlling power systems with reinforcement
learning algorithms. Specifically, ‘Grid2Op’ integrates seamlessly with DRL algorithms
to enable effective power system management by intelligently regulating power line
switching states or bus distribution. A key advantage of using the ‘Grid2Op’ platform
is that it allows experiments with real power system data, making it possible to train
and
simulate using any DRL algorithm. As a result, ‘Grid2Op’ has become one of the leading
simulation platforms for power systems, offering scenario-based simulations that provide
higher realism and credibility for experimental results [2,21].
Among the studies on power system topology control using DRL in the “Grid2Op” platform,
the authors in [12] employed the cross-entropy method (CEM) reinforcement learning algorithm to manage
power flow through topology switching actions, analyzing the variability and types
of topologies. In [22], the authors applied the dueling double deep Q-network algorithm combined with a
prioritized replay mechanism to control topology changes in power systems, achieving
notable results. Additionally, [23] proposed a method that combines imitation learning with the SAC deep reinforcement
learning algorithm to enable stable autonomous control in the IEEE 118-Bus power system,
demonstrating the method's effectiveness and robustness in topology optimization.
1.2 Main Contributions
In this paper, an option-critic based DRL method for topology control policy in power
systems is proposed, with the main contributions outlined as follows:
1. We integrate the option-critic (OC) algorithm with long short-term memory (LSTM)
neural networks to capture time-series features in high-dimensional power system environments.
The LSTM networks help model these features, while the option-based DRL algorithm
decomposes the large and complex action space into executable options, effectively
reducing the dimensionality of the action space in the power system.
2. We apply the proposed OC-LSTM algorithm to the topology control policy in the power
system and compare it with a baseline DRL algorithm. Experimental results demonstrate
that the OC-LSTM method enables stable operation of the IEEE 5-Bus, IEEE 14-Bus, and
L2RPN WCCI 2020 power system environments for 60 hours, without requiring any human
expert intervention.
To evaluate the performance of the OC-LSTM, we conducted simulation experiments on
the ‘Grid2Op’ platform, using three power system simulation environments of varying
scales. We then compared our algorithm with baseline DRL algorithms, namely dueling
double deep Q-network (DDDQN) [13] and proximal policy optimization (PPO) [14], to validate the effectiveness of OC-LSTM. Furthermore, to assess the OC-LSTM's effectiveness
in addressing the power system topology control problem, we compared the optimal model of our algorithm against the Do Nothing action. The results demonstrate that OC-LSTM can maintain stable operation for a longer period than the Do Nothing action, without the need for human expert intervention.
The paper is organized as follows: In Section 2, we introduce the OC framework and
the LSTM network, followed by the proposed OC with LSTM algorithm (OC-LSTM). Section
3 describes the implementation of power system topology control simulation experiments
and validates the effectiveness of the proposed methodology through simulation results.
Finally, we conclude the paper in Section 4.
2. Option-Based Reinforcement Learning with Long Short-Term Memory
In this section, we introduce the Option-Critic (OC) framework in detail, followed by the core constituent units of the Long Short-Term Memory (LSTM) neural network. Combining the strengths of both, we propose the OC-LSTM algorithm, which pairs a hierarchical policy with temporal feature extraction to effectively improve the stability and performance of power system topology control.
2.1 Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) problems can be described as a Markov Decision Process (MDP) formalized as a 5-tuple $<S,\: A,\: P,\: R,\: \gamma >$, where $S$ denotes the state space, $A$ denotes the action space, and $P$ denotes the state transition probability, defined as the probability of moving to the next state after taking an action in the current state; $R$ is the reward function representing the feedback provided by the environment at each time step; and $\gamma$ is the discount factor that measures the importance of future rewards.
At each time step $t$, the policy function $\pi_{t}$ determines the action $a_{t}$ chosen by the agent in state $s_{t}$. The agent interacts with the environment through the policy $\pi$. After executing an action, the environment updates the state and provides the corresponding reward feedback $r_{t}$. The main goal of DRL is to learn the optimal policy by maximizing the cumulative reward. To this end, the state-value function $V_{\pi}$ and the state-action value function $Q_{\pi}$ are introduced to evaluate the policy quality and predict future rewards.
Specifically, the state-value function $V_{\pi}$ represents the expected cumulative reward obtained by starting from state $s$ and following policy $\pi$:

$V_{\pi}(s)=E_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\: |\: s_{t}=s\right]$ (1)

In contrast, the state-action value function $Q_{\pi}$ extends this concept by considering the expected cumulative reward obtained by taking a specific action $a$ in state $s$ and then following policy $\pi$:

$Q_{\pi}(s,\: a)=E_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k}r_{t+k}\: |\: s_{t}=s,\: a_{t}=a\right]$ (2)
The difference is that $V_{\pi}$ evaluates the value of being in a specific state,
while $Q_{\pi}$ evaluates the value of taking an action in a specific state.
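To make the notation above concrete, the short sketch below (a generic illustration, not the implementation used in this paper) rolls out a policy in an environment with a Gym-style reset/step interface and accumulates the discounted return whose expectation defines $V_{\pi}$ and $Q_{\pi}$. The `ToyEnv` and `random_policy` names are placeholders introduced here for illustration only.

```python
import random

class ToyEnv:
    """Placeholder environment with a Gym-style interface (illustrative only)."""
    def __init__(self, n_states=5, n_actions=3):
        self.n_states, self.n_actions = n_states, n_actions
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Random transition and reward stand in for the true dynamics P and R.
        self.state = random.randrange(self.n_states)
        reward = random.random()
        done = random.random() < 0.05
        return self.state, reward, done

def random_policy(state, n_actions=3):
    # A trivial stand-in for pi(a | s).
    return random.randrange(n_actions)

def discounted_return(env, policy, gamma=0.99, max_steps=1000):
    """Roll out one episode and accumulate sum_t gamma^t * r_t."""
    s, g, discount = env.reset(), 0.0, 1.0
    for _ in range(max_steps):
        s, r, done = env.step(policy(s))
        g += discount * r
        discount *= gamma
        if done:
            break
    return g

if __name__ == "__main__":
    env = ToyEnv()
    # Averaging many such returns from a fixed start state approximates V_pi(s0).
    returns = [discounted_return(env, random_policy) for _ in range(100)]
    print(sum(returns) / len(returns))
```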
2.2 Option-Critic Framework
The OC framework [27] is a policy gradient-based approach that simultaneously learns the intra-option policies and the option termination functions. The method allows the agent to employ a hierarchical decision structure in task learning and execution, enabling more accurate decisions over longer time horizons and thus better adaptability to complex power system environments. A Markov option $\omega\inΩ$ can be represented as a tuple $(I_{\omega},\: \pi_{\omega},\: \beta_{\omega})$, where $I_{\omega}$ is the initiation set, $\pi_{\omega}$ is the intra-option policy, and $\beta_{\omega}:S→[0,\: 1]$ defines the termination probability of the option at each state.
The OC framework selects an option according to the policy over options $\pi_{Ω}$ and executes the corresponding intra-option policy $\pi_{\omega}$ until the termination condition is met. Once the current option terminates, the next option is selected and the process continues. The intra-option policy $\pi_{\omega ,\: \theta}$ is parameterized by $\theta$, while the termination function $\beta_{\omega ,\: v}$ is parameterized by $v$. The state-option-action value function $Q_{U}(s_{t},\: \omega ,\: a)$ represents the expected cumulative return when option $\omega$ is selected in state $s_{t}$ and action $a$ is executed:

$Q_{U}(s_{t},\:\omega ,\: a)=r(s_{t},\: a)+\gamma\sum_{s_{t+1}}P(s_{t+1}\: |\: s_{t},\: a)U(\omega ,\: s_{t+1})$ (3)

In (3), the value of the state $s_{t+1}$ reached through the option $\omega$ can be expressed as:

$U(\omega ,\: s_{t+1})=(1-\beta_{\omega ,\: v}(s_{t+1}))Q_{Ω}(s_{t+1},\:\omega)+\beta_{\omega ,\: v}(s_{t+1})V_{Ω}(s_{t+1})$ (4)
where $Q_{Ω}(s_{t},\:\omega)=\sum_{a}\pi_{\omega ,\:\theta}(a\: |\: s_{t})Q_{U}(s_{t},\:\omega ,\: a)$ represents the option value function, and $V_{Ω}(s_{t})=\sum_{\omega}\pi_{Ω}(\omega\: |\: s_{t})Q_{Ω}(s_{t},\:\omega)$ represents the option-level state value function. In this case, the option $\omega$ terminates with probability $\beta_{\omega ,\: v}(s_{t+1})$, leading to the selection of a new option, or continues with probability $1-\beta_{\omega ,\: v}(s_{t+1})$.
The intra-option policies and the option termination functions can be learned by using the policy gradient theorem to maximize the expected discounted return. For the initial condition $(s_{0},\: \omega_{0})$, the gradient of the objective function with respect to the intra-option policy parameters $\theta$ can be expressed as:

$\dfrac{\partial Q_{Ω}(s_{0},\:\omega_{0})}{\partial\theta}=\sum_{s,\:\omega}\mu_{Ω}(s,\:\omega\: |\: s_{0},\:\omega_{0})\sum_{a}\dfrac{\partial\pi_{\omega ,\:\theta}(a\: |\: s)}{\partial\theta}Q_{U}(s,\:\omega ,\: a)$ (5)

Similarly, the gradient with respect to the termination parameters $v$ with initial condition $(s_{1},\: \omega_{0})$ is:

$\dfrac{\partial U(\omega_{0},\: s_{1})}{\partial v}=-\sum_{s',\:\omega}\mu_{Ω}(s',\:\omega\: |\: s_{1},\:\omega_{0})\dfrac{\partial\beta_{\omega ,\: v}(s')}{\partial v}A_{Ω}(s',\:\omega)$ (6)

where $\mu_{Ω}$ is the discounted weighting of state-option pairs along trajectories starting from the given initial condition, and $A_{Ω}(s,\: \omega)=Q_{Ω}(s,\: \omega)-V_{Ω}(s)$ is the advantage of choosing option $\omega$ in state $s$.
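The call-and-return execution pattern described above can be summarized in a few lines of code. The sketch below is a minimal tabular illustration, not the networks used in this paper: the environment interface, the preference tables standing in for $\pi_{Ω}$ and $\pi_{\omega ,\:\theta}$, and the `term_probs` table standing in for $\beta_{\omega ,\: v}$ are all assumptions of this sketch.

```python
import math
import random

def softmax_sample(prefs):
    """Sample an index from softmax(prefs)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    r, acc = random.random() * z, 0.0
    for i, e in enumerate(exps):
        acc += e
        if r <= acc:
            return i
    return len(exps) - 1

def run_option_critic_episode(env, option_prefs, intra_prefs, term_probs, max_steps=200):
    """
    Call-and-return execution of the OC framework:
      - option_prefs[s]       : preferences of the policy over options pi_Omega
      - intra_prefs[omega][s] : preferences of the intra-option policy pi_omega
      - term_probs[omega][s]  : termination probability beta_omega(s)
    The tables are illustrative stand-ins for parameterized networks, and env is
    assumed to expose reset() -> s and step(a) -> (s, reward, done).
    """
    s = env.reset()
    omega = softmax_sample(option_prefs[s])          # choose an initial option
    for _ in range(max_steps):
        a = softmax_sample(intra_prefs[omega][s])    # act with the intra-option policy
        s, _reward, done = env.step(a)
        if done:
            break
        if random.random() < term_probs[omega][s]:   # beta decides whether omega terminates
            omega = softmax_sample(option_prefs[s])  # if so, pick a new option
    return s
```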
2.3 Long Short-Term Memory
Long Short-Term Memory networks (LSTM) [25] are a variant of Recurrent Neural Networks (RNN) [26] that effectively capture feature relationships in time series data while addressing
the issues of vanishing and exploding gradients that RNN encounter when processing
long sequences. One of the main challenges in applying DRL to power system control
is finding the optimal control strategy within the vast action and state spaces. To
overcome this challenge, we introduce LSTM networks, which can capture time-related
information relevant to the target task in high-dimensional state spaces and prevent
issues such as gradient vanishing during training.
In the LSTM structure, the forget gate $f_{t}$, input gate $i_{t}$, input node $g_{t}$,
and output gate $o_{t}$ are defined at time $t$, with weights and bias values denoted
by $W$ and $b$, respectively. The sigmoid activation function and the tanh activation
function are represented by $\sigma$ and $\phi$. The following describes the input-output
mapping relationships of each node, with the specific workflow of the LSTM illustrated
in Fig. 1.
Fig. 1. LSTM network structure.
First, the forget gate $f_{t}$ uses the sigmoid activation function to determine which information to discard from the input $x_{t}$ and the previous output $h_{t-1}$:

$f_{t}=\sigma(W_{f}[h_{t-1},\: x_{t}]+b_{f})$ (7)

Next, the input gate $i_{t}$ similarly takes the input $x_{t}$ and the intermediate output $h_{t-1}$ from the previous time step and applies the sigmoid activation function to decide which information to store. It is combined with the input node $g_{t}$ to determine what to retain:

$i_{t}=\sigma(W_{i}[h_{t-1},\: x_{t}]+b_{i}),\:\:\: g_{t}=\phi(W_{g}[h_{t-1},\: x_{t}]+b_{g})$ (8)

Subsequently, the internal state $s_{t-1}$ is updated to the new state $s_{t}$ using the input gate $i_{t}$ and the forget gate $f_{t}$:

$s_{t}=f_{t}\odot s_{t-1}+i_{t}\odot g_{t}$ (9)

Finally, the input $x_{t}$ and the intermediate output $h_{t-1}$ serve as inputs to the output gate $o_{t}$, which is processed through a sigmoid activation function, and the final output $h_{t}$ is generated based on the internal state $s_{t}$:

$o_{t}=\sigma(W_{o}[h_{t-1},\: x_{t}]+b_{o}),\:\:\: h_{t}=o_{t}\odot\phi(s_{t})$ (10)
The LSTM model effectively maintains the dependencies between information by storing
node information in the internal state through the cooperation of internal nodes.
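For reference, the gate equations (7)-(10) can be written out as a single cell step. The NumPy sketch below is a minimal illustration; the weight shapes, the concatenation order of $h_{t-1}$ and $x_{t}$, and the random initialization are choices made for this sketch rather than details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, s_prev, W, b):
    """
    One LSTM step following Eqs. (7)-(10).
    W and b hold the parameters of the four gates; z = [h_{t-1}, x_t].
    """
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate, Eq. (7)
    i_t = sigmoid(W["i"] @ z + b["i"])      # input gate,  Eq. (8)
    g_t = np.tanh(W["g"] @ z + b["g"])      # input node,  Eq. (8)
    s_t = f_t * s_prev + i_t * g_t          # internal state update, Eq. (9)
    o_t = sigmoid(W["o"] @ z + b["o"])      # output gate, Eq. (10)
    h_t = o_t * np.tanh(s_t)                # final output, Eq. (10)
    return h_t, s_t

# Tiny usage example with random parameters (hidden size 4, input size 3).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_h, n_x = 4, 3
    W = {k: rng.normal(size=(n_h, n_h + n_x)) for k in "figo"}
    b = {k: np.zeros(n_h) for k in "figo"}
    h, s = np.zeros(n_h), np.zeros(n_h)
    h, s = lstm_cell_step(rng.normal(size=n_x), h, s, W, b)
    print(h)
```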
2.4 OC-LSTM : Option-Critic with Long Short-Term Memory
The OC-LSTM algorithm aims to solve the topology control problem in power systems
by combining the LSTM and Option-Critic frameworks, and its algorithm structure is
shown in Fig. 2.
First, OC-LSTM learns and processes state inputs in the power system through LSTM
networks. LSTM can capture a large amount of key state information in the power system
that contains temporal features, such as power generation, load profiles, and changes
in the grid topology, to help the agent better understand the dynamic features of
the system.
Fig. 2. The overall OC-LSTM framework.
Next, the extracted temporal state information is fed into the OC framework. In this framework, the policy over options selects an appropriate option based on the current state.
Then, the intra-option policy determines the specific action to be executed and uses
the termination function to decide whether to continue with the current option or
terminate it and select a new option based on the policy over options.
In this process, the critic evaluates the performance of the intra-option policy and
the termination function and updates the intra-option policy and the termination function
using the policy gradient to optimize the behavior of the agent.
By combining the ability of LSTM to extract temporal state features with the OC hierarchical
policy framework, the OC-LSTM algorithm effectively copes with the challenges posed
by the increase in state space and action dimensions in power system topology control
problems. LSTM is able to capture the dynamic changes of key state information in
the power system, thus providing rich temporal feature information for decision making.
Meanwhile, the OC framework introduces a hierarchical decision structure, which enables
the agent to select and execute control policies more efficiently in the face of complex
action spaces. This combination enhances the stability of the agent in changing environments.
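A condensed sketch of how such an architecture could be wired together is given below, combining an LSTM feature extractor with option-critic heads in the spirit of Fig. 2. It is an illustrative PyTorch assumption, not the authors' exact implementation: the layer sizes, the number of options, and the module names are placeholders (the observation and action dimensions in the usage example simply reuse the IEEE 14-Bus values from Table 1).

```python
import torch
import torch.nn as nn

class OCLSTMNetwork(nn.Module):
    """Illustrative OC-LSTM network: LSTM encoder followed by option-critic heads."""

    def __init__(self, obs_dim, n_actions, n_options, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)    # temporal feature extractor
        self.q_omega = nn.Linear(hidden, n_options)                # option values Q_Omega(s, .)
        self.intra_pi = nn.Linear(hidden, n_options * n_actions)   # intra-option policy logits
        self.beta = nn.Linear(hidden, n_options)                   # termination probabilities
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, obs_seq, hidden_state=None):
        # obs_seq: (batch, time, obs_dim); the last step's features drive the heads.
        feats, hidden_state = self.lstm(obs_seq, hidden_state)
        last = feats[:, -1, :]
        q = self.q_omega(last)
        pi_logits = self.intra_pi(last).view(-1, self.n_options, self.n_actions)
        term = torch.sigmoid(self.beta(last))
        return q, torch.softmax(pi_logits, dim=-1), term, hidden_state

# Tiny usage example with placeholder dimensions.
if __name__ == "__main__":
    net = OCLSTMNetwork(obs_dim=177, n_actions=21, n_options=4)
    q, pi, term, _ = net(torch.randn(1, 8, 177))
    print(q.shape, pi.shape, term.shape)
```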
3. Topology Control Problem and Simulation
Section 3 defines the power system topology control problem based on the Markov decision
process and details the simulation environment used in the experiments. Simulation
results verify the proposed method's performance and stability in power system topology
control.
3.1 Simulation environments
We will conduct simulations in three environments: IEEE 5-Bus, IEEE 14-Bus, and L2RPN
WCCI 2020, as shown in Fig. 3. These power system networks consist of generators (such as thermal power plants
and renewable energy facilities), loads (such as households and factories), and power
lines. Generators and loads are connected through substations, and different substations
are interconnected by power lines.
Fig. 3(a) shows the structure of the IEEE 5-Bus simulation environment, which includes 8 power lines, 3 loads, 2 generators, and 5 substations, with 20 scenarios. Each scenario uses a time interval of 5 minutes, allowing a maximum of 288 survival steps per day.
Fig. 3. Power system environments.
Fig. 3(b) presents the IEEE 14-Bus simulation environment, which includes 20 power lines, 11
loads, 6 generators, and 14 substations. The IEEE 14-Bus simulation environment contains
1004 scenarios, each simulating the internal changes of the French power grid over
28 consecutive days. Similar to the IEEE 5-Bus environment, this simulation environment
allows a maximum of 288 survival steps per day. Each step records information on grid
topology and generator output, with details provided in the observation space section.
Consecutive steps capture the dynamic changes within the power system.
Fig. 3(c) illustrates the L2RPN WCCI 2020 environment, which is based on part of the IEEE 118-Bus
network and includes 59 power lines, 37 loads, 22 generators, and 36 substations.
The simulation dataset for the L2RPN WCCI 2020 environment covers 240 years of time
data, with the same number of survival steps in its simulation setup as the two aforementioned
environments.
All the above power system simulations are modeled on the “Grid2Op” platform to ensure
standardized and consistent experiments [2,28].
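As a point of reference, creating one of these environments and stepping through a scenario on Grid2Op takes only a few lines. The sketch below is a minimal illustration: the dataset identifier is an assumed name for an IEEE 14-Bus dataset and should be checked against the Grid2Op documentation, and the loop simply executes the built-in do-nothing action.

```python
# Minimal sketch of loading a Grid2Op environment and counting survival steps
# under the do-nothing action; the dataset name is an assumption of this sketch.
import grid2op

env = grid2op.make("l2rpn_case14_sandbox")   # assumed IEEE 14-Bus dataset name

obs = env.reset()
done, survived_steps = False, 0
while not done:
    action = env.action_space({})            # empty dict = do nothing
    obs, reward, done, info = env.step(action)
    survived_steps += 1

print(f"Survived {survived_steps} steps (each step corresponds to 5 minutes).")
```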
3.2 Define topology control with MDP
Observation space : We set up a unified observation space for the power system environments, including v_or, v_ex, a_or, a_ex, ρ, and line_status, where v_or and v_ex denote the voltage values on the buses connected to the start and end points of each power line, a_or and a_ex denote the currents on the buses connected to the start and end points of each power line, ρ denotes the thermal capacity utilization rate of each power line, and line_status denotes the connection status of each power line. This design enables the agent to fully capture the critical dynamics of the power flows, providing an accurate decision basis for the topology control problem. The observation space dimensions of the three power system environments are shown in Table 1.
Table 1 Dimensions of observation and action space in three power system environments.
| Environment | Observation Space | Action Space |
| IEEE 5-Bus | 69 | 9 |
| IEEE 14-Bus | 177 | 21 |
| L2RPN WCCI 2020 | 531 | 60 |
Action space : We define the action space of the power system environment as change_line_status, which changes the status of a power line (connected/disconnected); this is a discrete action. This design is consistent with the goal of this study, namely achieving stable power system operation through topology control of power lines. The action space consists of $A=(A_{\text{Do Nothing}},\: A_{\text{line}})$, where $A_{\text{line}}$ denotes the number of power lines that can be connected or disconnected, and $A_{\text{Do Nothing}}$ denotes an action that performs no operation. The action space dimensions in the three power system simulation environments are shown in Table 1.
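The sketch below illustrates, under stated assumptions, how the observation fields listed above can be read from a Grid2Op observation and how a change_line_status action can be built. The dataset name is an assumed identifier for an IEEE 5-Bus dataset, and the resulting vector is only an illustration of concatenating these fields, not necessarily the exact observation encoding used in this paper.

```python
# Minimal sketch: extract the observation fields named above and toggle one line.
import numpy as np
import grid2op

env = grid2op.make("rte_case5_example")      # assumed IEEE 5-Bus dataset name
obs = env.reset()

# Concatenate the per-line quantities that make up the observation space above.
state_vector = np.concatenate([
    obs.v_or, obs.v_ex,                      # voltages at each line's start / end bus
    obs.a_or, obs.a_ex,                      # currents at each line's start / end bus
    obs.rho,                                 # line capacity utilization rate
    obs.line_status.astype(np.float32),      # connection status of each line
])
print(state_vector.shape)

# Toggle the connection status of line 0 (a discrete topology-control action).
action = env.action_space({"change_line_status": [0]})
obs, reward, done, info = env.step(action)
```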
Reward function : We define the reward function for the power system environment as shown in Equation (12), which consists of two parts: feedback on the agent's behavior and the utilization rate of line capacity. Specifically, if the agent's policy causes the power system environment to terminate prematurely, or if illegal actions are taken, the agent receives the minimum reward of 0. Otherwise, the reward for line capacity utilization is computed as shown in Equation (11), where $\rho_{i}$ denotes the utilization rate of each power line ($\rho_{i}\in(0,\: 1)$) and $line_{i}$ denotes the connection status of each power line. This term reflects the capacity utilization of the power lines: when the utilization of all lines is close to the maximum capacity, the reward value is close to 0, and conversely, the lower the line utilization, the higher the reward. This design encourages the agent to avoid overloading the power lines, thus maintaining the stability of the power system. In Equation (12), $x\in(0,\: 1)$ denotes the line-capacity reward given by Equation (11), $A^{illegal\; or\; ambiguous}$ represents the set of illegal or ambiguous actions as defined by the Grid2Op simulation platform, and $S^{error}$ represents premature termination of the environment.
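Because the exact closed forms of Equations (11) and (12) are not reproduced here, the sketch below implements only one plausible reading of the description above: zero reward on illegal or ambiguous actions or premature termination, and otherwise a value in (0, 1) that grows as the utilization of the connected lines decreases. The mean-based form of the utilization term is an assumption of this sketch, not necessarily the exact Equation (11).

```python
import numpy as np

def line_capacity_reward(rho, line_status):
    """
    One plausible utilization term consistent with the description of Eq. (11):
    close to 0 when connected lines run near their thermal limits, and higher
    as their utilization drops. The mean-based form is an assumption.
    """
    rho = np.clip(np.asarray(rho, dtype=float), 0.0, 1.0)
    connected = np.asarray(line_status, dtype=bool)
    if not connected.any():
        return 0.0
    return float(np.mean(1.0 - rho[connected]))

def reward(rho, line_status, illegal_or_ambiguous=False, env_error=False):
    """Eq. (12)-style case split: 0 on illegal actions or premature termination."""
    if illegal_or_ambiguous or env_error:
        return 0.0
    return line_capacity_reward(rho, line_status)

# Example: two lightly loaded lines and one heavily loaded line, all connected.
print(reward([0.2, 0.3, 0.95], [True, True, True]))
```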
To ensure that the agent learns to control the power grid safely and effectively,
it operates within a strictly constrained environment to maintain the grid's normal
operation. These constraints correspond to real-world operational limitations, and
violations will result in the termination of the environment. The constraints include:
1. Disconnecting too many power lines, causing an inability to meet load demand.
2. Interrupting the connection between generators and the grid.
3. Creating grid islands due to topological changes.
3.3 Simulation results
Fig. 4 presents the training results across three different power system environments, where
we compare the OC-LSTM algorithm with two typical DRL baseline algorithms: PPO and
DDDQN. DDDQN is an improved version of the DQN algorithm, which significantly outperforms
the original DQN in terms of performance. On the other hand, PPO is a widely used
policy optimization algorithm that maintains training stability and sampling efficiency
while gradually converging to the optimal policy. Both baseline algorithms are suitable
for handling high-dimensional state spaces and discrete action spaces. To ensure the robustness of the results, each algorithm was trained three times.
Fig. 4. Training results of OC-LSTM (ours) and baseline algorithms in (a) IEEE 5-Bus, (b) IEEE 14-Bus and (c) L2RPN WCCI 2020 environments.
In the training process, we use the survival steps as the evaluation criterion, as this metric intuitively reflects the agent's control capability.
In the simulation, the maximum operation time of the power system is set to 4 days,
with an environment sampling interval of 5 minutes. Therefore, the maximum survival
steps per episode is 1152. A longer survival step indicates that the agent can maintain
stable operation in the power system for a longer time through topology control, reflecting
stronger topology control performance. Conversely, poor topology control policies
will struggle to maintain the stable operation of the power system, resulting in a
lower number of survival steps.
3.3.1 Training
As shown in Fig. 4(a), the OC-LSTM algorithm and the PPO algorithm achieve similar survival steps in the IEEE 5-Bus environment. This indicates that both algorithms are effective in maintaining grid stability in topology control tasks for small-scale grids. In contrast, the DDDQN algorithm performs poorly in this setting, possibly because its decision making in topology control is more limited and tends to fall into local optima.
Fig. 4(b) shows the results in the IEEE 14-Bus environment. In this medium-scale power system,
the OC-LSTM algorithm significantly outperforms PPO and DDDQN and is able to maintain
stability in the number of surviving steps, reaching more than 720 survival steps,
which is equivalent to about 60 hours of operation time. This further validates that
the OC-LSTM algorithm can still maintain power system stability even as the state
and action spaces of the environment increase.
Fig. 4(c) shows the training results in the L2RPN WCCI 2020 environment. Even in this large-scale power grid environment, the OC-LSTM algorithm still demonstrates better stability, significantly outperforming PPO and DDDQN, and is able to maintain a large number of survival steps. This demonstrates that the OC-LSTM algorithm can keep the grid operating stably through effective topology control even in a more complex power system.
3.3.2 Testing
The comprehensive analysis of the experimental results in small-scale (IEEE 5-Bus),
medium-scale (IEEE 14-Bus), and large-scale (L2RPN WCCI 2020) environments shows that
the OC-LSTM algorithm still maintains a good performance with the increase of the
state space and action space of the power system. Compared with the baseline algorithms
PPO and DDDQN, the OC-LSTM shows superior stability and control capability in different
scales of grid topology control tasks. This further demonstrates the effectiveness
of introducing a hierarchical policy architecture. Table 2 presents the mean and standard deviation of the survival steps for the agent in three
different power system environments after the algorithm converges. Each survival step
corresponds to 5 minutes, with a total of 288 survival steps representing one day.
These data reflect the agent's ability to maintain stable performance in three different
scales of power systems.
Table 2 The training results of agents in three power system environments. The data
includes mean and standard deviation, representing the training results of the agents
in different environments. Each step represents 5 minutes, with 288 steps in a day.
| Environment | OC-LSTM (unit: step) | PPO (unit: step) | DDDQN (unit: step) |
| IEEE 5-Bus | 894 ± 43 | 803 ± 71 | 224 ± 23 |
| IEEE 14-Bus | 728 ± 98 | 378 ± 60 | 170 ± 59 |
| L2RPN WCCI 2020 | 825 ± 61 | 77 ± 12 | 265 ± 28 |
To further validate the improved feature extraction capability of the OC algorithm
combined with LSTM in power systems, we performed ablation experiments in the L2RPN
WCCI 2020 environment. Fig. 5 compares the performance of the OC-LSTM algorithm with that of the standard OC algorithm. The results show that the OC-LSTM algorithm achieves a significantly higher number of survival steps, even though its convergence is slightly slower than that of the OC algorithm using a linear layer. This result demonstrates the advantage of LSTM in extracting power system temporal features, which further enhances the control ability of the agent in complex grid environments.
We randomly selected 5 scenarios for evaluation in the three power system environments of different scales, using the optimal model and comparing the OC-LSTM algorithm with the Do Nothing action. The Do Nothing action takes no action in the power system throughout the scenario, showing the maximum survival steps of the power system without any intervention. Comparing against the Do Nothing action therefore tests the effectiveness of the topology control learned by the model.
Fig. 5. Ablation experiment: L2RPN WCCI 2020 environment.
As shown in Fig. 6(a), in the IEEE 5-Bus environment, the OC-LSTM algorithm consistently enables the power system to operate stably for over 60 hours across all scenarios.
Fig. 6(b) presents the evaluation results in the IEEE 14-Bus environment. The OC-LSTM algorithm significantly outperforms the Do Nothing action in all scenarios, helping the system operate for more than 60 hours.
Similarly, in the L2RPN WCCI 2020 environment (as shown in Fig. 6(c)), the OC-LSTM algorithm performs excellently, maintaining stable operation of the large power system for over 60 hours and clearly outperforming the Do Nothing action.
Fig. 6. Testing results of OC-LSTM algorithm.
4. Conclusion
This research proposes a novel OC-LSTM algorithm for topology control problems in
power systems. The algorithm achieves stable control for up to 60 hours in power system
environments such as IEEE 5-Bus, IEEE 14-Bus, and L2RPN WCCI 2020, and all operations
are performed without the intervention of human experts. Compared with the baseline deep reinforcement learning (DRL) algorithms, the OC-LSTM algorithm is better suited to power system applications with high-dimensional state and action spaces because it introduces a hierarchical policy and extracts temporal features with the LSTM. The LSTM efficiently captures the dynamics of the power system and extracts key temporal features, enabling the agent to better comprehend environmental information and make accurate decisions. This combination not only enhances the performance of OC-LSTM in complex power system environments, but also significantly improves its applicability to power systems.
Future research can focus on further optimizing the intra-option policy in the OC-LSTM
algorithm to enhance its control efficiency in environments where large-scale renewable
energy and power storage devices are deployed. This will help to increase the value
of OC-LSTM applications in real power systems and provide stronger support for sustainable
energy management.
Acknowledgements
This research was supported in part by KEPCO under the project entitled “Development of GW class voltage sourced DC linkage technology for improved interconnectivity and carrying capacity of wind power in the Sinan and southwest regions” (R22TA12), and in part by the Institute of Information & communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (Ministry of Science and ICT, MSIT) (RS-2020-II201373).
References
Z. Zhang, D. Zhang and R. C. Qiu, “Deep reinforcement learning for power system applications:
An overview,” CSEE Journal of Power and Energy Systems, vol. 6, no. 1, pp. 213–225,
2019. DOI:10.17775/CSEEJPES.2019.00920

A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad,
I. Guyon, P. Panciatici and C. Romero, “Learning to run a power network challenge:
a retrospective analysis,” in NeurIPS 2020 Competition and Demonstration Track. PMLR,
pp. 112–132, 2021. https://proceedings.mlr.press/v133/marot21a.html

D. Ernst, M. Glavic and L. Wehenkel, “Power systems stability control: reinforcement
learning framework,” IEEE Transactions on Power Systems, vol. 19, no. 1, pp. 427–435,
2004. DOI:10.1109/TPWRS.2003.821457

W. Cai, H. N. Esfahani, A. B. Kordabad and S. Gros, “Optimal management of the peak
power penalty for smart grids using mpc-based reinforcement learning,” in 2021 60th
IEEE Conference on Decision and Control (CDC), pp. 6365–6370, 2021. DOI:10.1109/CDC45484.2021.9683333

M. Kamel, R. Dai, Y. Wang, F. Li and G. Liu, “Data-driven and model-based hybrid reinforcement
learning to reduce stress on power systems branches,” CSEE Journal of Power and Energy
Systems, vol. 7, no. 3, pp. 433–442, 2021. DOI:10.17775/CSEEJPES.2020.04570

J. Li, S. Chen, X. Wang and T. Pu, “Load shedding control strategy in power grid emergency
state based on deep reinforcement learning,” CSEE Journal of Power and Energy Systems,
vol. 8, no. 4, pp. 1175–1182, 2021. DOI:10.17775/CSEEJPES.2020.06120

H. Yousuf, A. Y. Zainal, M. Alshurideh and S. A. Salloum, “Artificial intelligence
models in power system analysis,” in Artificial Intelligence for Sustainable Development:
Theory, Practice and Future Applications. Springer, vol. 912, pp. 231–242, 2020. DOI:10.1007/978-3-030-51920-9_12

C. Zhao, U. Topcu, N. Li and S. Low, “Design and stability of load-side primary frequency
control in power systems,” IEEE Transactions on Automatic Control, vol. 59, no. 5,
pp. 1177–1189, 2014. DOI:10.1109/TAC.2014.2298140

Y. Zhang, X. Shi, H. Zhang, Y. Cao and V. Terzija, “Review on deep learning applications
in frequency analysis and control of modern power system,” International Journal of
Electrical Power & Energy Systems, vol. 136, no. 107744, pp. 1–18, 2022. DOI:10.1016/j.ijepes.2021.107744

A. K. Ozcanli, F. Yaprakdal and M. Baysal, “Deep learning methods and applications
for electrical power systems: A comprehensive review,” International Journal of Energy
Research, vol. 44, no. 9, pp. 7136–7157, 2020. DOI:10.1002/er.5331

D. Yoon, S. Hong, B. J. Lee and K. E. Kim, “Winning the l2rpn challenge: Power grid
management via semi-markov afterstate actor-critic,” in 9th International Conference
on Learning Representations, ICLR 2021, pp. 1–12, 2021. https://openreview.net/forum?id=LmUJqB1Cz8

M. Subramanian, J. Viebahn, S. H. Tindemans, B. Donnot and A. Marot, “Exploring grid
topology reconfiguration using a simple deep reinforcement learning approach,” in
2021 IEEE Madrid PowerTech, pp. 1–6, 2021. DOI:10.1109/PowerTech46648.2021.9494879

I. Damjanović, I. Pavić, M. Brčić and R. Jerčić, “High performance computing
reinforcement learning framework for power system control,” in 2023 IEEE Power & Energy
Society Innovative Smart Grid Technologies Conference (ISGT). IEEE, pp. 1–5, 2023.
DOI:10.1109/ISGT51731.2023.10066416

J. Schulman, F. Wolski, P. Dhariwal, A. Radford and O. Klimov, “Proximal policy optimization
algorithms,” arXiv preprint arXiv:1707.06347, 2017. DOI:10.48550/arXiv.1707.06347

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through
deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015. DOI:10.1038/nature14236

T. Haarnoja, A. Zhou, P. Abbeel and S. Levine, “Soft actor-critic: Off-policy maximum
entropy deep reinforcement learning with a stochastic actor,” in International Conference
on Machine Learning. PMLR, pp. 1861–1870, 2018.

Y. Liu, D. Zhang and H. B. Gooi, “Optimization strategy based on deep reinforcement
learning for home energy management,” CSEE Journal of Power and Energy Systems, vol.
6, no. 3, pp. 572–582, 2020. DOI:10.17775/CSEEJPES.2019.02890

Y. Zhou, B. Zhang, C. Xu, T. Lan, R. Diao, D. Shi, Z. Wang and W. -J. Lee, “A data-driven
method for fast ac optimal power flow solutions via deep reinforcement learning,”
Journal of Modern Power Systems and Clean Energy, vol. 8, no. 6, pp. 1128–1139, 2020.
DOI:10.35833/MPCE.2020.000522

B. Zhang, W. Hu, D. Cao, T. Li, Z. Zhang, Z. Chen and F. Blaabjerg, “Soft actor-critic-based
multi-objective optimized energy conversion and management strategy for integrated
energy systems with renewable energy,” Energy Conversion and Management, vol. 243,
no. 114381, pp. 1–15, 2021. DOI:10.1016/j.enconman.2021.114381

G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang and W. Zaremba,
“Openai gym,” arXiv preprint arXiv:1606.01540, 2016. DOI:10.48550/arXiv.1606.01540

M. Lehna, J. Viebahn, A. Marot, S. Tomforde and C. Scholz, “Managing power grids through
topology actions: A comparative study between advanced rule-based and reinforcement
learning agents,” Energy and AI, vol. 14, no. 100276, pp. 1–11, 2023. DOI:10.1016/j.egyai.2023.100276

I. Damjanović, I. Pavić, M. Puljiz and M. Brcic, “Deep reinforcement learning-based
approach for autonomous power flow control using only topology changes,” Energies,
vol. 15, no. 19, pp. 1–16, 2022. DOI:10.3390/en15196920

X. Han, Y. Hao, Z. Chong, S. Ma and C. Mu, “Deep reinforcement learning based autonomous
control approach for power system topology optimization,” in 2022 41st Chinese Control
Conference (CCC), pp. 6041–6046, 2022. DOI:10.23919/CCC55666.2022.9902073

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra and M. Riedmiller,
“Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602,
2013.

K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink and J. Schmidhuber, “LSTM:
A search space odyssey,” IEEE Transactions on Neural Networks and Learning Systems,
vol. 28, no. 10, pp. 2222–2232, 2016. DOI:10.1109/TNNLS.2016.2582924

Z. C. Lipton, J. Berkowitz and C. Elkan, “A critical review of recurrent neural networks
for sequence learning,” arXiv preprint arXiv:1506.00019, 2015. DOI:10.48550/arXiv.1506.00019

P. -L. Bacon, J. Harb and D. Precup, “The option-critic architecture,” in Proceedings
of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, pp. 1726–1734,
2017. DOI:10.1609/aaai.v31i1.10916

B. Donnot, “Grid2op- A testbed platform to model sequential decision making in power
systems,” 2020. https://GitHub.com/rte-france/grid2op

Biographies
Chen Wang received the B.S. degree in electronics and computer engineering and the
M.S. degree in electronic computer engineering from Chonnam National University, South
Korea, in 2020 and 2022. He is currently pursuing the Ph.D. degree in electrical engineering
at Hanyang University, Seoul, South Korea. His research interests include smart grid,
deep reinforcement learning, and their applications.
Haotian Zhang received the B.S. degree in mechanical engineering from Qingdao University
of Science and Technology, Qingdao, China, and Hanyang University, Ansan, South Korea,
in 2022. He is currently pursuing the Ph.D. degree in electrical engineering at Hanyang
University, Seoul, South Korea. His research interests include optimal control, smart
grid, deep reinforcement learning, and their applications.
Minju Lee received the B.S. degree in climate and energy systems engineering from
Ewha Womans University, Seoul, South Korea, in 2022, where she is currently pursuing
the degree with the Department of Climate and Energy Systems Engineering. Her research
interests include short-term wind power forecasting and the probabilistic estimation
of transmission congestion for grid integration.
Myoung Hoon Lee received the B.S. degree in electrical engineering from Kyungpook
National University, Daegu, South Korea, in 2016, and the Ph.D. degree in electrical
engineering from the Ulsan National Institute of Science and Technology, Ulsan, South
Korea, in 2021. From 2021 to 2023, he was a Postdoctoral Research Fellow with the
Research Institute of Electrical and Computer Engineering, Hanyang University, Seoul,
South Korea. He is currently an Assistant Professor with the Department of Electrical
Engineering, Incheon National University, Incheon, South Korea. His research interests
include decentralized optimal control, mean field games, deep reinforcement learning,
and their applications.
Jun Moon is currently an Associate Professor in the Department of Electrical Engineering
at Hanyang University, Seoul, South Korea. He received the B.S. degree in electrical
and computer engineering, and the M.S. degree in electrical engineering from Hanyang
University, Seoul, South Korea, in 2006 and 2008, respectively. He received the Ph.D.
degree in electrical and computer engineering from University of Illinois at Urbana-Champaign,
USA, in 2015. From 2008 to 2011, he was a researcher at Agency for Defense Development
(ADD) in South Korea. From 2016 to 2019, he was with the School of Electrical and
Computer Engineering, Ulsan National Institute of Science and Technology (UNIST),
South Korea, as an assistant professor. From 2019 to 2020, he was with the School
of Electrical and Computer Engineering, University of Seoul, South Korea, as an associate
professor. He is a recipient of the Fulbright Graduate Study Award 2011. His research
interests include stochastic optimal control and filtering, reinforcement learning,
data-driven control, distributed control, networked control systems, and mean field
games.