1. Introduction
Real estate markets play a crucial role in national and regional economies, influencing
investment decisions, policy formulation, financial risk management, and regional
economic disparities [1]. As housing prices directly affect household wealth and borrowing capacity, accurate
forecasting of real estate price dynamics has become increasingly important. However,
real estate price forecasting remains challenging due to the nature of transaction
data. Unlike conventional financial time-series, real estate transactions are recorded
only when they occur, resulting in irregular observation intervals and extended periods
without observations. This event-driven structure produces discontinuous time-series,
where underlying price dynamics evolve continuously but are not directly observed.
Consequently, model performance is often constrained not by model capacity but by
the quality and structure of the input data.
The Korean real estate market provides a suitable setting for addressing this issue.
A nationwide transaction price disclosure system enables large-scale analysis at the
individual asset level, while the market is largely centered on apartment complexes,
where units within the same complex share similar physical and locational characteristics
[2]. In addition, housing prices exhibit regional co-movement, indicating that price
dynamics are influenced by both temporal trends and spatial interactions.
Despite these advantages, the event-driven structure of transaction data still introduces
fundamental challenges for time-series modeling. Most forecasting models assume regularly
spaced observations, but irregular intervals disrupt temporal continuity and hinder
the learning of stable patterns and long-term dependencies. Existing approaches, such
as removing incomplete observations or applying interpolation, partially address this
issue but often fail to capture underlying market dynamics effectively.
This study addresses this limitation by proposing a volatility-aware reconstruction
method that transforms fragmented transaction records into continuous apartment-level
time-series. Transaction data are reorganized into monthly sequences, and unobserved
intervals are reconstructed by combining local temporal continuity with region-level
price dynamics. The proposed approach enables the reconstructed series to better reflect
underlying market behavior while preserving temporal consistency.
The main contributions of this study are as follows. First, we propose a data-centric
approach that transforms event-driven transaction data into continuous sequences suitable
for time-series forecasting. Second, we introduce a volatility-aware reconstruction
method that integrates regional price dynamics to restore unobserved intervals. Third,
we demonstrate that improved temporal consistency leads to performance gains across
models and regions.
The remainder of this paper is organized as follows. Section 2 reviews related work.
Section 3 presents the proposed methodology. Section 4 describes the experimental
design and results. Section 5 concludes the paper.
2. Related Work
Real estate price forecasting has evolved from statistical models to machine learning
and deep learning approaches, with increasing emphasis on temporal dynamics, spatial
relationships, and the integration of diverse data sources.
2.1 Global Advances in Real Estate Price Forecasting
Early studies focus on statistical approaches such as hedonic price models and multiple
regression analysis, which explain housing prices based on structural and locational
attributes [3], [4]. While these models provide interpretability, they have limited ability to capture
nonlinear relationships. Time-series models such as ARIMA are used to capture temporal
dependencies in aggregated data [5].
To address nonlinear patterns in the real estate market, machine learning methods
including Random Forest, boosting, and support vector machines are widely applied
and outperform traditional regression-based approaches [6]. As real estate prices are increasingly interpreted as time-evolving processes, deep
learning models such as RNN, LSTM, and GRU have emerged as effective tools for capturing
sequential dependencies and long-term temporal patterns [7]. Hybrid approaches combining statistical and deep learning models have also been
explored to capture both linear and nonlinear structures [8].
More recent studies integrate diverse data sources and adopt deep learning-based sequence
models to capture complex temporal patterns. For example, Chiu applied an LSTM model
to forecast housing prices using housing price index data and related variables, demonstrating
the effectiveness of deep learning in capturing temporal patterns [9]. Kishor also examined house price forecasting using macroeconomic fundamentals, credit
conditions, and supply indicators, highlighting the importance of financial and supply-side
factors [10].
In particular, recent work incorporates spatial relationships alongside temporal modeling.
Graph-based approaches capture interactions between housing units or regions, while
external features such as transportation accessibility and socioeconomic indicators
are used to explain price variation. Ge proposed a combined LSTM and Graph CNN framework
to model both temporal trends and spatial dependence [11], and Moghimi et al. developed a graph-based model to address spatial and temporal
irregularities in real estate data [12]. These studies emphasize the importance of jointly modeling temporal and spatial
dependencies.
Recent studies address incomplete observations and data sparsity from a spatial perspective,
such as spatial interpolation, geostatistical modeling, and sample expansion. Kim
et al. explored machine learning and spatial interpolation methods to estimate house
prices in locations without transaction records [13]. Sellam et al. proposed a multi-head gated attention model for spatial interpolation,
while Cellmer and Kobylińska combined machine learning with geostatistical methods
to incorporate spatial effects in housing price prediction [14], [15]. Zhang et al. further addressed sparsity in housing price index construction by expanding
usable samples through spatial relationships between housing units [16].
While these approaches improve estimation in sparse settings, they focus on spatial
relationships rather than reconstructing irregular transaction records into continuous
time-series. Specifically, limited attention has been given to reconstructing irregular
transaction data into continuous apartment-level time-series by jointly modeling temporal
continuity and regional market dynamics. To address this gap, this study proposes
a volatility-aware reconstruction method that transforms sparse transaction records
into continuous apartment-level time-series.
2.2 Evolution of Real Estate Price Forecasting in Korea
In Korea, real estate forecasting studies have developed under a transaction-based
data environment shaped by the real estate transaction price disclosure system. This
system provides detailed records of individual property transactions. However, observations
occur only when transactions take place, resulting in inherently irregular and sparse
time-series at the apartment level.
Due to this structure, early studies relied on aggregated price indices rather
than raw transaction data [17]. By aggregating transaction records at the regional level, these approaches transform
irregular observations into regularly structured time-series suitable for conventional
models. Unlike many global studies that rely on aggregated indices, later Korean studies
frequently utilize transaction-level data directly, which leads to irregular and sparse
observation patterns.
Subsequent studies incorporate macroeconomic and property-level variables to improve
predictive performance. Bae and Yu integrated macroeconomic indicators such as interest
rates and price indices with apartment-level features [18]. Spatial and regional factors are widely considered, reflecting the influence of
socioeconomic conditions, infrastructure, and accessibility on housing prices [19]. To support model training, transaction data are often reorganized into structured
formats, such as monthly aggregated time-series at the regional level [20]. Recent studies apply deep learning models with regional information, showing that
prediction performance is sensitive to spatial unit definitions and local contextual
features. Other studies incorporate spatial context using surrounding facility information,
demonstrating that regional characteristics improve prediction accuracy [21].
Despite these efforts, most Korean studies rely on aggregated regional time-series,
spatial estimation, or structured inputs. As a result, transaction data remain sparse
and discontinuous at the individual apartment level, limiting the ability to capture
continuous price dynamics over time. Motivated by this limitation, this study reconstructs
transaction-driven data into continuous apartment-level time-series by explicitly
modeling event-driven sparsity and incorporating regional price dynamics. This approach
differs from prior methods that primarily rely on spatial estimation or aggregated
representations, enabling more realistic modeling of apartment-level price trajectories.
3. Methodology
This section presents a framework for constructing apartment-level time-series from
irregular, event-driven transaction records. The objective is to transform fragmented
observations into structured representations suitable for predictive modeling. Unlike
conventional approaches that treat missing values as isolated issues, the proposed
method focuses on reconstructing temporal continuity.
Figure 1 illustrates the overall process, in which raw transaction records are transformed
into structured time-series representations. The proposed framework consists of three
stages: data preprocessing, volatility-aware reconstruction, and model development.
Each stage addresses key limitations of transaction data, including heterogeneity
across sources, irregular observation intervals, and unobserved periods.
Fig. 1. Overview of the proposed data construction approach for apartment-level time-series
forecasting
3.1 Data Description
The dataset used in this study is obtained from a publicly available Kaggle dataset
constructed from real estate transaction records collected via a Korean public API
[22]. The data include apartment-level transaction information such as transaction prices
and transaction dates. Samples corresponding to three metropolitan cities, namely
Seoul, Busan, and Daegu, are used for analysis. The study period spans from January
2015 to April 2023, covering a total of 100 months. The dataset is organized as a
monthly panel, where each apartment unit forms a fixed-length time-series.
Figure 2 shows the distribution of observed months per apartment unit across the three regions
before applying the minimum-observation filtering criterion. The distribution is highly
concentrated in low-observation intervals, indicating that many apartment units have
only a small number of observed transaction months. This pattern reflects the event-driven
sparsity and irregular temporal structure of the original transaction data.
Fig. 2. Distribution of observed transaction months per apartment unit in the original
dataset
To construct a reliable experimental dataset for sequence-based forecasting, apartment
units with fewer than 50 observed months were excluded. After applying this filtering
criterion, the final dataset consists of 1,486 apartment units and 148,600 monthly
observations, including both observed and missing entries, of which 90,966 correspond
to actual transaction records.
At the regional level, the dataset includes 546 apartment units and 33,124 transactions
in Seoul, 441 units and 27,123 transactions in Busan, and 499 units and 30,719 transactions
in Daegu. Due to the event-driven nature of real estate transactions, observations
are recorded only when transactions occur, resulting in substantial sparsity at the
apartment level. On average, each apartment unit has approximately 61 observed months
out of the total 100-month period, corresponding to an observation density of 0.61
and a missing rate of 0.39.
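The reported density follows directly from the counts given above. As a quick check of the arithmetic (a restatement of the stated figures, not an additional result):

```latex
\[
1{,}486 \times 100 = 148{,}600, \qquad
33{,}124 + 27{,}123 + 30{,}719 = 90{,}966, \qquad
\frac{90{,}966}{148{,}600} \approx 0.612
\]
```

The ratio rounds to the stated observation density of 0.61, and its complement to the missing rate of 0.39.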
3.2 Data Preprocessing
Locational and structural factors are essential determinants of housing prices in
the Korean market. Variables reflecting transportation accessibility, educational
infrastructure, and building characteristics are therefore included to capture both
regional context and property-specific attributes. Transaction records and external
data are integrated into a unified structure. Because data sources differ in spatial
identifiers, temporal formats, and internal structures, a consistent analytical schema
is defined.
Temporal and spatial standardization is performed so that transaction dates and regions
are represented consistently across sources. Transaction dates are converted into monthly timestamps,
and regional identifiers are unified to ensure consistent matching. External data
are transformed into the same spatial and temporal units. School data are aggregated
at the regional level, and subway accessibility data are reshaped into a monthly format.
These variables are then merged with transaction records using region and time as
common keys, resulting in a dataset that integrates transaction information with locational
context.
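The merge on region and time can be illustrated with a minimal sketch. The field names (`region`, `date`, `n_schools`, `subway_lines`), the district labels, and the helper functions are hypothetical and chosen only for illustration; the actual schema is not specified in the text.

```python
def to_month(date_str):
    """Convert an ISO date such as '2019-03-17' to a monthly key '2019-03'."""
    return date_str[:7]

def merge_on_region_month(transactions, external):
    """Left-join external features onto transactions by (region, month).

    transactions: list of dicts with 'region' and 'date' fields (hypothetical names).
    external: dict mapping (region, month) -> dict of feature values.
    Rows without a matching key keep only their own fields.
    """
    merged = []
    for row in transactions:
        month = to_month(row["date"])
        features = external.get((row["region"], month), {})
        merged.append({**row, "month": month, **features})
    return merged

# Illustrative records only; these are not values from the dataset.
transactions = [
    {"auid": "A-001", "region": "district-A", "date": "2019-03-17", "price": 98000},
    {"auid": "B-002", "region": "district-B", "date": "2019-04-02", "price": 41000},
]
external = {("district-A", "2019-03"): {"n_schools": 42, "subway_lines": 3}}
merged = merge_on_region_month(transactions, external)
```

A left join is used here so that transactions without matching external data are retained rather than dropped; whether the original pipeline does the same is an assumption.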
Finally, all variables are organized into a consistent data structure. The dataset
is sorted by apartment attributes and time in preparation for sequence construction.
Although this process resolves inconsistencies across sources, transaction records
remain sparse, as observations exist only when transactions occur. Therefore, an additional
reconstruction step is required to generate continuous time-series.
3.3 Volatility-Aware Data Reconstruction
This stage transforms irregular transaction records into continuous apartment-level
time-series. Because a record exists only when a sale occurs, unobserved intervals
must be handled before sequence-based models can be applied.
The Apartment-Unit Identifier (AUID) is a unique code that defines the smallest unit
of each apartment and is assigned based on location, apartment complex, and exclusive
area. Transactions within the same month are aggregated into a single observation,
and monthly sequences are constructed over the full study period.
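The construction of aligned monthly sequences can be sketched as follows. This is a minimal illustration under stated assumptions: same-month transactions are averaged (the text says only that they are aggregated), the study period starts in January 2015 and spans 100 months as in Section 3.1, and the function names and AUID strings are hypothetical.

```python
import numpy as np

def month_index(year, month, start_year=2015, start_month=1):
    """Map a calendar month to a 0-based position in the study period."""
    return (year - start_year) * 12 + (month - start_month)

def build_monthly_panel(records, n_months=100):
    """Return {auid: array of length n_months} with NaN for unobserved months.

    records: iterable of (auid, year, month, price) tuples.
    Same-month transactions are averaged (an assumption).
    """
    sums, counts = {}, {}
    for auid, year, month, price in records:
        t = month_index(year, month)
        if not 0 <= t < n_months:
            continue  # outside the study period
        sums.setdefault(auid, np.zeros(n_months))[t] += price
        counts.setdefault(auid, np.zeros(n_months))[t] += 1
    panel = {}
    for auid in sums:
        # 0/0 is expected for unobserved months; those entries become NaN.
        with np.errstate(invalid="ignore", divide="ignore"):
            panel[auid] = np.where(counts[auid] > 0, sums[auid] / counts[auid], np.nan)
    return panel

# Illustrative records only.
records = [
    ("A-001", 2015, 1, 50000.0),
    ("A-001", 2015, 1, 52000.0),
    ("A-001", 2015, 4, 53000.0),
    ("B-002", 2023, 4, 30000.0),
]
panel = build_monthly_panel(records)
```

Representing unobserved months as NaN keeps every AUID on the same 100-step time axis, which is what later allows reconstruction to be applied only where no transaction exists.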
This process converts irregular transaction records into aligned apartment-level panels,
where unobserved months are explicitly represented. Each apartment sequence is linked
to an administrative regional identifier to preserve spatial context. In this study,
experiments are conducted separately for each metropolitan city (Seoul, Busan, and
Daegu), with regional information defined at the district level (Si-Gun-Gu), representing
the administrative subdivision within each city. Accordingly, each apartment is associated
with a corresponding district within its city.
To ensure data reliability, apartments with extremely sparse observations (fewer than
50 observed months; Section 3.1) are excluded. Although this process organizes irregular
transaction records into structured sequences, it does not supply prices for months
without transactions. A reconstruction method is therefore required to estimate these
unobserved values and recover continuous price trajectories.
The reconstruction integrates two complementary components, a local estimate and a
regional estimate. The local estimate captures
price changes based on the transaction history of an individual apartment, while the
regional estimate reflects overall market trends shared across apartments within the
same region. By combining these two components, the method aims to estimate realistic
price movements during months without transactions.
The reconstructed price is defined as
$$\hat{P} = w_{lin} P_{lin} + w_{reg} P_{reg}, \qquad w_{lin} + w_{reg} = 1,$$
where $\hat{P}$ is the reconstructed price, $P_{lin}$ denotes the local estimate obtained
from linear interpolation, and $P_{reg}$ denotes the regional estimate derived from
market-level dynamics. The weights $w_{lin}$ and $w_{reg}$ control the relative influence
of the two estimates and balance local temporal continuity against regional dynamics. This design
is motivated by the complementary properties of the two estimates: local estimates
become more reliable as the number of observed transactions increases, whereas regional
trends provide more stable estimates under sparse observation conditions by leveraging
aggregated market information.
To capture overall market behavior, the regional price trend is first constructed.
At each time step, the regional average price is calculated at the district level
(Si-Gun-Gu), where all observed apartment transaction prices within the same district
and month are aggregated:
$$\bar{P}_t = \frac{1}{N_t} \sum_{k=1}^{N_t} P_{k,t},$$
where $P_{k,t}$ represents the transaction price of apartment $k$ at time $t$, and $N_t$
is the number of apartments with observed transactions at that time. This value represents
the average housing price within a district in a given month.
To reduce short-term fluctuations and noise, the resulting series is smoothed using
a 3-month moving average, yielding a stable regional price trend denoted as $S_t$. The
choice of the 3-month window is examined through the sensitivity analysis presented
in Section 4.1.
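The district-level average and its smoothing can be sketched as below. The text does not state whether the moving average is trailing or centered, nor how months with no district-level transaction are treated; this sketch assumes a trailing window that ignores empty months. The function name `regional_trend` is hypothetical.

```python
import numpy as np

def regional_trend(prices, window=3):
    """District-level monthly mean price and its trailing moving average.

    prices: array of shape (n_apartments, n_months), NaN where no transaction
            occurred; all rows belong to the same district.
    Returns (mean, trend), each of length n_months.
    """
    prices = np.asarray(prices, dtype=float)
    observed = ~np.isnan(prices)
    n_t = observed.sum(axis=0)                           # apartments observed per month
    total = np.where(observed, prices, 0.0).sum(axis=0)  # sum of observed prices
    mean = np.full(prices.shape[1], np.nan)
    np.divide(total, n_t, out=mean, where=n_t > 0)       # stays NaN if nothing observed

    # Trailing window over months t-window+1..t (assumption), skipping NaN months.
    trend = np.full_like(mean, np.nan)
    for t in range(len(mean)):
        chunk = mean[max(0, t - window + 1): t + 1]
        if np.any(~np.isnan(chunk)):
            trend[t] = np.nanmean(chunk)
    return mean, trend

# Illustrative values only.
example = np.array([[100.0, np.nan, 130.0, np.nan],
                    [110.0, 120.0, np.nan, np.nan]])
mean, trend = regional_trend(example)
```

A trailing window avoids using future months when smoothing; a centered window would be smoother but would look ahead by one month.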
Based on this smoothed trend, a regional volatility factor is computed from consecutive
values of the smoothed series. This factor represents the relative change in the regional
market compared to the previous time step.
The regional estimate is then obtained by applying this factor to the previous observed
price. Thus, when an apartment has no transaction in a given month, its price
is updated according to the overall market movement of the region.
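The verbal definitions above admit more than one algebraic form. One reading, stated here as an assumption rather than as the exact formulation used in this study, takes the factor as the ratio of consecutive smoothed values and carries the most recent available price forward multiplicatively. The symbols $V_t$ and $P_{t-1}$ are introduced only for this illustration.

```latex
\[
V_t = \frac{S_t}{S_{t-1}}, \qquad
P_{reg,t} = P_{t-1} \cdot V_t
\]
```

Under this reading, a 2% rise in the smoothed district trend ($V_t = 1.02$) moves the price of an untraded apartment from 50,000 to 51,000. An equivalent parameterization uses the growth rate $r_t = (S_t - S_{t-1}) / S_{t-1}$ with $P_{reg,t} = P_{t-1}(1 + r_t)$.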
The local estimate is obtained using linear interpolation based on the apartment’s
own transaction history. This method connects observed prices across time, ensuring
smooth transitions between known values.
To balance these two estimates, their weights are adjusted based on the observation
density of each apartment. The weight assigned to the local estimate increases linearly
with the observation density, while the regional weight is defined as
the complementary portion so that the two weights sum to one. Apartments with many
observed transactions provide reliable information for interpolation, so the local
estimate is given greater importance. In contrast, apartments with few observations
rely more on regional trends, since their individual price history is less informative.
This adaptive weighting ensures stable and realistic reconstruction across varying
data conditions.
The final price is obtained as a weighted combination of the local and regional estimates.
Reconstruction is applied only to months without transactions, while observed transaction
prices remain unchanged. The proposed method is applied to transaction prices, whereas
other variables are interpolated using standard linear methods.
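The reconstruction of a single apartment series can be sketched as follows. Several details are assumptions made for illustration and are not confirmed by the text: the local weight is set equal to the observation density, the regional factor is taken as the ratio of consecutive smoothed trend values, the factor is applied to the previous available (observed or reconstructed) price, and the function falls back to the local estimate when no regional signal is available. The function name `reconstruct` is hypothetical.

```python
import numpy as np

def reconstruct(prices, trend):
    """Fill unobserved months of one apartment's monthly price series.

    prices: monthly prices with NaN where no transaction occurred
            (at least one observed value is required).
    trend:  smoothed regional trend for the apartment's district.
    """
    prices = np.asarray(prices, dtype=float)
    trend = np.asarray(trend, dtype=float)
    observed = ~np.isnan(prices)
    idx = np.arange(len(prices))

    # Local estimate: linear interpolation between observed transactions.
    # np.interp holds the first/last observed value flat outside that range.
    p_lin = np.interp(idx, idx[observed], prices[observed])

    w_lin = observed.mean()   # assumption: local weight = observation density
    w_reg = 1.0 - w_lin       # complementary weight, so the two sum to one

    out = prices.copy()
    for t in idx:
        if observed[t]:
            continue          # observed transaction prices remain unchanged
        ratio = trend[t] / trend[t - 1] if t > 0 else np.nan
        if np.isnan(ratio):
            out[t] = p_lin[t]  # no regional signal: use the local estimate only
        else:
            p_reg = out[t - 1] * ratio
            out[t] = w_lin * p_lin[t] + w_reg * p_reg
    return out

# Illustrative values only.
filled = reconstruct([100.0, np.nan, np.nan, 130.0], [100.0, 102.0, 104.0, 106.0])
```

In this example the density is 0.5, so each missing month is the midpoint between the interpolated value and the trend-propagated value, while the two observed prices are returned as they were.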
Consistent with this reconstruction framework, this study focuses on forecasting the
temporal evolution of prices for existing apartment units using reconstructed time-series
data.
3.4 Forecasting Model Development
The time-series are reconstructed before sequence construction so that every input
window is temporally continuous. Because interpolation may draw on both past and future
observations, reconstructed values can carry information from later transactions; the
prediction target itself, however, is strictly excluded from input construction. Each
input sequence contains only time steps preceding the prediction time step, so the
target value never appears among the inputs.
The reconstructed time-series are transformed into model-ready sequences using a sliding-window
approach. Sequences are generated chronologically without random shuffling. Each input
consists of 12 months of observations, and the subsequent price is used as the prediction
target. Specifically, a sequence from time t to t+11 is used to predict the price
at time t+12, and this process is repeated across the entire time-series.
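The sliding-window construction can be sketched for a single univariate series as follows. The function name `make_windows` is hypothetical, and the sketch omits the additional apartment-specific and regional features described in this section.

```python
import numpy as np

def make_windows(series, input_len=12):
    """Return (X, y) with X[i] = series[i : i+input_len] and y[i] = series[i+input_len]."""
    series = np.asarray(series, dtype=float)
    n = len(series) - input_len
    if n <= 0:
        return np.empty((0, input_len)), np.empty(0)
    # Windows are generated in chronological order, without shuffling.
    X = np.stack([series[i:i + input_len] for i in range(n)])
    y = series[input_len:]
    return X, y

# A 100-month series yields 100 - 12 = 88 input/target pairs.
X, y = make_windows(np.arange(100))
```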
To ensure a proper time-series forecasting setting, the dataset is partitioned using
a consistent temporal criterion. For each apartment unit (AUID), the full observation
period (100 months) is divided chronologically into training (first 70 months), validation
(next 10 months), and testing (final 20 months). This time-based split is applied
consistently across all AUIDs, ensuring that training data always precede validation
and test data.
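One way to apply the 70/10/20 split to sliding windows is to assign each window by the month of its prediction target. The text does not specify whether validation and test windows may draw their inputs from the preceding subset; the sketch below allows this, which is an assumption. The function name is hypothetical.

```python
def split_by_target_month(n_months=100, input_len=12, train_end=70, val_end=80):
    """Assign each window (identified by its start month) to a subset by target month."""
    splits = {"train": [], "val": [], "test": []}
    for start in range(n_months - input_len):
        target = start + input_len  # month index of the value being predicted
        if target < train_end:
            splits["train"].append(start)
        elif target < val_end:
            splits["val"].append(start)
        else:
            splits["test"].append(start)
    return splits

splits = split_by_target_month()
```

Under these assumptions, each AUID contributes 58 training, 10 validation, and 20 test windows, and every training target precedes every validation and test target.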
Each sequence includes both apartment-specific and regional variables. Numerical features
are standardized, and categorical identifiers are encoded to capture spatial heterogeneity.
Machine learning models use flattened feature vectors, whereas deep learning models
retain sequential structure to capture temporal dependencies. To evaluate the proposed
method, three data configurations are considered: removal of incomplete sequences,
linear interpolation, and the proposed volatility-aware reconstruction.
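The first configuration, removal of incomplete sequences, can be sketched as a filter over sliding windows. The assumption here is that a window is discarded when any of its 12 inputs or its target is unobserved; the exact criterion is not stated in the text, and the function name is hypothetical.

```python
import numpy as np

def complete_windows(series, input_len=12):
    """Keep only windows whose inputs and target are all observed (non-NaN)."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for i in range(len(series) - input_len):
        window = series[i:i + input_len + 1]  # 12 inputs followed by the target
        if not np.isnan(window).any():
            X.append(window[:-1])
            y.append(window[-1])
    return np.array(X).reshape(-1, input_len), np.array(y)

# A single missing month removes every window that overlaps it.
series = np.arange(30, dtype=float)
series[5] = np.nan
X_complete, y_complete = complete_windows(series)
```

In this example a single gap at month 5 removes 6 of the 18 possible windows, which illustrates why sparse series leave few usable sequences under this configuration.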
4. Experimental Results
4.1 Experimental Setup
Experiments were conducted on the final dataset described in Section 3.1, which covers
apartment transactions in Seoul, Busan, and Daegu. To ensure a fair comparison across
data construction strategies, a common eligible AUID pool was first defined: AUIDs
with fewer than 50 observed transaction months over the 100-month study period were
excluded to avoid unreliable sequence construction from extremely sparse histories.
This filtering criterion was applied identically to all experimental settings. The
dataset was partitioned with the chronological split described in Section 3.4 (70
training, 10 validation, and 20 testing months), so that training data always precede
validation and test data for every AUID.
Table 1. Model architecture and key parameters
| Model | Key Parameters |
|---|---|
| XGBoost | n_estimators=1000, max_depth=7 |
| LightGBM | n_estimators=1000, num_leaves=31 |
| LSTM | 1 recurrent layer, hidden_dim=128 |
| GRU | 1 recurrent layer, hidden_dim=128 |
| Transformer | d_model=128, heads=8, encoder layers=2 |
Table 2. Sensitivity analysis of moving average window (GRU model)
| Region | Window (months) | R² | MAE (10,000 KRW) | RMSE (10,000 KRW) | MAPE (%) |
|---|---|---|---|---|---|
| Seoul | 1 | 0.9855 | 6,013 | 7,994 | 5.56 |
| Seoul | 3 | 0.9883 | 5,315 | 7,217 | 4.93 |
| Seoul | 6 | 0.9878 | 5,535 | 7,380 | 5.13 |
| Busan | 1 | 0.9795 | 3,256 | 4,334 | 10.23 |
| Busan | 3 | 0.9807 | 3,196 | 4,226 | 10.13 |
| Busan | 6 | 0.9804 | 3,232 | 4,263 | 10.17 |
| Daegu | 1 | 0.9880 | 1,103 | 1,662 | 4.69 |
| Daegu | 3 | 0.9884 | 1,076 | 1,639 | 4.58 |
| Daegu | 6 | 0.9881 | 1,103 | 1,660 | 4.72 |
All methods were evaluated on the same set of AUIDs, so performance differences arise
from how missing observations are handled rather than from differences in the underlying
sample composition. In the transaction-driven sampling setting, incomplete sequences
are excluded at the sequence level rather than by removing entire apartment units.
Apartment-level time-series were then constructed, and forecasting performance was
evaluated under three data reconstruction strategies: transaction-driven sampling,
linear interpolation, and the proposed volatility-aware reconstruction. Categorical
regional identifiers were encoded using learnable embeddings and combined with numerical
features.
Table 3. Forecasting performance comparison under different data reconstruction strategies
in Seoul
| Strategy | Model | R² | MAE (10,000 KRW) | RMSE (10,000 KRW) | MAPE (%) |
|---|---|---|---|---|---|
| Transaction-driven Sampling | XGBoost | 0.9225 | 10,819 | 17,930 | 9.28 |
| Transaction-driven Sampling | LightGBM | 0.9237 | 10,420 | 17,795 | 8.78 |
| Transaction-driven Sampling | LSTM | 0.9411 | 10,246 | 15,637 | 9.57 |
| Transaction-driven Sampling | GRU | 0.955 | 8,999 | 13,666 | 8.03 |
| Transaction-driven Sampling | Transformer | 0.9271 | 10,850 | 17,392 | 9.39 |
| Linear Interpolation | XGBoost | 0.9723 | 7,814 | 11,094 | 7.07 |
| Linear Interpolation | LightGBM | 0.9521 | 8,324 | 14,599 | 6.56 |
| Linear Interpolation | LSTM | 0.9775 | 8,367 | 10,009 | 8.03 |
| Linear Interpolation | GRU | 0.9874 | 5,687 | 7,488 | 5.27 |
| Linear Interpolation | Transformer | 0.9705 | 8,593 | 11,446 | 8.18 |
| Volatility-Aware Reconstruction | XGBoost | 0.9709 | 7,623 | 11,358 | 6.94 |
| Volatility-Aware Reconstruction | LightGBM | 0.9527 | 7,786 | 14,486 | 6.17 |
| Volatility-Aware Reconstruction | LSTM | 0.9754 | 8,789 | 10,453 | 8.41 |
| Volatility-Aware Reconstruction | GRU | 0.9883 | 5,315 | 7,217 | 4.93 |
| Volatility-Aware Reconstruction | Transformer | 0.9734 | 7,616 | 10,870 | 6.89 |
Five models were evaluated: XGBoost, LightGBM, LSTM, GRU, and Transformer. The deep
learning models (LSTM, GRU, and Transformer) were trained using the Adam optimizer
with a learning rate of 0.001 and mean squared error (MSE) loss. Training was conducted
for up to 100 epochs with early stopping to prevent overfitting. Table 1 summarizes the key model configurations. Performance was assessed using R², MAE,
RMSE, and MAPE, with R² and MAPE as the primary metrics. All experiments were implemented
in Python using PyTorch.
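For reference, the four evaluation metrics can be computed as in the sketch below, which uses their standard definitions. The function name `regression_metrics` is hypothetical, and MAPE is reported in percent and assumes strictly positive true prices.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE, and MAPE (in percent) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),
    }

# Illustrative values only.
metrics = regression_metrics([100.0, 200.0, 300.0, 400.0],
                             [110.0, 190.0, 310.0, 390.0])
```

Because MAPE normalizes each error by the true price, it can rank methods differently from MAE and RMSE when price levels vary widely across apartments.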
Table 4. Forecasting performance comparison under different data reconstruction strategies
in Busan
| Strategy | Model | R² | MAE (10,000 KRW) | RMSE (10,000 KRW) | MAPE (%) |
|---|---|---|---|---|---|
| Transaction-driven Sampling | XGBoost | 0.8938 | 3,997 | 8,716 | 9.33 |
| Transaction-driven Sampling | LightGBM | 0.8932 | 4,012 | 8,742 | 9.43 |
| Transaction-driven Sampling | LSTM | 0.9321 | 3,975 | 6,970 | 10.66 |
| Transaction-driven Sampling | GRU | 0.9297 | 3,978 | 7,093 | 10.44 |
| Transaction-driven Sampling | Transformer | 0.8857 | 4,511 | 9,043 | 11.06 |
| Linear Interpolation | XGBoost | 0.9708 | 2,574 | 5,236 | 6.14 |
| Linear Interpolation | LightGBM | 0.9459 | 2,779 | 7,125 | 5.23 |
| Linear Interpolation | LSTM | 0.9861 | 2,497 | 3,606 | 7.39 |
| Linear Interpolation | GRU | 0.98 | 3,347 | 4,337 | 10.7 |
| Linear Interpolation | Transformer | 0.9611 | 3,762 | 6,045 | 10.12 |
| Volatility-Aware Reconstruction | XGBoost | 0.9735 | 2,391 | 4,945 | 5.86 |
| Volatility-Aware Reconstruction | LightGBM | 0.9499 | 2,547 | 6,805 | 5.18 |
| Volatility-Aware Reconstruction | LSTM | 0.986 | 2,454 | 3,602 | 7.11 |
| Volatility-Aware Reconstruction | GRU | 0.9807 | 3,196 | 4,226 | 10.13 |
| Volatility-Aware Reconstruction | Transformer | 0.9655 | 3,494 | 5,644 | 9.22 |
In addition, the sensitivity of the smoothing window used in the regional volatility
factor was examined. Additional experiments were conducted using 1-, 3-, and 6-month
moving averages, and the results are summarized in Table 2. The 3-month window gave
the highest R² and the lowest MAE, RMSE, and MAPE in all three regions, although the
margins are small (for example, a MAPE of 4.93% versus 5.56% and 5.13% in Seoul). Compared with
the 1-month window, it reduced short-term volatility, whereas the 6-month window tended
to smooth out recent changes excessively. These results indicate that the 3-month
window offers a balanced trade-off between stability and responsiveness.
4.2 Results and Analysis
Tables 3–5 present the performance comparison across different data handling strategies and
model architectures. MAE and RMSE values are reported in units of 10,000 KRW. Across
all regions, the choice of data reconstruction strategy had a substantial impact on
performance, comparable to differences between model architectures.
Table 5. Forecasting performance comparison under different data reconstruction strategies
in Daegu
| Strategy | Model | R² | MAE (10,000 KRW) | RMSE (10,000 KRW) | MAPE (%) |
|---|---|---|---|---|---|
| Transaction-driven Sampling | XGBoost | 0.9443 | 2,055 | 3,280 | 7.95 |
| Transaction-driven Sampling | LightGBM | 0.9421 | 2,011 | 3,343 | 7.62 |
| Transaction-driven Sampling | LSTM | 0.9528 | 1,956 | 3,018 | 7.72 |
| Transaction-driven Sampling | GRU | 0.9543 | 1,902 | 2,971 | 7.49 |
| Transaction-driven Sampling | Transformer | 0.9419 | 2,086 | 3,349 | 8.07 |
| Linear Interpolation | XGBoost | 0.9773 | 1,467 | 2,313 | 5.75 |
| Linear Interpolation | LightGBM | 0.9673 | 1,648 | 2,776 | 6.15 |
| Linear Interpolation | LSTM | 0.9841 | 1,247 | 1,933 | 5.1 |
| Linear Interpolation | GRU | 0.9871 | 1,188 | 1,740 | 5.26 |
| Linear Interpolation | Transformer | 0.9751 | 1,701 | 2,422 | 6.86 |
| Volatility-Aware Reconstruction | XGBoost | 0.9797 | 1,369 | 2,170 | 5.35 |
| Volatility-Aware Reconstruction | LightGBM | 0.9738 | 1,307 | 2,463 | 4.83 |
| Volatility-Aware Reconstruction | LSTM | 0.985 | 1,179 | 1,863 | 4.77 |
| Volatility-Aware Reconstruction | GRU | 0.9884 | 1,076 | 1,639 | 4.58 |
| Volatility-Aware Reconstruction | Transformer | 0.9769 | 1,611 | 2,314 | 6.51 |
Removing incomplete observations yielded the lowest R² for every model in every region. For
example, in Seoul, the GRU model achieved an R² of 0.955 and a MAPE of 8.03%. Applying
linear interpolation led to substantial improvements across all regions by restoring
temporal continuity. In Seoul, the GRU model improved to an R² of 0.9874, with MAPE
reduced from 8.03% to 5.27%. This indicates that preserving consistent time intervals
is essential for effective time-series modeling.
The proposed volatility-aware reconstruction further improved performance over linear
interpolation in most model and region combinations. In Seoul, GRU achieved an R² of
0.9883 and reduced MAPE to 4.93%. Similar improvements were observed in other regions,
such as Busan, where XGBoost reduced MAPE from 6.14% to 5.86%, and Daegu, where GRU
reduced MAPE from 5.26% to 4.58%. The gains were not uniform: in Seoul, LSTM performed
slightly worse under the proposed method (MAPE of 8.41% versus 8.03%), as did XGBoost
in terms of R² (0.9709 versus 0.9723). Overall, these results indicate that incorporating
regional price dynamics can enhance forecasting accuracy beyond local continuity.
Model-wise, sequence-based models such as GRU and LSTM achieved strong performance
when applied to reconstructed time-series, while tree-based models such as XGBoost
and LightGBM maintained stable performance across regions. This suggests that improvements
in data structure benefit different model types in distinct ways.
Table 6. Performance comparison by observation density (GRU model)
| Group | Strategy | R² | MAE | RMSE | MAPE (%) |
|---|---|---|---|---|---|
| Low (20–49) | Transaction-driven | 0.9625 | 6,552 | 13,758 | 9.21 |
| Low (20–49) | Linear Interpolation | 0.9988 | 1,311 | 2,631 | 2.88 |
| Low (20–49) | VA Reconstruction | 0.9983 | 1,635 | 3,217 | 2.47 |
| Medium (50–79) | Transaction-driven | 0.9747 | 4,928 | 9,114 | 8.60 |
| Medium (50–79) | Linear Interpolation | 0.9945 | 2,847 | 4,433 | 5.53 |
| Medium (50–79) | VA Reconstruction | 0.9931 | 3,389 | 4,980 | 6.68 |
| High (80+) | Transaction-driven | 0.9824 | 5,703 | 10,015 | 10.14 |
| High (80+) | Linear Interpolation | 0.9842 | 5,612 | 9,381 | 9.49 |
| High (80+) | VA Reconstruction | 0.9846 | 5,526 | 9,272 | 9.41 |
Regional market characteristics are reflected in the magnitude of performance improvement. In
Seoul, reconstruction led to the largest absolute gains; for example, under the GRU model,
MAE decreased from 8,999 under transaction-driven sampling to 5,315 under the proposed
method (about 41%). This indicates that restoring temporal continuity
and incorporating regional dynamics are particularly effective in highly volatile
markets. In Daegu, improvements are smaller in absolute terms but still consistent. Under the GRU model,
MAE decreased from 1,902 to 1,076 (about 43%), reflecting relatively smooth price dynamics.
This suggests that reconstruction contributes to stable forecasting even in less volatile
markets. Busan shows intermediate behavior, with performance improvements observed
across reconstruction strategies but greater variability across models. For example,
under the proposed reconstruction method, XGBoost achieves the lowest MAE (2,391), while GRU
shows a relatively higher error (3,196). This indicates that forecast performance in Busan
is more sensitive to model choice and data characteristics.
Overall, the results demonstrate that the effectiveness of the reconstruction method
is closely associated with regional market characteristics, with larger absolute gains in more
dynamic markets and consistent improvements across all regions.
To further analyze the effect of observation density on reconstruction performance,
we conducted a breakdown analysis by grouping AUIDs according to the number of observed
transaction months across all regions. The dataset was divided into three groups:
low-frequency (20–49 observed months), medium-frequency (50–79), and high-frequency
(80 or more). The results are summarized in Table 6. As shown in Table 6, the effectiveness of the proposed method varies across observation density levels.
In the low-frequency group, a substantial performance improvement is observed when
interpolation is applied, with MAPE decreasing from 9.21% under transaction-driven
sampling to 2.88% with linear interpolation and 2.47% with the proposed method, although
the proposed method shows higher MAE and RMSE than linear interpolation in this group.
This improvement is primarily attributed to the restoration of temporal
continuity in the input sequences, as fragmented sequences limit the ability of sequence-based
models to capture temporal dependencies. In contrast, interpolation reconstructs continuous
time-series, enabling more effective learning of temporal patterns. In the medium-frequency
group, the proposed method shows lower performance than linear interpolation (MAPE of
6.68% versus 5.53%).
This suggests that when sufficient local observations are available, linear interpolation
can capture temporal patterns effectively, and the incorporation of regional dynamics
may introduce additional variability. In the high-frequency group, performance differences
between methods become marginal, as most observations are already available. However,
the proposed method shows marginal improvements (MAPE of 9.41% versus 9.49%), indicating
that incorporating regional trends can contribute to more stable predictions.
These results indicate that the effectiveness of the proposed approach depends on
observation density, highlighting both its strengths and limitations. In particular,
the method is most beneficial when the data are sparse but still contain sufficient
structure, while its relative advantage decreases when local observations are already
sufficient. Other factors, such as price volatility and regional scale, may also influence
the effectiveness of the proposed approach, and further analysis of these factors
is left for future work.
5. Conclusions
This study addresses a key structural limitation of real estate transaction data,
where event-driven sparsity results in discontinuous time-series representations.
To overcome this issue, we propose a volatility-aware reconstruction method that transforms
transaction records into continuous apartment-level time-series. Experimental results
demonstrate that ensuring temporal continuity is essential for accurate real estate
price forecasting. While interpolation improves performance by restoring continuity,
the proposed approach further enhances accuracy in most model and region combinations
by incorporating regional market dynamics.
This study contributes to the literature by treating unobserved intervals as a structural
modeling problem rather than a simple preprocessing task. By integrating apartment-level
temporal continuity with region-level dynamics, the proposed approach provides a more
realistic representation of housing price evolution. The results also indicate that
improvements in data structure can have an impact comparable to model selection. While
sequence-based models benefit from reconstructed temporal patterns, tree-based models
remain competitive. This suggests that enhanced data quality improves
performance across model types. Performance gains over transaction-driven sampling
in Seoul, Busan, and Daegu indicate that the proposed approach is robust under different
market conditions, supporting its applicability to event-driven real estate data.
Despite these contributions, several limitations remain. The use of a fixed moving
average window may introduce a lag in reflecting sudden market changes, such as rapid
macroeconomic shocks. Although the 3-month window provides a balanced trade-off between
stability and responsiveness, it may not fully capture abrupt shifts in market conditions.
Region-level aggregation may not fully capture complex spatial interactions, and the
set of explanatory variables remains limited. In addition, formal statistical significance
tests were not conducted in this study, which may limit the strength of the performance
comparison. Other factors such as price volatility and regional scale may influence
performance, and their effects warrant further investigation. Finally, the proposed
framework assumes an offline reconstruction setting and does not fully reflect a strictly
causal real-time forecasting scenario. Future research can extend this work by incorporating
more detailed spatial modeling and adaptive reconstruction mechanisms, as well as
validating the approach in other markets.