The Journal of
the Korean Society on Water Environment

Bimonthly
  • ISSN : 2289-0971 (Print)
  • ISSN : 2289-098X (Online)
  • KCI Accredited Journal

Title Enhancing Chlorophyll-a Prediction in River Systems Using Integrated Anomaly Detection and Data Balancing Techniques
Authors 강덕준(Dejun Jiang) ; 권혁구(Hyuk-Ku Kwon)
DOI https://doi.org/10.15681/KSWE.2026.42.2.177
Page pp.177-186
ISSN 2289-0971
Keywords Algal blooms; Imbalanced regression; Isolation forest; SMOGN; XGBoost
Abstract Global climate change and human-induced nutrient loading have intensified the eutrophication of aquatic ecosystems. This has led to frequent harmful algal blooms (HABs) in river systems, posing risks to water security and ecosystem health. To manage water quality proactively, accurate predictions of chlorophyll-a (chl-a) levels are essential. However, data-driven modeling encounters challenges such as sensor noise and the infrequent occurrence of high-concentration algal blooms compared to typical background conditions, a situation referred to as imbalanced regression. This study implemented a machine learning pipeline in the Miho River basin, utilizing hydro-chemical data from 2016 to 2025. The Isolation Forest (IForest) algorithm was employed to identify and remove multivariate outliers caused by sensor errors, thus ensuring the integrity of the training data. Two data augmentation strategies were assessed against a baseline Extreme Gradient Boosting (XGBoost) model to address distributional imbalance: Gaussian Noise (GN) injection and the Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise (SMOGN). While GN augmentation yielded only marginal improvements, the SMOGN-augmented model demonstrated superior performance. The SMOGN-XGBoost model achieved the highest overall accuracy (R² = 0.81, RMSE = 4.49 μg/L). In the critical high-concentration range (top 25%), the SMOGN model enhanced explanatory power (R²) by 30.25% compared to the baseline. Feature importance analysis revealed that the balanced model exhibited increased sensitivity to dissolved oxygen (DO). Integrating anomaly detection with SMOGN-based data balancing presents a practical framework for early warning systems in river environments characterized by imbalanced data.
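
The Gaussian Noise (GN) augmentation and the high-concentration evaluation described in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. It assumes that the "top 25%" range is defined by the 75th percentile of observed chl-a, that noise is scaled to a fraction of each column's standard deviation, and that only rare high-value samples are duplicated. The function names, the `copies` and `rel_sigma` parameters, and the toy values are all hypothetical. SMOGN additionally interpolates between neighboring rare samples, which this sketch does not attempt.

```python
import math
import random
import statistics


def top_quartile_threshold(targets):
    """Return the 75th-percentile cut point of the target values."""
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    return statistics.quantiles(targets, n=4, method="inclusive")[2]


def gaussian_noise_augment(rows, targets, threshold,
                           copies=2, rel_sigma=0.05, seed=0):
    """Append jittered copies of samples whose target is >= threshold.

    Assumption: noise is zero-mean Gaussian with a standard deviation of
    rel_sigma times the standard deviation of the corresponding column.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    feat_sd = [statistics.pstdev([r[j] for r in rows])
               for j in range(n_features)]
    targ_sd = statistics.pstdev(targets)
    new_rows, new_targets = list(rows), list(targets)
    for row, y in zip(rows, targets):
        if y < threshold:
            continue  # background samples are left as they are
        for _ in range(copies):
            new_rows.append([x + rng.gauss(0.0, rel_sigma * sd)
                             for x, sd in zip(row, feat_sd)])
            # Concentrations cannot be negative, so clip at zero.
            new_targets.append(max(0.0, y + rng.gauss(0.0, rel_sigma * targ_sd)))
    return new_rows, new_targets


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(statistics.fmean(
        [(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def high_range_scores(y_true, y_pred, threshold):
    """Score only the samples whose observed value is >= threshold."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t >= threshold]
    ts = [t for t, _ in pairs]
    ps = [p for _, p in pairs]
    return r2_score(ts, ps), rmse(ts, ps)


# Toy data (hypothetical, not from the study):
# [water temperature in deg C, DO in mg/L] -> chl-a in ug/L
rows = [[12.0, 10.5], [15.0, 9.8], [18.0, 9.1], [21.0, 8.7],
        [24.0, 9.5], [26.0, 11.2], [27.0, 12.4], [28.0, 13.0]]
targets = [5.0, 6.5, 8.0, 12.0, 20.0, 41.0, 58.0, 75.0]

threshold = top_quartile_threshold(targets)
aug_rows, aug_targets = gaussian_noise_augment(rows, targets, threshold)
```

In a full pipeline, the augmented rows would be used only for training, while `high_range_scores` would be applied to untouched held-out observations so that synthetic samples do not inflate the reported accuracy.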