The Journal of
the Korean Society on Water Environment

Bimonthly
  • ISSN : 2289-0971 (Print)
  • ISSN : 2289-098X (Online)
  • KCI Accredited Journal

Title The Impact of Data Anomalies on the Performance of Machine Learning Models for Algae Bloom Prediction
Authors 이은지(Eunji Lee) ; 박정수(Jungsu Park)
DOI https://doi.org/10.15681/KSWE.2025.41.5.313
Page pp.313-320
ISSN 2289-0971
Keywords Anomaly detection; Isolation forest; Water quality prediction; XGBoost
Abstract Field data often contain anomalies caused by natural variability and by errors in sensors and experimental procedures. Because these anomalies can degrade model performance, detecting and handling them is crucial. This study developed machine learning models to predict chlorophyll-a, a quantitative indicator of algal blooms, using water quality data collected in the field from 2015 to 2024 as independent variables. It also analyzed how removing anomalies with an anomaly detection algorithm affects model performance. First, datasets were constructed by randomly introducing anomalies into 5%, 10%, 15%, and 20% of the original data. Then, Isolation Forest (IForest), an anomaly detection algorithm, was used to detect and remove these anomalies. The effect of anomaly removal was assessed by training Extreme Gradient Boosting (XGBoost), an ensemble machine learning algorithm, on the cleaned data. The model trained on the original data achieved a root mean squared error (RMSE) of 7.541, whereas the RMSE of models trained on data with anomalies ranged from 8.777 to 17.503, with lower anomaly ratios yielding better performance. In contrast, models trained on data from which anomalies had been removed using IForest showed RMSE values ranging from 7.645 to 8.067. Here too, lower anomaly ratios prior to removal yielded better performance, although the differences across anomaly ratios were relatively small. These results demonstrate that anomaly removal can enhance the performance of machine learning models.
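
The detection-and-removal step described in the abstract can be illustrated with a minimal, standard-library-only sketch of the isolation forest idea: points that random splits isolate after few steps receive high anomaly scores. This is not the authors' implementation. All function names, the corruption rule (multiplying features by `scale`), the tree count, and the subsample size are assumptions made for illustration, and the XGBoost training step is omitted; only the RMSE metric used to compare models is shown.

```python
import math
import random


def inject_anomalies(rows, ratio, scale=10.0, seed=0):
    """Corrupt a random `ratio` of rows by multiplying every feature by `scale`.

    The corruption rule is an illustrative assumption, not the study's method.
    """
    rng = random.Random(seed)
    n_bad = round(len(rows) * ratio)
    bad_idx = set(rng.sample(range(len(rows)), n_bad))
    corrupted = [
        tuple(v * scale for v in row) if i in bad_idx else row
        for i, row in enumerate(rows)
    ]
    return corrupted, bad_idx


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Grow one isolation tree by splitting on a random feature at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # simplification: stop instead of trying another feature
        return {"size": len(rows)}
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {
        "feature": feature,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, row):
    """Depth at which `row` lands, plus an adjustment for unsplit leaf points."""
    depth = 0
    while "split" in tree:
        go_left = row[tree["feature"]] < tree["split"]
        tree = tree["left"] if go_left else tree["right"]
        depth += 1
    return depth + c_factor(tree["size"])


def anomaly_scores(rows, n_trees=100, sample_size=64, seed=0):
    """Score each row in (0, 1]; short average paths give scores near 1.

    Assumes at least two rows, each a tuple of numeric features.
    """
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
        for _ in range(n_trees)
    ]
    norm = c_factor(sample_size)
    return [
        2.0 ** (-sum(path_length(t, r) for t in trees) / (n_trees * norm))
        for r in rows
    ]


def remove_anomalies(rows, contamination, **kwargs):
    """Drop the `contamination` fraction of rows with the highest anomaly scores."""
    scores = anomaly_scores(rows, **kwargs)
    n_drop = round(len(rows) * contamination)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_drop])
    kept = [r for i, r in enumerate(rows) if i not in flagged]
    return kept, flagged


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

In practice a library implementation such as scikit-learn's `IsolationForest` would likely replace the hand-written trees, and the retained rows would then be used to train an XGBoost regressor whose predictions are compared against observed chlorophyll-a with `rmse`.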