| Title |
A Study on Improving Predictive Performance and Reliability in Imbalanced Medical Data via Synthetic Data Augmentation |
| Authors |
김영태(YoungTae Kim) ; 황상원(SangWon Hwang) ; 고상백(SangBaek Koh) ; 서병석(ByungSuk Seo) |
| Keywords |
Synthetic data augmentation; Class imbalance; Medical data analysis; Probability calibration; Machine learning |
| Abstract |
Class imbalance, characterized by an extreme scarcity of positive cases, is a common challenge in medical data analysis. This study investigates the impact of synthetic data augmentation on classification performance and predictive reliability in highly imbalanced clinical datasets. Using the KoGES-ARIRANG cohort, we generated synthetic samples for various disease combinations with a variational autoencoder (VAE) framework and systematically evaluated how performance changed across augmentation levels. The results show that synthetic data augmentation consistently improves classification performance as measured by F2-score, Matthews Correlation Coefficient (MCC), AUROC, and AUPRC. In particular, for rare disease combinations with only 30 positive samples, the F2-score increased by up to 0.189 and MCC by up to 0.221, while AUROC and AUPRC improved by up to 0.093 and 0.095, respectively. In addition, analyses of the Brier score and other probability calibration measures showed concurrent improvements in predictive reliability. These findings suggest that synthetic data augmentation not only enhances predictive performance but also improves model robustness and reliability in highly imbalanced clinical settings. |
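
For readers less familiar with the metrics named in the abstract, the following is a minimal sketch of how F2-score, MCC, AUROC, and the Brier score can be computed from labels and predicted probabilities. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the ten-sample toy data are all assumptions made for illustration and have no connection to the KoGES-ARIRANG cohort.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def matthews_corrcoef(tp, fp, fn, tn):
    """Matthews Correlation Coefficient, defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true, y_prob):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half. Quadratic in sample size, so only for small examples.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical toy data: 3 positives among 10 samples (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1, 0.1, 0.2, 0.3, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f2 = fbeta_score(tp, fp, fn, beta=2.0)
mcc = matthews_corrcoef(tp, fp, fn, tn)
auc = auroc(y_true, y_prob)
brier = brier_score(y_true, y_prob)

print(f"F2={f2:.3f}  MCC={mcc:.3f}  AUROC={auc:.3f}  Brier={brier:.3f}")
```

F2 and MCC depend on the chosen threshold, whereas AUROC and the Brier score are computed directly from the predicted probabilities. A lower Brier score indicates better-calibrated, more accurate probabilities, which is why it is reported alongside the discrimination metrics.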