Title Comparison of Crime Forecasting Models based on Spatio-Temporal Data and Machine Learning
Authors 김동영(Kim, Dongyoung) ; 정성원(Jung, Sungwon)
DOI https://doi.org/10.5659/JAIK.2021.37.1.135
Page pp.135-143
ISSN 2733-6247
Keywords Machine Learning; Crime Forecasting; Resampling; Burglary; Environmental Criminology
Abstract With advances in computing performance and data analysis techniques, research using big data and machine learning is under way in many fields. Domestic research on crime prediction with machine learning, however, remains limited: disclosure of crime data is restricted, and most existing studies either use broad units of analysis or focus on only a few variables. To allocate police resources effectively through practical crime prediction, it is necessary to predict both when and where crimes occur. In this study, we therefore train machine learning models on 9,413 instances of actual theft crime data with spatio-temporal attributes such as crime time and date, buildings, land use, and CCTV. By comparing the results of the prediction models, we intend to provide a basis for future research and practical support for crime prevention activities. We divided the study area into 100 m square grid cells using GIS and assigned crime and spatio-temporal variables to each cell. We then trained typical machine learning models (random forest, decision tree, SVC, and K-NN), performed crime prediction, and compared the results. Crime data are generally highly imbalanced: places where no crime occurred far outnumber places where crimes occurred. Such imbalance can introduce noise and cause inaccurate predictions, so it must be addressed. We therefore propose resampling as a way to resolve the data imbalance and improve prediction accuracy. The comparison of prediction performance showed that the random forest model using the SMOTE method achieved a high F1 score.
This may be because the SMOTE method loses less data than the under-sampling method, and because random forest, as an ensemble model, has an advantage in predicting data with a variety of variables. We compared the influence of each variable using the feature importance function. Overall, the temporal variables showed high influence; among them, "crimes occurred within one month" showed the highest influence. Among the physical-environment variables, "first neighborhood living facility," "retail store," and "detached house" were found to have high influence.
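As an illustration of three ideas the abstract relies on (assigning locations to 100 m grid cells, SMOTE-style oversampling of the rare crime class, and the F1 score used to compare models), the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names `grid_cell`, `smote_like`, and `f1_score` are hypothetical, the coordinates are assumed to be already projected to metres, and the oversampler shows only the core interpolation step of SMOTE. In practice, library implementations such as `SMOTE` from imbalanced-learn and `RandomForestClassifier` from scikit-learn would likely be used instead.

```python
import math
import random


def grid_cell(x_m, y_m, cell_size=100.0):
    """Map projected coordinates (metres) to a (column, row) grid index.

    Assumption: a simple origin-aligned square grid; the paper's GIS
    procedure may differ.
    """
    return (math.floor(x_m / cell_size), math.floor(y_m / cell_size))


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority rows (core SMOTE idea).

    Each synthetic row lies on the segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    Requires at least two minority rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up feature rows for cells with a crime (the minority class).
    crime_cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    print(grid_cell(250.0, 99.9))
    print(smote_like(crime_cells, n_new=3))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Because the synthetic rows are interpolated from existing minority rows rather than produced by discarding majority rows, no original observation is lost, which is consistent with the abstract's explanation of why SMOTE may outperform under-sampling here.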