Title |
Comparison of Crime Forecasting Models based on Spatio-Temporal Data and Machine Learning |
Authors |
김동영(Kim, Dongyoung) ; 정성원(Jung, Sungwon) |
DOI |
https://doi.org/10.5659/JAIK.2021.37.1.135 |
Keywords |
Machine Learning; Crime Forecasting; Resampling; Burglary; Environmental Criminology |
Abstract |
With advances in computing performance and data analysis techniques, research using big data and machine learning is actively
underway in various fields. However, domestic research on crime prediction using machine learning remains
insufficient because disclosure of crime data is restricted and most existing studies predicted crimes using broad units of analysis
or only a few variables. To allocate police resources effectively through practical crime prediction, it is necessary to predict the
time and place of a crime. In this study, we therefore trained machine learning models on 9,413 instances of actual theft crime data
with spatio-temporal elements such as crime time and date, buildings, land use, and CCTV. By comparing the results of the
prediction models, we aim to provide a basis for future research and practical support for crime prevention activities. We divided
the study area into 100 m square grid cells using GIS and assigned crime and spatio-temporal variables to each cell. Subsequently, we trained
typical machine learning models such as random forest, decision tree, SVC, and K-NN, conducted crime prediction, and compared the
results of the models. Crime data are generally highly imbalanced: places where no crime occurred far outnumber places where crimes
occurred. Imbalanced data can introduce noise and cause inaccurate predictions, so these issues
must be addressed. We therefore propose a resampling method to resolve the data imbalance and provide crime
prediction with improved accuracy. A comparison of the models' prediction performance showed that the random forest model
with the SMOTE method achieved a high F1 score. This could be because the SMOTE method loses less data than
the under-sampling method, and because random forest, as an ensemble model, has an advantage in predicting data with many variables.
We compared the influence of each variable using feature importance. Overall, the temporal variables showed
high influence; among them, "crimes occurred within one month" showed the highest influence. Among the physical environment
variables, "first neighborhood living facility," "retail store," and "detached house" were found to have high influence. |
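
The resampling idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `smote_sketch` and `f1_score`, the toy coordinates, and the parameter values are all assumptions made for illustration. The sketch shows the core SMOTE step (creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours) and how an F1 score is computed from predictions.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (illustrative SMOTE-style step).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 4 grid cells with a crime versus 20 cells without.
crime_cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
no_crime_count = 20

# Oversample the minority class until both classes have the same size.
new_cells = smote_sketch(crime_cells, n_new=no_crime_count - len(crime_cells))
balanced_crime_cells = crime_cells + new_cells
```

Unlike under-sampling, which would discard 16 of the 20 majority cells in this toy case, the oversampling step keeps every original record, which is consistent with the explanation the abstract offers for the SMOTE result. In practice a maintained library implementation of SMOTE would normally be preferred over a hand-written one.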