Title |
Image-based Approaches for Identifying Harmful Sites using OCR and Average Hash Methods |
Authors |
박시현(Si-Hyeon Park) ; 유성민(Seong-Min You) ; 송동호(Dong-Ho Song) ; 이광재(Kwangjae Lee) |
DOI |
https://doi.org/10.5370/KIEEP.2023.72.2.112 |
Keywords |
Web crawling; OCR; Average Hash; Harmful advertisements identification; Harmful site identification |
Abstract |
Websites carrying harmful content such as gambling, illegal drugs, pornography, and prostitution have recently become readily accessible to the public. These harmful sites damage copyright holders and related service industries and cause various social problems. In this paper, we propose an image-based system that identifies and classifies harmful sites using OCR and Average Hash techniques. The system exploits the fact that most gambling banner advertisements reuse similar images: it computes the average hash of each banner advertisement image and uses that value to analyze image similarity. It also uses EasyOCR to determine whether the phrases written in a banner advertisement are harmful. To evaluate the proposed approach, we built a program that, given a site name, collects and analyzes the site's banner advertisement images to determine whether the site is harmful; its discrimination accuracy was confirmed to be 84%. In addition, because the information collected while the program runs is stored in a database, trends in harmful sites can be identified. This is expected to be useful in searching for harmful sites that may appear in the future. |
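
The two checks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `average_hash`, `hamming`, `is_similar`, and `contains_harmful_phrase`, the distance threshold of 10 bits, and the keyword list are all hypothetical, and a pure-Python block average stands in for the image resizing an imaging library would normally do. The OCR step itself is not run here; the keyword check simply takes a list of already-recognized strings.

```python
from typing import List, Sequence

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def shrink(img: Gray, size: int = 8) -> List[float]:
    """Downsample to size x size by block averaging; return flattened values.

    Assumes the image is at least size x size pixels.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            out.append(sum(block) / len(block))
    return out


def average_hash(img: Gray, size: int = 8) -> int:
    """Average hash: one bit per cell, set when the cell is brighter than the mean."""
    cells = shrink(img, size)
    mean = sum(cells) / len(cells)
    bits = 0
    for i, value in enumerate(cells):
        if value > mean:
            bits |= 1 << i
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, threshold: int = 10) -> bool:
    """Treat two banners as near-duplicates if few bits differ.

    The threshold is an assumed value, not one reported in the paper.
    """
    return hamming(a, b) <= threshold


def contains_harmful_phrase(texts: Sequence[str], keywords: Sequence[str]) -> bool:
    """Case-insensitive substring match of OCR output against a keyword list."""
    lowered = [t.lower() for t in texts]
    return any(k.lower() in t for k in keywords for t in lowered)


# Synthetic 16x16 "banners": dark left half and bright right half,
# a uniformly brightened copy, and a mirrored-contrast image.
banner = [[10] * 8 + [200] * 8 for _ in range(16)]
brighter = [[30] * 8 + [220] * 8 for _ in range(16)]
inverted = [[245] * 8 + [55] * 8 for _ in range(16)]

h_banner = average_hash(banner)
h_brighter = average_hash(brighter)
h_inverted = average_hash(inverted)
```

Because each bit only records whether a cell is above the image's own mean, a uniform brightness change leaves the hash unchanged, which is why reused banner images with minor edits tend to land within a small Hamming distance of each other. In practice the strings passed to `contains_harmful_phrase` would come from an OCR engine, and the keyword list would need to be curated for the target language.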