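AURIC - Reliability Verification of an LLM-as-a-Judge-Based Literature Classification Technique: A Comparative Analysis of Zero-Shot and Few-Shot Settings on Domestic BEMS Research Literature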

Title	Reliability Evaluation of LLM-as-a-Judge-Based Literature Classification: A Comparative Analysis of Zero-Shot and Few-Shot Approaches Using Domestic BEMS Research
Authors	Seon-Woo Kim ; Byeong-Il Kwon ; Shin-Kyu Kang ; Dong-Won Kim ; Chang-Heon Cheong
Coverage (Cover Date)	Vol.33 No.2(2026-04)
Keywords	BEMS; Literature classification; Large language model; Fleiss’ kappa; Gwet’s AC1
Abstract	This study aims to examine the applicability and reliability of the LLM-as-a-Judge approach for literature classification using large language models (LLMs). To this end, a systematic literature review was conducted on domestic research related to Building Energy Management Systems (BEMS), and automated literature classification was performed using ChatGPT-5.1 under both zero-shot and few-shot settings. Subsequently, four researchers independently evaluated the appropriateness of the LLM-generated classification results. Inter-rater agreement and classification reliability were analyzed using final consensus rates, Fleiss’ kappa, and Gwet’s AC1 statistics. The results indicate that, even under the zero-shot setting, a meaningful level of agreement was observed between the LLM-based classification and human evaluations. However, under the few-shot setting, where predefined classification criteria established by the researchers were provided, the final consensus rate improved substantially to 95.83%. Despite an increase in observed inter-rater agreement in the few-shot condition, a decrease in Fleiss’ kappa was observed, corresponding to the well-known kappa paradox. Complementary analysis using Gwet’s AC1 confirmed the performance improvement of the few-shot approach. Overall, this study demonstrates that, rather than directly adopting automatically generated LLM outputs, combining clear, user-defined classification criteria with appropriate human intervention is effective in enhancing classification reliability and consistency in LLM-based literature classification. These findings may serve as a methodological foundation for future research trend analysis and automated literature classification in the field of building energy, including BEMS applications.
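The kappa paradox mentioned in the abstract can be illustrated with a minimal sketch. The rating counts below are hypothetical and are not the study's data; the function name `agreement_stats` and the two example tables are assumptions made for illustration only. The sketch uses the standard textbook formulas for Fleiss' kappa and Gwet's AC1 with the same number of raters per item, which may differ in detail from the procedure the authors followed.

```python
from typing import Sequence, Tuple


def agreement_stats(counts: Sequence[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (observed agreement, Fleiss' kappa, Gwet's AC1).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2),
    and there must be at least 2 categories.
    """
    n_items = len(counts)
    n_cats = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2 or n_cats < 2:
        raise ValueError("need at least 2 raters and 2 categories")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")

    # Observed agreement: mean share of agreeing rater pairs per item.
    pairs = n_raters * (n_raters - 1)
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / pairs for row in counts
    ) / n_items

    # Overall proportion of ratings that fall in each category.
    props = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]

    # Chance agreement differs between the two statistics.
    pe_fleiss = sum(p * p for p in props)
    pe_gwet = sum(p * (1 - p) for p in props) / (n_cats - 1)

    # Note: kappa is undefined (division by zero) if all ratings
    # fall in a single category, because pe_fleiss is then 1.
    kappa = (p_obs - pe_fleiss) / (1 - pe_fleiss)
    ac1 = (p_obs - pe_gwet) / (1 - pe_gwet)
    return p_obs, kappa, ac1


# Hypothetical data: 4 raters, 20 items, categories = [appropriate, inappropriate].
balanced = [[4, 0]] * 8 + [[0, 4]] * 8 + [[2, 2]] * 4   # ratings split evenly
skewed = [[4, 0]] * 18 + [[3, 1]] * 2                   # almost all "appropriate"

balanced_stats = agreement_stats(balanced)
skewed_stats = agreement_stats(skewed)

for name, stats in (("balanced", balanced_stats), ("skewed", skewed_stats)):
    p_obs, kappa, ac1 = stats
    print(f"{name}: observed={p_obs:.3f} kappa={kappa:.3f} AC1={ac1:.3f}")
```

In this toy example, observed agreement rises from about 0.87 to 0.95 when nearly every rating lands in one category, yet Fleiss' kappa falls from about 0.73 to slightly below zero, while Gwet's AC1 rises from about 0.73 to about 0.95. The difference comes only from how each statistic estimates chance agreement when the category distribution is highly skewed.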