| Keywords |
BEMS; Literature classification; Large language model; Fleiss’ kappa; Gwet’s AC1 |
| Abstract |
This study examines the applicability and reliability of the LLM-as-a-Judge approach for literature classification using large language models (LLMs). To this end, a systematic literature review was conducted on domestic research related to Building Energy Management Systems (BEMS), and automated literature classification was performed using ChatGPT-5.1 under both zero-shot and few-shot settings. Four researchers then independently evaluated the appropriateness of the LLM-generated classification results. Inter-rater agreement and classification reliability were analyzed using final consensus rates, Fleiss’ kappa, and Gwet’s AC1 statistics. The results indicate that, even under the zero-shot setting, a meaningful level of agreement was observed between the LLM-based classification and human evaluations. Under the few-shot setting, in which predefined classification criteria established by the researchers were provided, the final consensus rate improved substantially to 95.83%. Although observed inter-rater agreement increased in the few-shot condition, Fleiss’ kappa decreased, a pattern consistent with the well-known kappa paradox. Complementary analysis using Gwet’s AC1 confirmed the performance improvement of the few-shot approach. Overall, this study demonstrates that, rather than directly adopting automatically generated LLM outputs, combining clear, user-defined classification criteria with appropriate human intervention is effective in enhancing classification reliability and consistency in LLM-based literature classification. These findings may serve as a methodological foundation for future research trend analysis and automated literature classification in the field of building energy, including BEMS applications. |
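
The kappa paradox mentioned in the abstract can be illustrated with a minimal sketch. The rating tables below are hypothetical and are not the study's data: each row is one classified paper, and the two columns count how many of four raters judged the LLM classification "appropriate" or "inappropriate". The function names and the two example tables are assumptions made for illustration only; the formulas are the standard multi-rater definitions of Fleiss' kappa and Gwet's AC1 for a fixed number of raters per item.

```python
# Illustrative sketch with hypothetical data (not the study's ratings).
# Each row: [raters judging "appropriate", raters judging "inappropriate"].

def observed_agreement(counts):
    """Mean pairwise agreement across items (shared by kappa and AC1)."""
    n = sum(counts[0])  # raters per item, assumed constant
    per_item = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    return sum(per_item) / len(per_item)

def category_proportions(counts):
    """Share of all ratings that fall into each category."""
    n = sum(counts[0])
    total = len(counts) * n
    k = len(counts[0])
    return [sum(row[j] for row in counts) / total for j in range(k)]

def fleiss_kappa(counts):
    p_a = observed_agreement(counts)
    # Chance agreement: sum of squared category proportions.
    p_e = sum(p * p for p in category_proportions(counts))
    return (p_a - p_e) / (1 - p_e)

def gwet_ac1(counts):
    p_a = observed_agreement(counts)
    props = category_proportions(counts)
    # Chance agreement: average of p * (1 - p) over the k - 1 degrees of freedom.
    p_e = sum(p * (1 - p) for p in props) / (len(props) - 1)
    return (p_a - p_e) / (1 - p_e)

# Hypothetical "balanced" table: ratings split evenly between categories.
balanced = [[4, 0]] * 8 + [[0, 4]] * 8 + [[2, 2]] * 4
# Hypothetical "skewed" table: almost every rating is "appropriate".
skewed = [[4, 0]] * 18 + [[3, 1]] * 2

for name, table in (("balanced", balanced), ("skewed", skewed)):
    print(name,
          round(observed_agreement(table), 3),
          round(fleiss_kappa(table), 3),
          round(gwet_ac1(table), 3))
```

In this hypothetical example the skewed table has higher observed agreement than the balanced one, yet its Fleiss' kappa is lower while its Gwet's AC1 is higher. This happens because kappa's chance-agreement term approaches 1 when nearly all ratings fall into a single category, whereas AC1's chance-agreement term approaches 0.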