Title An Exploratory Study of Benchmark Construction for Performance Evaluation of Large Language Models in the Building Environmental Domain
Authors 정창헌(Cheong, Chang Heon)
DOI https://doi.org/10.5659/JAIK.2026.42.5.291
Page pp.291-299
ISSN 2733-6247
Keywords Large Language Model; Built Environment; Benchmark; Domain Performance Evaluation; Expert Consensus
Abstract This study proposes a benchmark construction method for systematically evaluating the performance of large language models (LLMs) in the building environmental domain and presents an exploratory experiment applying the benchmark to on-device models. A dual-tier benchmark framework was developed, categorizing items into core performance indicators and extended performance indicators based on expert consensus. A total of 120 question-answer items were created using educational materials in the building environmental field, and their importance was assessed by experts. As a result, 32 items, or 26.7 percent, were classified as core performance indicators, 83 items, or 69.2 percent, as extended performance indicators, and 5 items, or 4.2 percent, were excluded from the benchmark. The proposed benchmark was then applied to evaluate two on-device LLMs. The results showed that the models achieved accuracy rates of 43.8 to 59.4 percent on core performance indicators and 39.1 to 52.2 percent on extended performance indicators. Both models demonstrated higher accuracy on the core performance indicators, suggesting that concepts with stronger expert consensus were more likely to be reflected in training data for LLMs. Overall, the findings indicate that a dual-tier benchmark based on expert consensus can serve as an effective tool for evaluating domain-specific knowledge in LLMs within the building environmental field.
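The tier shares reported in the abstract follow directly from the item counts, and per-tier accuracy is a simple ratio of correct answers to items in each tier. The following is a minimal sketch of that arithmetic, not the study's own evaluation code. The function names `tier_shares` and `accuracy_by_tier` are hypothetical, and the `toy_results` list is made-up illustration data rather than results from the study; only the counts 32, 83, and 5 come from the abstract.

```python
def tier_shares(counts):
    """Return each tier's share of all items, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {tier: round(100 * n / total, 1) for tier, n in counts.items()}


def accuracy_by_tier(results):
    """Return percent accuracy per tier from (tier, is_correct) pairs."""
    totals, correct = {}, {}
    for tier, ok in results:
        totals[tier] = totals.get(tier, 0) + 1
        correct[tier] = correct.get(tier, 0) + (1 if ok else 0)
    return {tier: round(100 * correct[tier] / totals[tier], 1) for tier in totals}


# Item counts taken from the abstract (120 items in total).
counts = {"core": 32, "extended": 83, "excluded": 5}
shares = tier_shares(counts)
print(shares)  # {'core': 26.7, 'extended': 69.2, 'excluded': 4.2}

# Hypothetical toy data, NOT results from the study: (tier, answered correctly?).
toy_results = [
    ("core", True), ("core", True), ("core", True), ("core", False),
    ("extended", True), ("extended", False), ("extended", False), ("extended", False),
]
toy_accuracy = accuracy_by_tier(toy_results)
print(toy_accuracy)  # {'core': 75.0, 'extended': 25.0}
```

Keeping the tier label attached to each graded item, as in `toy_results`, lets core and extended accuracy be reported separately from a single evaluation run.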