| Title |
An Exploratory Study of Benchmark Construction for Performance Evaluation of Large Language Models in the Building Environmental Domain |
| DOI |
https://doi.org/10.5659/JAIK.2026.42.5.291 |
| Keywords |
Large Language Model; Built Environment; Benchmark; Domain Performance Evaluation; Expert Consensus |
| Abstract |
This study proposes a benchmark construction method for systematically evaluating the performance of large language models (LLMs) in the
building environmental domain and presents an exploratory experiment applying the benchmark to on-device models. A dual-tier benchmark
framework was developed, categorizing items into core performance indicators and extended performance indicators based on expert consensus.
A total of 120 question–answer items were created using educational materials in the building environmental field, and their importance was
assessed by experts. Of these, 32 items (26.7 percent) were classified as core performance indicators, 83 items (69.2 percent) as
extended performance indicators, and 5 items (4.2 percent) were excluded from the benchmark. The proposed benchmark was then applied
to evaluate two on-device LLMs. The results showed that the models achieved accuracy rates of 43.8 to 59.4 percent on core performance
indicators and 39.1 to 52.2 percent on extended performance indicators. Both models demonstrated higher accuracy on the core performance
indicators, suggesting that concepts with stronger expert consensus were more likely to be reflected in training data for LLMs. Overall, the
findings indicate that a dual-tier benchmark based on expert consensus can serve as an effective tool for evaluating domain-specific
knowledge in LLMs within the building environmental field. |
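The tier shares reported in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the item counts (32, 83, 5) come from the abstract, while the names `counts` and `share` are illustrative assumptions.

```python
# Item counts per tier, as reported in the abstract.
counts = {"core": 32, "extended": 83, "excluded": 5}

# Total number of question-answer items across all tiers.
total = sum(counts.values())


def share(n: int, total: int) -> float:
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Percentage of items falling into each tier (hypothetical helper names).
shares = {tier: share(n, total) for tier, n in counts.items()}

for tier, pct in shares.items():
    print(f"{tier}: {counts[tier]} items, {pct} percent")
```

The rounded shares add up to slightly more than 100 percent because each value is rounded independently.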