Title |
Model Choice Meets Prompt Choice: A Dual-Factor Study of Zero-Shot Low-Resource Plant Recognition |
Authors |
좌희정(Heejung Jwa) ; 정문희(Munhee Jeong) ; 조정원(Jungwon Cho) |
DOI |
https://doi.org/10.5370/KIEE.2025.74.8.1426 |
Keywords |
Jeju Plants; Classification; Zero-Shot Learning; Image-Text Alignment; Multimodal Embedding |
Abstract |
In this study, we assessed zero-shot classification of Jeju Island plant images using five multimodal vision-language models: CLIP, SigLIP, SigLIP Multilingual, SigLIP SO400M, and SigLIP2. The evaluation data comprised image-text pairs of plant species collected from four ecologically distinct regions (Deonggae Coast, Min-oleum, Jabaebong, and Jeju City). All models were evaluated under an identical zero-shot classification protocol to ensure a fair comparison. SigLIP SO400M achieved the highest accuracy on the Deonggae Coast subset, with a macro accuracy of 0.7460 and a micro accuracy of 0.7612, outperforming the other models. The prompt language format exerted a significant influence on performance: English-only prompts consistently surpassed Korean-only prompts across all models. Confusion matrix analysis revealed region-specific, class-level misclassification patterns and identified the species most prone to confusion. Collectively, these results demonstrate the robust zero-shot classification capabilities of contemporary vision-language models for fine-grained plant species identification and underscore the importance of selecting both an appropriate model and an appropriate prompt format for a given task. The code used for these experiments is publicly available at github.com/flyaround365/JejuPlantsClassification. |
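The abstract reports both a macro and a micro accuracy without defining them. The sketch below illustrates one common convention, which is an assumption here and may differ from the paper's exact definition: micro accuracy is the fraction of all images classified correctly, and macro accuracy is the unweighted mean of per-class accuracies. The function name and the labels are hypothetical and are not taken from the authors' repository.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy under the convention assumed above.

    micro: correct predictions / total samples.
    macro: mean over classes of (correct in class / samples in class).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    correct = defaultdict(int)  # correct predictions per true class
    total = defaultdict(int)    # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Hypothetical labels: three images of species_a (all correct) and one
# image of species_b (misclassified as species_c).
y_true = ["species_a", "species_a", "species_a", "species_b"]
y_pred = ["species_a", "species_a", "species_a", "species_c"]
micro, macro = micro_macro_accuracy(y_true, y_pred)
```

Under this convention the two figures diverge when classes are imbalanced: a species with few images counts as much as a common one in the macro figure, but contributes little to the micro figure.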