Title S2F-CLIP: CLIP-based Adaptive Fusion of Sequence and Similarity for Short-term Action Recognition
Authors Yeong-seok Lee; Yun-ha Park
DOI https://doi.org/10.5573/ieie.2026.63.4.69
Page pp.69-77
ISSN 2287-5026
Keywords CLIP; Action recognition; Sequence modeling; Adaptive gating; Efficiency
Abstract This paper proposes S2F-CLIP (Sequence-to-Fusion CLIP), which adaptively fuses a frame-level CLIP (Contrastive Language-Image Pre-training) embedding sequence with video-text similarity signals for short-term action recognition. The proposed model freezes the CLIP vision and text encoders and summarizes the frame embeddings extracted from an input video with a lightweight sequence encoder to produce sequence-based classification logits. It also computes similarity-based logits from the cosine similarity between the video representation and class text embeddings constructed from predefined prompt templates, aggregating template-wise similarities with a weighted average. The two sets of logits are then combined with gating weights that reflect the sample and class context to obtain the final prediction. In addition, an auxiliary sequence classification loss and a batch-wise bidirectional KL (Kullback-Leibler) alignment loss are added to improve training stability and consistency between the two branches. On the UCF-101 dataset, the proposed model achieves 90.68% Top-1 accuracy.
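The fusion mechanism described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's actual implementation: the gate parameterization, embedding dimensions, template weights, and all variable names (`similarity_logits`, `gated_fusion`, `bidirectional_kl`) are placeholder assumptions chosen only to show the shape of the computation (template-weighted cosine similarity logits, a sample- and class-dependent sigmoid gate over the two branches, and a symmetric KL term between the branch distributions).

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def similarity_logits(video_emb, text_emb, template_w):
    """Cosine similarity between video embeddings (B, D) and per-template
    class text embeddings (T, C, D), averaged over templates with weights
    template_w (T,). Returns similarity-based logits of shape (B, C)."""
    v = video_emb / np.linalg.norm(video_emb, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    sims = np.einsum('bd,tcd->btc', v, t)      # per-template cosine sims (B, T, C)
    w = template_w / template_w.sum()
    return np.einsum('btc,t->bc', sims, w)     # weighted average over templates

def gated_fusion(seq_logits, sim_logits, gate_w, gate_b):
    """Combine the two branches with a per-sample, per-class sigmoid gate
    computed from both sets of logits (a stand-in for the paper's gating)."""
    gate_in = np.concatenate([seq_logits, sim_logits], axis=-1)  # (B, 2C)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ gate_w + gate_b)))       # gate in (0, 1)
    return g * seq_logits + (1.0 - g) * sim_logits

def bidirectional_kl(seq_logits, sim_logits, eps=1e-8):
    """Symmetric KL divergence between the two branches' class distributions,
    averaged over the batch (the alignment term's general form)."""
    p = softmax(seq_logits)
    q = softmax(sim_logits)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)), axis=-1)
    return 0.5 * float((kl_pq + kl_qp).mean())

# Toy dimensions: batch of 4 clips, 101 classes (as in UCF-101),
# 16-dim embeddings, 4 prompt templates -- all illustrative.
rng = np.random.default_rng(0)
B, C, D, T = 4, 101, 16, 4
seq_logits = rng.normal(size=(B, C))           # from the sequence encoder branch
video_emb = rng.normal(size=(B, D))
text_emb = rng.normal(size=(T, C, D))
template_w = np.ones(T)                        # uniform template weights here
gate_w = rng.normal(scale=0.1, size=(2 * C, C))
gate_b = np.zeros(C)

sim_logits = similarity_logits(video_emb, text_emb, template_w)
fused = gated_fusion(seq_logits, sim_logits, gate_w, gate_b)
align_loss = bidirectional_kl(seq_logits, sim_logits)
```

In this sketch the gate is a single linear layer with a sigmoid; the paper's gating network may differ, but the fused prediction is in both cases a convex combination of the sequence-based and similarity-based logits, while the symmetric KL term pulls the two branch distributions toward agreement during training.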