| Title |
S2F-CLIP: CLIP-based Adaptive Fusion of Sequence and Similarity for Short-term Action Recognition |
| Authors |
이영석(Yeong-seok Lee) ; 박윤하(Yun-ha Park) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.69 |
| Keywords |
CLIP; Action recognition; Sequence modeling; Adaptive gating; Efficiency |
| Abstract |
This paper proposes S2F-CLIP (Sequence-to-Fusion CLIP), a model that adaptively fuses a frame-level CLIP (Contrastive Language-Image Pre-training) embedding sequence with video-text similarity signals for short-term action recognition. The proposed model freezes the CLIP vision and text encoders and summarizes the frame embeddings extracted from an input video with a lightweight sequence encoder to produce sequence-based classification logits. In parallel, it computes similarity-based logits as the cosine similarity between class text embeddings, constructed from predefined prompt templates, and the corresponding video representation, aggregating the per-template similarities with a weighted average. The two sets of logits are then combined with gating weights that reflect the sample and class context to obtain the final prediction. In addition, an auxiliary sequence classification loss and a batch-wise bidirectional KL (Kullback-Leibler) alignment loss improve training stability and consistency between the two branches. On the UCF-101 dataset, the proposed model achieves 90.68% Top-1 accuracy. |
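As an illustrative sketch of the fusion described above (not the paper's actual code), the three core operations can be written as follows: weighted-average aggregation of per-template cosine similarities into similarity-based logits, a gated convex combination of the two logit branches, and a symmetric (bidirectional) KL term between their predictive distributions. The function names, the scalar gate, and the uniform treatment of template weights are hypothetical simplifications; the paper's gate is context-dependent per sample and class.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_logits(video_emb, text_embs, template_weights):
    """Similarity branch: cosine similarity between the video representation
    and class text embeddings, averaged over prompt templates.

    video_emb:        (D,)      pooled video representation
    text_embs:        (T, C, D) T prompt templates x C classes
    template_weights: (T,)      non-negative template weights
    """
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    per_template = t @ v                       # (T, C) cosine similarities
    w = template_weights / template_weights.sum()
    return w @ per_template                    # (C,) weighted average

def gated_fusion(seq_logits, sim_logits, gate):
    # Convex combination of the two branches; gate in [0, 1]
    # (scalar here; per-sample/per-class in the paper).
    return gate * seq_logits + (1.0 - gate) * sim_logits

def bidirectional_kl(p_logits, q_logits):
    # Symmetric KL alignment between the two branch distributions.
    p, q = softmax(p_logits), softmax(q_logits)
    kl = lambda a, b: float(np.sum(a * (np.log(a) - np.log(b))))
    return 0.5 * (kl(p, q) + kl(q, p))
```

With `gate = 1.0` the prediction reduces to the sequence branch alone, and the KL term is zero exactly when the two branches agree, which is what makes it usable as a consistency regularizer.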