| Title |
S2F-CLIP: CLIP-based Adaptive Fusion of Sequence and Similarity for Short-term Action Recognition |
| Authors |
이영석(Yeong-seok Lee) ; 박윤하(Yun-ha Park) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.69 |
| Keywords |
CLIP; Action recognition; Sequence modeling; Adaptive gating; Efficiency |
| Abstract |
This paper proposes S2F-CLIP (Sequence-to-Fusion CLIP), a model that adaptively fuses a frame-level CLIP (Contrastive Language-Image Pre-training) embedding sequence with video-text similarity signals for short-term action recognition. The proposed model freezes the CLIP vision and text encoders and summarizes the frame embeddings extracted from an input video with a lightweight sequence encoder to produce sequence-based classification logits. In parallel, it computes similarity-based logits as the cosine similarity between class text embeddings, constructed from predefined prompt templates, and the corresponding video representation, aggregating the per-template similarities with a weighted average. The two sets of logits are then combined with gating weights that reflect the sample and class context to obtain the final prediction. In addition, an auxiliary sequence classification loss and a batch-wise bidirectional KL (Kullback-Leibler) alignment loss improve training stability and consistency between the two branches. On the UCF-101 dataset, the proposed model achieves 90.68% Top-1 accuracy. |
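As an illustrative sketch of the fusion described above (not the paper's actual code), the three core operations can be written as follows: weighted-average aggregation of per-template cosine similarities into similarity-based logits, a gated convex combination of the two logit branches, and a symmetric (bidirectional) KL term between their predictive distributions. The function names, the scalar gate, and the uniform treatment of template weights are hypothetical simplifications; the paper's gate is context-dependent per sample and class.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_logits(video_emb, text_embs, template_weights):
    """Similarity branch: cosine similarity between the video representation
    and class text embeddings, averaged over prompt templates.

    video_emb:        (D,)      pooled video representation
    text_embs:        (T, C, D) T prompt templates x C classes
    template_weights: (T,)      non-negative template weights
    """
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    per_template = t @ v                       # (T, C) cosine similarities
    w = template_weights / template_weights.sum()
    return w @ per_template                    # (C,) weighted average

def gated_fusion(seq_logits, sim_logits, gate):
    # Convex combination of the two branches; gate in [0, 1]
    # (scalar here; per-sample/per-class in the paper).
    return gate * seq_logits + (1.0 - gate) * sim_logits

def bidirectional_kl(p_logits, q_logits):
    # Symmetric KL alignment between the two branch distributions.
    p, q = softmax(p_logits), softmax(q_logits)
    kl = lambda a, b: float(np.sum(a * (np.log(a) - np.log(b))))
    return 0.5 * (kl(p, q) + kl(q, p))
```

With `gate = 1.0` the prediction reduces to the sequence branch alone, and the KL term is zero exactly when the two branches agree, which is what makes it usable as a consistency regularizer.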