Title |
Motion-based Frame Sampling for Natural Language-based Vehicle Retrieval |
Authors |
김동영(Dongyoung Kim) ; 이경오(Kyoungoh Lee) ; 장인수(In-su Jang) ; 김광주(Kwang-Ju Kim) ; 김병근(Pyong-Kun Kim) ; 유재준(Jaejun Yoo) |
DOI |
https://doi.org/10.5573/ieie.2024.61.11.120 |
Keywords |
Multi-modal; Vision-language matching; NLP-based video retrieval |
Abstract |
Retrieving target vehicles from natural language descriptions plays a crucial role in intelligent transportation systems. Traditional methods perform retrieval with models that leverage the correlation between textual and visual representations, such as CLIP. However, because these models operate only on images, they struggle to capture the temporal dynamics of video data. Recent studies have therefore attempted to enhance temporal understanding through various data augmentation techniques and video encoders. Despite these efforts, conventional approaches frequently overlook the detailed temporal characteristics of vehicles. To address this limitation, we introduce a Motion-based Video Sampling method that effectively captures the detailed motion information of the target vehicle. Additionally, we build a robust model by implementing a re-ranking algorithm to handle various vehicle attributes. Our proposed model achieves state-of-the-art performance on the public vehicle retrieval dataset. |