Title: Transformer based 6DoF Pose Estimation for Visual SLAM
Authors: 채재민 (Jae-Min Chae); 이수찬 (Soo-Chahn Lee)
DOI: https://doi.org/10.5573/ieie.2021.58.12.49
Pages: 49-56
ISSN: 2287-5026
Keywords: Transformer; Self-attention; Hybrid network; Monocular camera; Visual odometry
Abstract: In this paper, we propose an unsupervised learning method for monocular visual odometry using a hybrid network that combines the Vision Transformer (ViT) and Convolutional Neural Network (CNN) architectures. Inspired by the recent state-of-the-art performance of the ViT in classification and segmentation, we use the ViT as the inference portion together with a CNN as the feature-generating portion for the monocular visual odometry problem. We divide the features generated by the CNN into fixed-size patches and compute the associations between patches through the self-attention operation of the ViT. Unlike the standard ViT, we gradually reduce the dimensions of the patches as self-attention is applied, lowering the computational cost. Finally, we obtain six estimated values representing the 6DoF translation and rotation between two frames. We demonstrate performance improvements over previous architectures consisting mostly of convolutional layers. We believe our work serves as an example of the potential of Transformer networks and self-attention for applications in visual odometry.
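To make the described pipeline concrete, below is a minimal PyTorch sketch of a CNN + ViT hybrid that regresses a 6DoF pose from a pair of stacked frames. It is not the authors' implementation: the layer sizes, patch size, input resolution, and the specific way token dimensions shrink (a linear projection after each attention block) are illustrative assumptions standing in for details the abstract does not give.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Pre-norm Transformer encoder block whose output projection can
    shrink the token dimension (assumed model of the paper's gradual
    dimension reduction)."""
    def __init__(self, dim_in, dim_out, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim_in)
        self.attn = nn.MultiheadAttention(dim_in, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim_in)
        self.mlp = nn.Sequential(
            nn.Linear(dim_in, dim_in * 2),
            nn.GELU(),
            nn.Linear(dim_in * 2, dim_in),
        )
        # Project tokens to a smaller dimension to cut the cost of
        # self-attention in later blocks.
        self.reduce = nn.Linear(dim_in, dim_out) if dim_in != dim_out else nn.Identity()

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return self.reduce(x)

class HybridPoseNet(nn.Module):
    """CNN feature-generating portion + ViT-style inference portion
    regressing a 6DoF pose (3 translation + 3 rotation) between two
    frames. All sizes are illustrative, not the paper's."""
    def __init__(self, patch=4, dims=(256, 128, 64)):
        super().__init__()
        # Feature generation: two stacked RGB frames -> feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, dims[0], 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Split the feature map into fixed-size patches -> token sequence.
        self.to_patches = nn.Unfold(kernel_size=patch, stride=patch)
        self.embed = nn.Linear(dims[0] * patch * patch, dims[0])
        # Inference: attention blocks with progressively smaller token dims.
        self.blocks = nn.Sequential(
            SelfAttentionBlock(dims[0], dims[1]),
            SelfAttentionBlock(dims[1], dims[2]),
        )
        self.head = nn.Linear(dims[2], 6)  # six 6DoF outputs

    def forward(self, frame_pair):                   # (B, 6, H, W)
        f = self.cnn(frame_pair)                     # (B, C, h, w)
        tokens = self.to_patches(f).transpose(1, 2)  # (B, N, C*p*p)
        tokens = self.embed(tokens)                  # (B, N, dims[0])
        tokens = self.blocks(tokens)                 # (B, N, dims[2])
        return self.head(tokens.mean(dim=1))         # (B, 6) pose

pose = HybridPoseNet()(torch.randn(1, 6, 128, 416))
print(pose.shape)  # torch.Size([1, 6])
```

Mean-pooling the tokens before the regression head is one simple choice for producing a single pose vector; a learned pose token, as in classification ViTs, would be an equally plausible reading of the abstract.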