Title: Transformer based 6DoF Pose Estimation for Visual SLAM
Authors: 채재민 (Jae-Min Chae); 이수찬 (Soo-Chahn Lee)
DOI: https://doi.org/10.5573/ieie.2021.58.12.49
Pages: 49-56
ISSN: 2287-5026
Keywords: Transformer; Self-attention; Hybrid network; Monocular camera; Visual odometry
Abstract: In this paper, we propose an unsupervised learning method for monocular visual odometry using a hybrid network that combines the Vision Transformer (ViT) and Convolutional Neural Network (CNN) architectures. Inspired by the recent state-of-the-art performance of the ViT in classification and segmentation, we use the ViT as the inference portion together with a CNN as the feature-generating portion for the monocular visual odometry problem. We divide the features generated by the CNN into fixed-size patches and compute the associations between patches through the self-attention operation of the ViT. Unlike the standard ViT, we gradually reduce the dimensions of the patches as self-attention is applied, lowering the computational cost. Finally, we obtain six estimated values representing the 6DoF translation and rotation between two frames. We demonstrate performance improvements over previous architectures consisting mostly of convolutional layers. We believe our work serves as an example of the potential of Transformer networks and self-attention for applications in visual odometry.
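To make the described pipeline concrete, below is a minimal PyTorch sketch of a CNN + ViT hybrid that regresses a 6DoF pose from a pair of stacked frames. It is not the authors' implementation: the layer sizes, patch size, input resolution, and the specific way token dimensions shrink (a linear projection after each attention block) are illustrative assumptions standing in for details the abstract does not give.

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Pre-norm Transformer encoder block whose output projection can
    shrink the token dimension (assumed model of the paper's gradual
    dimension reduction)."""
    def __init__(self, dim_in, dim_out, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim_in)
        self.attn = nn.MultiheadAttention(dim_in, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim_in)
        self.mlp = nn.Sequential(
            nn.Linear(dim_in, dim_in * 2),
            nn.GELU(),
            nn.Linear(dim_in * 2, dim_in),
        )
        # Project tokens to a smaller dimension to cut the cost of
        # self-attention in later blocks.
        self.reduce = nn.Linear(dim_in, dim_out) if dim_in != dim_out else nn.Identity()

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.norm2(x))
        return self.reduce(x)

class HybridPoseNet(nn.Module):
    """CNN feature-generating portion + ViT-style inference portion
    regressing a 6DoF pose (3 translation + 3 rotation) between two
    frames. All sizes are illustrative, not the paper's."""
    def __init__(self, patch=4, dims=(256, 128, 64)):
        super().__init__()
        # Feature generation: two stacked RGB frames -> feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, dims[0], 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Split the feature map into fixed-size patches -> token sequence.
        self.to_patches = nn.Unfold(kernel_size=patch, stride=patch)
        self.embed = nn.Linear(dims[0] * patch * patch, dims[0])
        # Inference: attention blocks with progressively smaller token dims.
        self.blocks = nn.Sequential(
            SelfAttentionBlock(dims[0], dims[1]),
            SelfAttentionBlock(dims[1], dims[2]),
        )
        self.head = nn.Linear(dims[2], 6)  # six 6DoF outputs

    def forward(self, frame_pair):                   # (B, 6, H, W)
        f = self.cnn(frame_pair)                     # (B, C, h, w)
        tokens = self.to_patches(f).transpose(1, 2)  # (B, N, C*p*p)
        tokens = self.embed(tokens)                  # (B, N, dims[0])
        tokens = self.blocks(tokens)                 # (B, N, dims[2])
        return self.head(tokens.mean(dim=1))         # (B, 6) pose

pose = HybridPoseNet()(torch.randn(1, 6, 128, 416))
print(pose.shape)  # torch.Size([1, 6])
```

Mean-pooling the tokens before the regression head is one simple choice for producing a single pose vector; a learned pose token, as in classification ViTs, would be an equally plausible reading of the abstract.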