Title |
Robust Self-supervised Multi-frame Depth Estimation: In Search of Cost Volume Alternatives |
Authors |
김진현(Jinhyeon Kim) ; 김규동(Gyudong Kim) ; 나혁주(Hyukju Na) ; 장현성(Hyunsung Jang) ; 박재민(Jaemin Park) ; 황재기(Jaegi Hwang) ; 하남구(Namkoo Ha) ; 김영근(Young Geun Kim) ; 김승룡(Seungryong Kim) |
DOI |
https://doi.org/10.5573/ieie.2024.61.12.97 |
Keywords |
Deep learning; Self-supervised learning; Transformer; Cost volume; Depth estimation |
Abstract |
Self-supervised multi-frame depth estimation predicts depth by utilizing geometric cues from multiple input frames. Traditional methods rely on epipolar geometry to construct cost volumes, but they have two major drawbacks: (1) they assume a static environment and (2) they require pose information during inference. Consequently, these methods struggle in real-world scenarios with dynamic objects. In this paper, we propose using the cross-attention map as a full cost volume to address these limitations. We show that training the cross-attention layers for image reconstruction enables implicit learning of a warping function, similar to the explicit epipolar warping in conventional methods. We introduce the CRoss-Attention map and Feature aggregaTor (CRAFT), designed to effectively aggregate and refine the full cost volume. We also implement CRAFT hierarchically, enhancing depth predictions through a coarse-to-fine approach. Evaluations on the Cityscapes dataset demonstrate that our method outperforms traditional techniques and remains robust in challenging conditions with dynamic objects. |
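The idea of treating a cross-attention map as a cost volume can be illustrated with a minimal NumPy sketch. This is not the authors' CRAFT module: it assumes flattened per-pixel features, plain scaled dot-product attention with no learned projections, and hypothetical function names (`cross_attention_map`, `attention_warp`). Each row of the map scores one target pixel against every source pixel, so no epipolar constraint or camera pose is needed, and multiplying the map by source features acts as a soft warp into the target view.

```python
import numpy as np


def cross_attention_map(target_feats, source_feats):
    """Scaled dot-product attention weights between two frames.

    target_feats: (N_t, C) flattened target-frame features (queries).
    source_feats: (N_s, C) flattened source-frame features (keys).
    Returns an (N_t, N_s) matrix whose rows sum to 1; row i is a
    matching distribution of target pixel i over all source pixels.
    """
    channels = target_feats.shape[-1]
    logits = target_feats @ source_feats.T / np.sqrt(channels)
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention_warp(attn, source_values):
    """Soft-warp source values into the target view using the attention map."""
    return attn @ source_values


# Toy example with random features (illustrative sizes, not from the paper).
rng = np.random.default_rng(0)
height, width, channels = 4, 6, 16
target = rng.standard_normal((height * width, channels))
source = rng.standard_normal((height * width, channels))

attn = cross_attention_map(target, source)  # (24, 24) all-pairs matching scores
warped = attention_warp(attn, source)       # (24, 16) source features in target view
```

In a reconstruction-driven setup like the one the abstract describes, the warped output would be compared against the target frame, which is what would push the attention weights toward correct correspondences; the training loop and the aggregation and refinement steps are omitted here.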