GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

Karlo Koledić, Luka Petrović, Ivan Marković, Ivan Petrović
University of Zagreb, Faculty of Electrical Engineering and Computing, Laboratory for Autonomous Systems and Mobile Robotics
Figure: Zero-shot comparisons on KITTI and DDAD.

Zero-shot evaluation on KITTI and DDAD: GVDepth demonstrates competitive zero-shot accuracy on autonomous driving datasets, matching state-of-the-art zero-shot Monocular Depth Estimation (MDE) methods. Remarkably, this is achieved while training on a single dataset collected with a single camera setup, even though that dataset's distribution differs significantly from KITTI and DDAD. Note: UniDepth_RP and Metric3D_RP are not fully zero-shot, as they require resizing and padding to match their training resolution.

Abstract

Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, and the entanglement between camera parameters and depth further amplifies the problem, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting geometric diversity. Yet this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective-geometry constraints, enabling depth regression via the vertical image positions of objects. However, this cue is highly susceptible to overfitting, so we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from setup-specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via the object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios, and camera setups. Notably, we achieve accuracy comparable to existing zero-shot methods, despite training on a single dataset with a single camera setup.


Methodology


Depth cues. Depth can be computed both as a function of an object's imaged size and of its vertical image position.
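For concreteness, a minimal sketch of both cues under a standard pinhole camera model with a flat ground plane; the function and variable names are illustrative, not taken from the paper:

    def depth_from_object_size(f_px, object_height_m, imaged_height_px):
        # Pinhole projection: an object of physical height H imaged over
        # h pixels by a camera with focal length f (in pixels) lies at
        # depth d = f * H / h.
        return f_px * object_height_m / imaged_height_px

    def depth_from_vertical_position(f_px, cam_height_m, v_contact_px, v_horizon_px):
        # Flat-ground geometry: a ground contact point imaged (v - v0)
        # pixels below the horizon row v0, by a camera mounted h_cam
        # meters above the ground, lies at depth d = f * h_cam / (v - v0).
        return f_px * cam_height_m / (v_contact_px - v_horizon_px)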

Vertical Canonical Representation. We introduce a novel canonical representation that keeps the vertical image position cue consistent across varying perspective geometries, facilitating learning and generalization.
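To illustrate the idea: in the flat-ground relation above, depth at a fixed image row scales with the product f * h_cam, so normalizing by that product yields a value independent of the specific camera setup. The sketch below assumes a unit-focal, unit-camera-height canonical space; the paper's exact transform may be defined differently:

    def to_vertical_canonical(depth_m, f_px, cam_height_m):
        # Illustrative normalization: dividing out f * h_cam makes the
        # vertical position cue look identical across camera setups
        # (an assumed canonical space, not the paper's exact definition).
        return depth_m / (f_px * cam_height_m)

    def from_vertical_canonical(depth_canon, f_px, cam_height_m):
        # Inverse transform: rescale a canonical prediction to metric
        # depth for a concrete focal length and camera mounting height.
        return depth_canon * f_px * cam_height_m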


Model architecture


Our model predicts two canonical depth representations, each paired with an uncertainty estimate. These are transformed into depth maps via the Focal Canonical Transform and the Vertical Canonical Transform, which effectively disentangle camera parameters from depth, facilitating learning and generalization across arbitrary camera setups. Through carefully designed transforms and targeted data augmentation, the approach encourages the two depth maps to exploit distinct cues: one emphasizing object size and the other the object's vertical image position. The final depth map is then computed through probabilistic fusion.
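As a minimal sketch of the fusion step, below is inverse-variance weighting, one common way to combine two uncertainty-aware estimates; the paper's exact fusion rule may differ:

    import numpy as np

    def fuse_depth_maps(d_size, sigma_size, d_vert, sigma_vert):
        # Precision (inverse-variance) weighting: wherever one cue is
        # more confident (smaller sigma), it dominates the fused depth.
        w_size = 1.0 / np.square(sigma_size)
        w_vert = 1.0 / np.square(sigma_vert)
        return (w_size * d_size + w_vert * d_vert) / (w_size + w_vert)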


Results


Qualitative ablation




Highlights


We run extensive ablation studies on five autonomous driving datasets, demonstrating the improved generalization of the proposed Vertical Canonical Representation and of probabilistic fusion guided by the estimated uncertainties.

  • The model with the Vertical Canonical Representation achieves better zero-shot accuracy than the model leveraging the object size cue on 18 out of 25 train/test dataset combinations.
  • The model fusing both cues outperforms the model using only the object size cue on 24 out of 25 train/test dataset combinations.

Why resolution adaptability matters



Metric3D and UniDepth are not fully zero-shot, as they overfit to the aspect ratio and the typically high image resolution used during training. This necessitates resizing and padding images during evaluation, so the network processes blank pixels that add no meaningful detail to the image, as sketched below.
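For illustration, a minimal OpenCV-based sketch of the resize-and-pad preprocessing such methods require; the target resolution is a placeholder and depends on the method:

    import cv2
    import numpy as np

    def resize_and_pad(img, target_h, target_w):
        # Scale the image to fit inside the training resolution while
        # preserving its aspect ratio, then pad the remainder with blank
        # zero pixels that the network must still process.
        h, w = img.shape[:2]
        scale = min(target_h / h, target_w / w)
        new_h, new_w = round(h * scale), round(w * scale)
        resized = cv2.resize(img, (new_w, new_h))
        canvas = np.zeros((target_h, target_w) + img.shape[2:], dtype=img.dtype)
        canvas[:new_h, :new_w] = resized
        return canvas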

In contrast, GVDepth is fully resolution-agnostic, adapting seamlessly to the native resolution of the input image. This is especially advantageous for real-time systems, where dynamically adjusting image resolution is one of the simplest and most effective ways to control computational complexity.

BibTeX

@article{koledic2024gvdepth,
      title={GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion}, 
      author={Karlo Koledic and Luka Petrovic and Ivan Markovic and Ivan Petrovic},
      year={2024},
      eprint={2412.06080},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.06080}, 
}

Acknowledgment

This research has been funded by the H2020 project AIFORS under Grant Agreement No 952275.