Generalizing metric monocular depth estimation is a significant challenge due to its ill-posed nature, and the entanglement between camera parameters and depth further compounds the problem, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting geometric diversity. Yet this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective-geometry constraints, enabling depth regression from the vertical image positions of objects. However, this cue is highly susceptible to overfitting, so we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from camera-specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via the object-size and vertical-image-position cues. A comprehensive evaluation on five autonomous driving datasets demonstrates the effectiveness of the proposed approach, achieving accurate metric depth estimation across varying resolutions, aspect ratios, and camera setups. Notably, we achieve accuracy comparable to existing zero-shot methods despite training on a single dataset with a single camera setup.
Depth cues. Depth can be computed both as a function of an object's imaged size and as a function of its vertical image position.
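Under a pinhole camera model with a flat ground plane and negligible camera pitch, both cues admit simple closed forms; the expressions below are a standard-geometry sketch using illustrative symbols (not necessarily the paper's notation), assuming a known object height $H$ and camera height above ground $h_{\mathrm{cam}}$:

$$
z_{\text{size}} = \frac{f_y \, H}{h},
\qquad
z_{\text{vert}} = \frac{f_y \, h_{\mathrm{cam}}}{v - v_0},
$$

where $f_y$ is the vertical focal length in pixels, $h$ the object's imaged height in pixels, $v$ the image row of the object's ground-contact point, and $v_0$ the horizon row (the principal-point row under zero pitch). The size cue depends only on focal length, whereas the vertical cue additionally depends on camera height and orientation, which is why it is especially prone to overfitting to a fixed camera setup.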
Vertical Canonical Representation. We introduce a novel canonical representation that keeps the vertical image position cue consistent across varying perspective geometries, facilitating learning and generalization.
Our model predicts two canonical depth representations, each paired with an uncertainty estimate. These are transformed into depth maps via the Focal Canonical Transform and the Vertical Canonical Transform, which effectively disentangle camera parameters from depth, facilitating learning and generalization across arbitrary camera setups. Through carefully designed transforms and targeted data augmentation, the approach encourages the two depth maps to rely on distinct cues: one emphasizing object size, the other the object's vertical image position. The final depth map is then computed through probabilistic fusion.
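The paper's exact fusion mechanism is not reproduced here; as a minimal sketch, probabilistic fusion of two per-pixel depth estimates with predicted uncertainties can be instantiated as inverse-variance weighting (a product of Gaussians). The function and variable names below are illustrative assumptions, not the released implementation.

```python
import torch

def fuse_depths(d_size, var_size, d_vert, var_vert, eps=1e-6):
    """Inverse-variance weighted fusion of two per-pixel depth estimates.

    Treats each cue's depth as a Gaussian with its predicted variance and
    returns the product-of-Gaussians mean and variance. This is a generic
    sketch of probabilistic fusion, not necessarily GVDepth's exact scheme.
    """
    w_size = 1.0 / (var_size + eps)   # confidence of the object-size cue
    w_vert = 1.0 / (var_vert + eps)   # confidence of the vertical-position cue
    fused = (w_size * d_size + w_vert * d_vert) / (w_size + w_vert)
    fused_var = 1.0 / (w_size + w_vert)
    return fused, fused_var

# Toy usage with constant H x W maps: the less certain vertical-cue estimate
# (larger variance) receives a proportionally smaller weight.
H, W = 4, 6
d_size = torch.full((H, W), 10.0)
d_vert = torch.full((H, W), 12.0)
var_size = torch.full((H, W), 1.0)
var_vert = torch.full((H, W), 4.0)
fused, fused_var = fuse_depths(d_size, var_size, d_vert, var_vert)
print(fused[0, 0].item())  # 10.4, closer to the more confident size-cue depth
```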
We run extensive ablation studies on five autonomous driving datasets, demonstrating the improved generalization of the proposed Vertical Canonical Representation and of probabilistic fusion guided by the estimated uncertainties.
Metric3D and UniDepth are not fully zero-shot, as they overfit to the aspect ratio and the typically high image resolution used during training. This necessitates image resizing and padding during evaluation, resulting in the processing of blank pixels that add no meaningful detail to the image.
In contrast, GVDepth is fully resolution-agnostic, adapting seamlessly to the native resolution of the input image. This property is especially advantageous for real-time systems, where dynamically adjusting image resolution is one of the simplest and most effective ways to control computational complexity.
@article{koledic2024gvdepth,
title={GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion},
author={Karlo Koledic and Luka Petrovic and Ivan Markovic and Ivan Petrovic},
year={2024},
eprint={2412.06080},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.06080},
}
This research has been funded by the H2020 project AIFORS under Grant Agreement No 952275.