Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation And Sampling

CVPR 2024

1Ulm University, 2TU Vienna

TL;DR: We guide the learning of unsupervised feature representations by aligning the feature space with the 3D space using depth maps. We encourage depth-feature alignment by learning depth-feature correlation and by sampling feature locations evenly in 3D space. Our approach further propagates the learning signal to the regions neighboring the sampled patches that go into the learning process. Combining our contributions, we outperform all relevant baselines on several datasets such as COCO, Cityscapes, and Potsdam.


Traditionally, training neural networks to perform semantic segmentation requires expensive human-made annotations. More recently, unsupervised learning has made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this, semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) exploiting farthest-point sampling to select relevant features more effectively by applying 3D sampling techniques to the depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets.

Method Overview

Learning architecture

After 5-cropping the image, each crop is encoded by the DINO-pretrained ViT F to output a feature map. Using farthest-point sampling (FPS), we sample the 3D space evenly and convert the resulting coordinates to select samples in the feature map. The sampled features are further transformed by the segmentation head S. A correlation tensor is computed for both sets of feature maps. Next, we sample the depth map at the coordinates obtained by FPS and compute a correlation tensor in the same fashion. Finally, we compute our Depth-Feature Correlation loss and combine it with the feature distillation loss from STEGO. We guide the model to learn depth-feature correlation for crops of the same image, while the feature distillation loss is also applied to k-NN-selected and random images.
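The correlation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the depth similarity measure, and the `shift` parameter are assumptions made for illustration, and the loss only follows the general shape of a STEGO-style correlation loss in which a guidance tensor pushes segmentation-feature correlations up or down.

```python
import numpy as np

def correlation_tensor(a, b, eps=1e-8):
    """Cosine similarity between every row of a (N, C) and every row of b (M, C)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def depth_correlation(d_a, d_b, eps=1e-8):
    """Hypothetical depth similarity in [0, 1] (an assumption, not the paper's
    definition): 1 for equal depths, 0 for the most distant pair."""
    diff = np.abs(d_a[:, None] - d_b[None, :])
    return 1.0 - diff / (diff.max() + eps)

def guided_correlation_loss(guide, seg_corr, shift=0.5):
    """STEGO-style loss sketch: pairs with guide > shift are rewarded for a high
    segmentation correlation, pairs with guide < shift are penalized for it.
    Negative segmentation correlations are clamped to zero."""
    return float(-np.mean((guide - shift) * np.clip(seg_corr, 0.0, None)))

# Usage: depths and segmentation features sampled at the same FPS coordinates.
rng = np.random.default_rng(0)
depths = rng.uniform(1.0, 10.0, size=8)      # sampled depth values
seg_feats = rng.normal(size=(8, 16))         # sampled output of the head S
loss = guided_correlation_loss(
    depth_correlation(depths, depths),
    correlation_tensor(seg_feats, seg_feats),
)
```

In this sketch the depth correlation plays the role that the backbone feature correlation plays in the distillation loss, so the two terms could be summed with a weighting factor.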

Farthest-Point Sampling

We observe that random sampling can miss entire structures, such as the trees in the top row and the plane in the bottom row. In contrast, our method meaningfully samples the depth space and selects locations across the different structures and at depth edges.
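As a rough illustration of the sampling idea, the sketch below lifts a depth map to 3D points and runs the standard greedy farthest-point sampling algorithm on them. The pinhole back-projection with a unit focal length, the helper names, and the fixed start index are assumptions for this example rather than details taken from the paper.

```python
import numpy as np

def backproject(depth, focal=1.0):
    """Lift an (H, W) depth map to (H*W, 3) points with a pinhole model.
    The focal length and centered principal point are illustrative assumptions."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = (xs - (w - 1) / 2) * depth / focal
    y = (ys - (h - 1) / 2) * depth / focal
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all points chosen so far."""
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = start
    min_dist = np.full(n, np.inf)  # squared distance to the nearest selected point
    for i in range(1, k):
        d = np.sum((points - points[selected[i - 1]]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        selected[i] = int(np.argmax(min_dist))
    return selected

def sample_coordinates(depth, k):
    """Return (k, 2) row/column coordinates chosen by FPS in 3D space."""
    idx = farthest_point_sampling(backproject(depth), k)
    return np.stack(np.unravel_index(idx, depth.shape), axis=-1)
```

The returned row and column indices can then be used to read out both the feature map and the depth map at the same locations. Because distant structures occupy a large extent in 3D, they tend to receive samples even when they cover few pixels.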

3D Local Hidden Positives

We visualize the use of depth and attention maps for local hidden positives. For this visualization, we sample the respective propagation maps at the yellow patch in the center of the crops. We observe the depth map to have sharper borders and more consistent propagation values.
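To make the notion of a depth-based propagation map concrete, here is a small sketch of one way such a map could be formed. The exact formulation is not given in this text, so the Gaussian weighting, the `sigma` parameter, and the function name are all assumptions: patches whose depth is close to that of the reference patch receive weights near 1, and patches across a depth edge receive weights near 0.

```python
import numpy as np

def depth_propagation_map(patch_depth, center, sigma=0.1):
    """Hypothetical propagation weights in (0, 1] relative to a reference patch.

    patch_depth: (H, W) array with one depth value per patch.
    center:      (row, col) index of the reference patch.
    sigma:       assumed bandwidth controlling how fast weights decay with depth difference.
    """
    ref = patch_depth[center]
    return np.exp(-((patch_depth - ref) ** 2) / (2 * sigma ** 2))

# Usage: a 3x3 neighborhood where the right column lies behind a depth edge.
patch_depth = np.array([
    [1.0, 1.0, 5.0],
    [1.0, 1.0, 5.0],
    [1.0, 1.0, 5.0],
])
weights = depth_propagation_map(patch_depth, center=(1, 1))
```

Under this assumption, the sharp drop in weight at the depth edge mirrors the sharper borders observed for the depth-based maps compared with attention maps.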



We show that our model, with the help of depth, is able to overcome visual irritations, such as those caused by the building in the first row.


Cityscapes paints a similar picture. Using depth, we can prevent bad predictions caused by visual irritations such as shadows on the road.


Influence of our individual contributions

We investigate the effect of each of our contributions as well as their combination. We find that combining depth-feature correlation and FPS yields a strong learning signal, since many locations with different depths are sampled. Adding 3D-LHP further improves mIoU.


  author    = {Sick, Leon and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
  title     = {Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling},
  journal   = {CVPR},
  year      = {2024},