CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation

Ulm University, TU Vienna


TL;DR: We leverage 3D information to cut instances along their actual 3D boundaries and extract Spatial Importance to make the semantic graph 3D-aware. We then train a class-agnostic detector on the pseudo-masks, augmented with Spatial Confidence, which allows us to capture the signal quality of the pseudo-masks and improve the detector's performance. We apply Spatial Confidence for confident copy-paste augmentation, alpha-blending, and a Spatial Confidence Soft Target Loss. With these contributions combined, we outperform relevant baselines on several benchmarks.

Abstract

Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.

Method Overview

LocalCut: Cutting Instances in 3D

Our method improves instance segmentation by combining semantic and 3D spatial information. While existing methods often fail to separate connected, semantically similar instances in 2D, we address this by leveraging a point cloud representation derived from monocular depth estimation. We first use NCut to obtain an initial semantic partition. We then project the depth map into a 3D point cloud and construct a nearest-neighbor graph over it to capture local geometric structure. Instance boundaries are identified with MinCut, which partitions the graph by minimizing the total weight of the cut edges; the source and sink are set to the points with the strongest foreground and background semantics, respectively, as determined by NCut.
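As a rough illustration, the sketch below builds a k-NN graph over the back-projected points and runs a max-flow/min-cut between the semantically most foreground and most background points. The function name, the Gaussian edge weighting, and the single source/sink choice are our illustrative assumptions, not the exact formulation from the paper.

import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def local_cut(points, fg_score, k=8, sigma=0.1):
    """Illustrative LocalCut sketch: separate one instance from the rest
    of the point cloud with a min-cut over a k-NN graph.

    points:   (N, 3) back-projected depth points
    fg_score: (N,)   per-point foreground score from the NCut partition
    """
    n = len(points)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, idxs = nbrs.kneighbors(points)  # column 0 is the point itself

    g = nx.DiGraph()
    for i in range(n):
        for d, j in zip(dists[i, 1:], idxs[i, 1:]):
            # Capacity decays with 3D distance, so the cheapest cut runs
            # along depth discontinuities (assumed weighting).
            c = float(np.exp(-d / sigma))
            g.add_edge(i, int(j), capacity=c)
            g.add_edge(int(j), i, capacity=c)

    # Source/sink: the points with the strongest foreground and
    # background semantics, respectively.
    src, snk = int(np.argmax(fg_score)), int(np.argmin(fg_score))
    _, (fg_set, _) = nx.minimum_cut(g, src, snk)
    labels = np.zeros(n, dtype=bool)
    labels[list(fg_set)] = True
    return labels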

Spatial Importance

We compute a Spatial Importance map based on depth data, assigning higher importance to regions with rapid depth changes, which likely indicate object boundaries. Inspired by unsharp masking techniques, we apply Gaussian blurring to the depth map and subtract it from the original, revealing high-frequency depth components. These values are normalized and used to sharpen the semantic similarities in the affinity matrix, emphasizing boundaries critical for segmentation. This improved semantic mask ensures LocalCut effectively identifies precise 3D instance boundaries.
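A minimal sketch of the unsharp-masking step, assuming a single-channel depth map and min-max normalization; the affinity-sharpening form below (raising affinities to an importance-dependent power) is our hypothetical reading, not the paper's exact formula.

import numpy as np
from scipy.ndimage import gaussian_filter

def spatial_importance(depth, sigma=5.0):
    """Unsharp masking on depth: high values where depth changes rapidly,
    i.e. at likely 3D object boundaries."""
    high_freq = np.abs(depth - gaussian_filter(depth, sigma=sigma))
    lo, hi = high_freq.min(), high_freq.max()
    return (high_freq - lo) / (hi - lo + 1e-8)  # normalize to [0, 1]

def sharpen_affinity(A, patch_importance, alpha=1.0):
    """Hypothetical sharpening: raise affinities in [0, 1] to a power > 1
    for important patch pairs, suppressing weak similarities there."""
    w = 1.0 + alpha * 0.5 * (patch_importance[:, None] + patch_importance[None, :])
    return A ** w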

Spatial Confidence

Our method refines the pseudo-masks used to train the class-agnostic detector (CAD) by leveraging 3D information to assess their quality through Spatial Confidence maps, which capture the certainty of individual patches along 3D boundaries. This builds on LocalCut, where the segmentation's sensitivity to the cut threshold indicates boundary quality: by sampling variations of the threshold and measuring how stable the resulting masks are, we compute confidence scores that guide training.
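The sketch below shows one way to turn threshold sensitivity into a per-pixel confidence map: rerun the cut at several thresholds and score how consistently each pixel keeps its label. The aggregation (majority agreement rescaled to [0, 1]) is our assumption.

import numpy as np

def spatial_confidence(cut_at, thresholds):
    """cut_at(t) -> (H, W) boolean LocalCut mask at threshold t.
    Pixels that keep their label across all thresholds get confidence 1;
    pixels that flip half the time get confidence 0."""
    masks = np.stack([cut_at(t) for t in thresholds]).astype(np.float32)
    p_fg = masks.mean(axis=0)          # fraction of runs voting foreground
    return 2.0 * np.abs(p_fg - 0.5)    # agreement with the majority label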

Confident Copy-Paste Selection: We enhance copy-paste augmentation by selecting only the highest-quality masks based on their Spatial Confidence scores. This reduces ambiguity, ensuring only reliable masks are used, resulting in improved CAD performance.
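A sketch of the selection step, assuming each candidate is scored by its mean confidence inside the mask and that a fixed fraction of top-scoring masks is kept (both are illustrative choices).

import numpy as np

def select_confident_masks(masks, conf_maps, keep_frac=0.5):
    """Keep only the top-scoring pseudo-masks for copy-paste augmentation.
    masks: list of (H, W) boolean arrays; conf_maps: matching confidence maps."""
    scores = np.array([c[m].mean() for m, c in zip(masks, conf_maps)])
    cutoff = np.quantile(scores, 1.0 - keep_frac)
    return [m for m, s in zip(masks, scores) if s >= cutoff]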

Confidence Alpha-Blending: Instead of binary copy-paste augmentation, we use Spatial Confidence to alpha-blend uncertain regions into the target image. Pixels with high confidence are fully pasted, while lower-confidence areas are partially blended, creating more natural augmentations and further improving model training.
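A sketch of confidence alpha-blending for a single pasted instance; images are float arrays in [0, 1] and conf is the per-pixel Spatial Confidence map (shapes and ranges are our assumptions).

import numpy as np

def confidence_paste(src_img, src_mask, conf, dst_img):
    """Paste an instance with per-pixel alpha = mask * confidence:
    confident pixels replace the target, uncertain ones blend into it."""
    alpha = (src_mask.astype(np.float32) * conf)[..., None]  # (H, W, 1)
    return alpha * src_img + (1.0 - alpha) * dst_img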

Spatial Confidence Soft Target Loss: We modify the CAD loss function to incorporate patch-level confidence from the Spatial Confidence maps. Each mask region’s loss is weighted by its confidence, providing a more precise learning signal. This re-weighting ensures the model better reflects the reliability of the pseudo-mask regions, leading to more accurate instance segmentation.
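A sketch of the re-weighted mask loss in PyTorch, written per pixel for simplicity (the paper weights at the patch level inside the CAD's mask head; normalizing by the total confidence mass is our choice).

import torch
import torch.nn.functional as F

def spatial_confidence_soft_loss(mask_logits, pseudo_mask, conf):
    """mask_logits: (N, H, W) predictions; pseudo_mask: (N, H, W) binary
    targets; conf: (N, H, W) Spatial Confidence in [0, 1]."""
    per_pixel = F.binary_cross_entropy_with_logits(
        mask_logits, pseudo_mask.float(), reduction="none")
    # Down-weight unreliable pseudo-mask regions.
    return (conf * per_pixel).sum() / conf.sum().clamp_min(1.0)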

By integrating Spatial Confidence maps into these strategies, we significantly enhance the training process, yielding cleaner masks and better segmentation performance.

Results

We train a Cascade Mask R-CNN on our pseudo-masks generated for ImageNet-1K (IN1K) together with Spatial Confidence, and further refine the model with one round of self-training.

Zero-Shot Unsupervised Instance Segmentation

For zero-shot unsupervised instance segmentation, we outperform the state of the art on the COCO val2017 and COCO20K datasets across a wide range of metrics.

Zero-Shot Unsupervised Object Detection

We evaluate our method on several object detection datasets and find that we outperform our baseline, CutLER, on all benchmarks, and CuVLER on most evaluations.

In-Domain Unsupervised Instance Segmentation

We further self-train our model on the COCO target domain and find that we outperform the state of the art on the COCO val2017, COCO20K, and LVIS datasets across a wide range of metrics.

More Qualitative Results

Ablations

Effect Of Our Contributions

We observe that the combination of LocalCut and Spatial Importance shows a strong synergistic effect: Spatial Importance sharpening improves the semantics, enabling LocalCut to more effectively identify 3D object boundaries. Spatial Confidence further refines performance, with a detailed analysis provided in a later section. Each technical contribution incrementally improves the model's overall performance.

Depth Sources

We evaluate the applicability of different zero-shot monocular depth estimators for our method, leveraging recent advancements in the field. To cover diverse approaches, we select one model from each category: ZoeDepth, trained on a mix of metric depth datasets; Marigold, which learns depth from synthetic data; and Kick Back & Relax, which uses self-supervision on SlowTV videos. For each estimator, we predict depth for the entire IN1K training set and use our best-performing model configuration with DiffNCuts as the feature extractor for pseudo-mask generation, varying only the depth source. This allows us to analyze the impact of the depth estimation strategy on our approach.

BibTeX

@misc{sick2024cuts3d,
      title={CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation},
      author={Leon Sick and Dominik Engel and Sebastian Hartwig and Pedro Hermosilla and Timo Ropinski},
      year={2024},
      eprint={2411.16319},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.16319},
}