S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation

¹Ulm University  ²Google  ³KAUST  ⁴TU Vienna


TL;DR: We introduce S2D, a novel approach that advances unsupervised video instance segmentation by training exclusively on real video data rather than the synthetic sequences used by prior state-of-the-art methods. Our method begins with a Keymask Discovery algorithm that leverages deep motion priors to identify a sparse set of high-quality, temporally coherent masks from noisy single-frame predictions. To bridge the gap between these sparse annotations and full video segmentation, we propose a Sparse-To-Dense Distillation framework aided by a Temporal DropLoss, which enables the model to learn implicit mask propagation and generate dense label sets. This pipeline allows us to outperform existing baselines on in-domain benchmarks like YouTube-VIS and achieve superior zero-shot performance by scaling up to 13,000 real-world training videos.

Abstract

In recent years, the state-of-the-art in unsupervised video instance segmentation has relied heavily on synthetic video data generated from object-centric image datasets such as ImageNet. However, video synthesis by artificially shifting and scaling image instance masks fails to accurately model realistic motion in videos, such as perspective changes, movement of parts of one or multiple instances, or camera motion. To tackle this issue, we propose an unsupervised video instance segmentation model trained exclusively on real video data. We start from unsupervised instance segmentation masks on individual video frames. However, these single-frame segmentations exhibit temporal noise, and their quality varies throughout the video. Therefore, we establish temporal coherence by leveraging deep motion priors to identify high-quality keymasks in the video. The sparse keymask pseudo-annotations are then used to train a segmentation model for implicit mask propagation, for which we propose a Sparse-To-Dense Distillation approach aided by a Temporal DropLoss. After training the final model on the resulting dense label set, our approach outperforms the current state-of-the-art across various benchmarks.

Method Overview

Keymask Discovery

Our approach begins by addressing the limitations of frame-by-frame unsupervised segmentation, which often lacks temporal consistency. We generate initial mask proposals using a pre-trained unsupervised image segmenter. To filter out noise and jitter, we introduce a Keymask Discovery algorithm. This module leverages deep motion priors (point tracks) to assess the temporal stability of each candidate mask. By identifying masks that align consistently with the video's motion trajectory, we select a sparse set of high-confidence Keymasks to serve as reliable pseudo-ground-truth anchors, discarding the noisy predictions found in intermediate frames.
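The exact scoring rule is not spelled out on this page, but the NumPy sketch below illustrates one plausible way to rank per-frame candidate masks by their agreement with point tracks and keep only the most stable frames as keymasks. It assumes the candidate masks of an instance have already been linked across frames; the array shapes, helper names, and the 0.8 threshold are illustrative assumptions rather than the exact procedure used in the paper.

import numpy as np

def track_consistency_scores(masks, tracks, visibles):
    # masks:    (T, H, W) bool, linked candidate masks of one instance over T frames
    # tracks:   (T, P, 2) float, (x, y) positions of P tracked points per frame
    # visibles: (T, P) bool, per-frame point visibility
    # Returns (T,) scores: for each frame, how often the points covered by its mask
    # also fall inside the instance's masks in the other frames.
    T, H, W = masks.shape
    scores = np.zeros(T)

    def points_in_mask(t, point_ids):
        xy = np.round(tracks[t, point_ids]).astype(int)
        ok = (visibles[t, point_ids]
              & (xy[:, 0] >= 0) & (xy[:, 0] < W)
              & (xy[:, 1] >= 0) & (xy[:, 1] < H))
        hit = np.zeros(len(point_ids), dtype=bool)
        hit[ok] = masks[t, xy[ok, 1], xy[ok, 0]]
        return hit

    all_ids = np.arange(tracks.shape[1])
    for t in range(T):
        inside_t = points_in_mask(t, all_ids)   # points covered by the mask at frame t
        if not inside_t.any():
            continue
        ids = all_ids[inside_t]
        # Average agreement of those points with the masks in every other frame.
        agreement = [points_in_mask(s, ids).mean() for s in range(T) if s != t]
        scores[t] = float(np.mean(agreement))
    return scores

def select_keymasks(masks, tracks, visibles, threshold=0.8):
    # Keep only the frames whose mask is stable under the motion prior.
    scores = track_consistency_scores(masks, tracks, visibles)
    return [t for t, s in enumerate(scores) if s >= threshold]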

Sparse-To-Dense Distillation

Once the sparse Keymasks are extracted, the challenge lies in propagating these labels to the entire video sequence. We propose a Sparse-To-Dense Distillation framework in which a video student model is trained to match the Keymasks. To handle the unlabelled frames between anchors, we introduce a Temporal DropLoss, whose objective forces the model to learn implicit mask propagation and effectively "densifies" the sparse annotations. The result is a temporally coherent, dense label set for the full video, which allows for training a robust final segmenter.
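This page does not give the exact form of the Temporal DropLoss; one plausible reading, sketched below in PyTorch, is that the per-frame loss is simply dropped on frames without a keymask, so the video student receives gradients only at the sparse anchors and must propagate masks to the remaining frames through its temporal architecture. The function name and tensor layout are hypothetical.

import torch
import torch.nn.functional as F

def temporal_drop_loss(pred_logits, pseudo_masks, has_keymask):
    # pred_logits:  (B, T, H, W) mask logits from the video student model
    # pseudo_masks: (B, T, H, W) keymask pseudo-labels (ignored on non-key frames)
    # has_keymask:  (B, T) bool, True where a keymask pseudo-label exists
    per_pixel = F.binary_cross_entropy_with_logits(
        pred_logits, pseudo_masks.float(), reduction="none")   # (B, T, H, W)
    per_frame = per_pixel.mean(dim=(-2, -1))                    # (B, T)
    kept = per_frame * has_keymask.float()      # drop the loss on unlabelled frames
    return kept.sum() / has_keymask.float().sum().clamp(min=1.0)

In practice, the binary cross-entropy here would be replaced by the segmenter's own per-frame mask loss; the essential point of this reading is that unsupervised frames contribute no gradient.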

Data Scaling For Zero-Shot Learning

Unlike prior state-of-the-art methods that rely on synthetic video generation (e.g., moving static ImageNet images), S2D is designed to learn from real-world video data. This removes the "reality gap" in motion modeling. By training on raw, unlabelled videos, we can easily scale up our dataset. We demonstrate this by training on 13,000 videos from a mix of datasets. This exposure to realistic occlusions, camera movements, and deformations significantly boosts the model's ability to generalize, enabling strong zero-shot performance on unseen categories and datasets.

Experimental Results

We conduct extensive evaluations on standard Video Instance Segmentation (VIS) benchmarks, such as YouTube-VIS 2019 and YouTube-VIS 2021. We compare S2D against existing unsupervised methods, particularly those utilizing synthetic training data.

Unsupervised Video Instance Segmentation

The table above compares S2D with previous state-of-the-art unsupervised methods on the YouTube-VIS benchmarks. S2D outperforms its competitors by a significant margin. Notably, our method surpasses approaches that rely on synthetic video training, validating our hypothesis that learning from real video motion yields superior feature representations for segmentation.

Zero-Shot Unsupervised Video Instance Segmentation

To test the generalization capabilities of our model, we evaluate S2D in a zero-shot setting. Here, the model is trained on the mixture of datasets and evaluated directly on other benchmarks without any fine-tuning. As shown in the results, S2D demonstrates remarkable transferability, outperforming baselines that fail to adapt to the complex motion patterns found in the target domains.

Qualitative Results

Ablations

Keymask Discovery Components

We analyze the effectiveness of our motion-based selection strategy compared to selecting masks at random or using all available noisy masks. The results demonstrate that our Keymask Discovery is crucial; filtering for high-quality, motion-consistent anchors significantly reduces noise during the training process and leads to higher final AP scores.

Model Training Components

We further investigate the components of our distillation framework. Specifically, we ablate the Temporal DropLoss. The quantitative results confirm that without this temporal regularization, the model struggles to propagate masks correctly across frames, leading to a drop in performance. The DropLoss is essential for learning robust temporal correspondence from sparse supervision.

BibTeX

@preprint{Sick_2025_S2D,
    author    = {Sick, Leon and Hoyer, Lukas and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
    title     = {S2D: Sparse-To-Dense Keymask Distillation for Unsupervised Video Instance Segmentation},
    month     = {December},
    year      = {2025},
}