Attention-Guided Masked Autoencoders for Learning Image Representations

WACV 2025 (Oral Presentation)

1Ulm University, 2TU Vienna


TL;DR: We guide the reconstruction learning of a masked autoencoder with attention maps to learn image representations with an improved high-level semantic understanding. We demonstrate this through extensive linear and k-NN evaluations of our learned representations on multiple benchmark datasets, for classification, retrieval, semantic segmentation and Taskonomy tasks.

Abstract

Masked autoencoders (MAEs) have established themselves as a powerful pre-training method for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene which we employ in the loss function to put increased emphasis on reconstructing relevant objects. Thus, we incentivize the model to learn improved representations of the scene for a variety of tasks. Our evaluations show that our pre-trained models produce off-the-shelf representations more effective than the vanilla MAE for such tasks, demonstrated by improved linear probing and k-NN classification results on several benchmarks while at the same time making ViTs more robust against varying backgrounds and changes in texture.

Video 🎬

Method Overview

Learning architecture

Following the standard masked autoencoder protocol, we first mask 75% of the image patches and pass the visible patches through the encoder-decoder architecture to reconstruct the masked patches and compute the reconstruction loss. As in the MAE, this is simply the MSE between the decoded and original pixel values. At the same time, we pass the full image through our unsupervised object discovery network, TokenCut, to obtain the raw attention mask: a scalar foreground score between 0 and 1 is assigned to every patch. We then pass this attention mask through our scaling function, where it is normalized, sharpened with a temperature parameter tau, and inserted into the exponential function. We go into the intuition and effect behind this design choice later. Finally, we multiply the scaled attention mask with the per-patch reconstruction loss, effectively performing a semantic reweighting of the loss. We find this enables the model to put increased emphasis on the foreground object.
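The reweighting described above can be sketched as follows. This is a minimal NumPy sketch under our own assumptions: the function names, the per-image min-max normalization, and the exact placement of the temperature are illustrative, not the paper's verbatim implementation.

```python
import numpy as np

def scale_attention(attn, tau):
    # attn: (B, N) raw foreground scores from the object-discovery network.
    # Normalize per image to [0, 1], sharpen with temperature tau,
    # then pass through the exponential (assumed form of the scaling function).
    a = attn - attn.min(axis=1, keepdims=True)
    a = a / (a.max(axis=1, keepdims=True) + 1e-8)
    return np.exp(a / tau)

def guided_loss(pred, target, mask, attn, tau=1.0):
    # pred/target: (B, N, D) per-patch pixel values; mask: (B, N), 1 = masked.
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # MSE per patch
    weights = scale_attention(attn, tau)               # semantic reweighting
    weighted = per_patch * weights * mask
    return weighted.sum() / mask.sum()                 # average over masked patches
```

With a lower tau the exponential assigns disproportionately larger weights to high-foreground patches, which is what shifts the emphasis toward the object.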

Unsupervised Attention Maps

We leverage TokenCut to extract attention maps on all of ImageNet. TokenCut is a recent method that performs a Normalized Cut on DINO features. We visualize some qualitative examples here. These attention maps already come in the same patch-wise shape as the reconstruction loss of the masked autoencoder, where each patch is assigned a single semantic foreground score.
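As a quick sanity check on why the shapes line up: for a standard ViT-style encoder with 16x16 patches on a 224x224 input (illustrative numbers, assuming the usual ViT-B/16 configuration), both the per-patch loss and the foreground scores live on the same grid.

```python
# Patch-grid arithmetic for a ViT-style encoder (illustrative assumption).
img_size, patch_size = 224, 16
grid = img_size // patch_size          # 14 patches per side
n_patches = grid * grid                # 196 patches in total
# One foreground score per patch -> attention map of shape (196,),
# matching the MAE's per-patch reconstruction loss exactly.
```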

Guidance Scheduling

We implement a cosine decay schedule for the temperature parameter. This allows the model to first focus on the global structure of the image and then gradually concentrate on the finer details. We find this to be crucial for the model to learn the correct attention maps.
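A cosine decay for tau can be sketched as below; the start and end values are placeholders for illustration, not the settings used in the paper. A high tau keeps the exponential weights nearly flat (global focus), and as tau decays the weighting sharpens onto the foreground.

```python
import math

def tau_schedule(epoch, total_epochs, tau_start=1.0, tau_end=0.1):
    # Cosine decay of the temperature from tau_start to tau_end over training.
    # Endpoint values are illustrative assumptions.
    cos = 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))
    return tau_end + (tau_start - tau_end) * cos
```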

Results

Linear Probing For Classification

We show that our attention guidance improves the MAE across many different datasets.

Further Tasks

Our approach also improves the learned representations for further tasks.

Ablations

Alternative Attention Sources

We experiment with different sources for attention maps. Here, we display a selection of maps. One can see that DINO maps are a bit sparse, GradCAM is imprecise, while TokenCut captures the main foreground object well.

Uses Of The Attention Map

Further, we experiment with different ways of guiding the masked autoencoder, such as inverting the attention map or masking the input. Across all variations, we find that our attention guidance performs best by a large margin.

BibTeX

@InProceedings{Sick_2025_WACV,
    author    = {Sick, Leon and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
    title     = {Attention-Guided Masked Autoencoders for Learning Image Representations},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {836-846}
}