Vahan Huroyan - Recent Developments in Self-Supervised Learning for Computer Vision

Discover DINO, iBOT, and MAE in self-supervised learning for computer vision. Understand how transformer-based architectures and exponential-moving-average teachers improve image representations, enabling multimodal applications and strong downstream performance.

Key takeaways
  • Recent developments in self-supervised learning for computer vision include DINO, iBOT, and MAE.
  • DINO avoids the need for a contrastive loss by using a self-distillation strategy instead: a student network is trained to match the output of an exponential-moving-average teacher (a minimal sketch follows this list).
  • iBOT extends this self-distillation idea with masked image modeling: an online tokenizer (the teacher) provides targets for predicting masked patch tokens, while cross-view distillation keeps the representations invariant to data augmentations.
  • MAE is a masked autoencoder that learns to reconstruct missing patches of images with a transformer-based encoder-decoder, masking a high ratio (~75%) of patches so the task stays non-trivial (sketched after this list).
  • Most of the state-of-the-art self-supervised learning methods are trained on the ImageNet dataset.
  • Avoiding trivial solutions, such as always predicting the same output, is crucial in self-supervised learning.
  • To avoid collapsing into trivial solutions, recent methods use an exponential moving average (momentum teacher), a stop-gradient operator, and an asymmetric network design (e.g., a predictor head on one branch only).
  • The learned visual representations are typically evaluated on downstream tasks such as image classification, object detection, and segmentation, most commonly by linear probing or fine-tuning (see the probe sketch below).
  • Self-supervised learning methods behave broadly similarly when the models are trained on different amounts of data.
  • The choice of augmentations, such as random crops, color distortion, and Gaussian blur, can have a significant impact on performance (a typical pipeline appears below).
  • Recent work has also explored multimodal data for learning visual representations; CLIP, for example, learns from paired image and text data with a contrastive objective (sketched at the end of this section).
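
A minimal sketch of DINO's self-distillation loop. The backbones here are toy MLPs standing in for the paper's ViTs; the temperatures and momentum values follow the paper's typical defaults, but everything else is illustrative rather than the reference implementation:

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy MLP backbones standing in for the paper's ViTs (assumption: any
# encoder that maps inputs to K prototype logits works the same way here).
student = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 64))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad = False          # the teacher receives no gradients

center = torch.zeros(64)             # running center of teacher outputs
t_student, t_teacher = 0.1, 0.04     # sharpening temperatures
momentum, center_m = 0.996, 0.9      # EMA coefficients

def dino_loss(view_a, view_b):
    """Cross-view self-distillation: the student matches the teacher's
    centered, sharpened distribution for the *other* view."""
    global center
    with torch.no_grad():            # stop-gradient through the teacher branch
        out_a, out_b = teacher(view_a), teacher(view_b)
        t_a = F.softmax((out_a - center) / t_teacher, dim=-1)
        t_b = F.softmax((out_b - center) / t_teacher, dim=-1)
        batch_mean = torch.cat([out_a, out_b]).mean(dim=0)
        center = center_m * center + (1 - center_m) * batch_mean
    s_a = F.log_softmax(student(view_a) / t_student, dim=-1)
    s_b = F.log_softmax(student(view_b) / t_student, dim=-1)
    return -0.5 * ((t_a * s_b).sum(-1).mean() + (t_b * s_a).sum(-1).mean())

@torch.no_grad()
def update_teacher():
    """After each optimizer step, the teacher tracks the student via EMA."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

# Toy usage: two augmented "views" of the same batch.
x1, x2 = torch.randn(32, 128), torch.randn(32, 128)
dino_loss(x1, x2).backward()
update_teacher()
```

Centering (subtracting a running mean from the teacher's logits) and sharpening (a low teacher temperature) are the two devices that keep the teacher's output distribution from collapsing to a constant, which is the trivial-solution failure mode noted above.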
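
The MAE recipe in miniature: drop a large fraction of patches, encode only the visible ones, and reconstruct the masked ones. This sketch operates on pre-computed patch embeddings and regresses them directly instead of raw pixels; the dimensions and layer counts are placeholders, with only the 75% mask ratio taken from the paper:

```python
import torch
import torch.nn as nn

class ToyMAE(nn.Module):
    """Masked autoencoding over patch embeddings (toy dimensions)."""
    def __init__(self, num_patches=196, dim=256, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        make = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(make(), num_layers=2)
        self.decoder = nn.TransformerEncoder(make(), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, patches):                 # (B, N, D) patch embeddings
        B, N, D = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        order = torch.rand(B, N, device=patches.device).argsort(dim=1)
        vis_idx, mask_idx = order[:, :keep], order[:, keep:]
        take = lambda t, i: torch.gather(t, 1, i.unsqueeze(-1).expand(-1, -1, D))
        # The encoder sees only the visible patches -- MAE's efficiency trick.
        enc = self.encoder(take(patches + self.pos, vis_idx))
        # The decoder sees encoded patches plus mask tokens at masked positions.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, vis_idx.unsqueeze(-1).expand(-1, -1, D), enc)
        rec = self.decoder(full + self.pos)
        # The reconstruction loss is computed on the masked patches only.
        return ((take(rec, mask_idx) - take(patches, mask_idx)) ** 2).mean()

# Toy usage: a batch of 4 images, each as 196 patch embeddings of size 256.
loss = ToyMAE()(torch.randn(4, 196, 256))
loss.backward()
```

Because the encoder never processes the masked 75% of patches, pre-training is substantially cheaper than running a full ViT over every patch.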
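
For downstream evaluation, the most common protocol is a linear probe: freeze the pretrained backbone and train only a linear classifier on its features. A hedged sketch, where `backbone` stands in for any frozen pretrained encoder and the dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_probe_step(backbone, probe, optimizer, images, labels):
    """One step of training a linear classifier on frozen features."""
    backbone.eval()
    with torch.no_grad():               # the backbone is frozen
        feats = backbone(images)
    loss = F.cross_entropy(probe(feats), labels)
    optimizer.zero_grad()
    loss.backward()                     # gradients reach the probe only
    optimizer.step()
    return loss.item()

# Toy usage: a random "backbone" and a 1000-way probe (ImageNet-style).
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
probe = nn.Linear(512, 1000)
opt = torch.optim.SGD(probe.parameters(), lr=0.1)
linear_probe_step(backbone, probe, opt,
                  torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8,)))
```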
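
A typical two-view augmentation pipeline, written with torchvision; the parameter values follow common SimCLR/DINO-style settings but are illustrative rather than any single paper's exact configuration:

```python
from torchvision import transforms

# Random crop + color distortion + blur: the core recipe shared by most
# joint-embedding methods; two independent samples give the two "views".
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply(
        [transforms.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply(
        [transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=0.5),
    transforms.ToTensor(),
])
# view_1, view_2 = augment(img), augment(img)  # two views of the same image
```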
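
Finally, the CLIP objective in a few lines: embed images and text into a shared space, then apply a symmetric cross-entropy over the batch's similarity matrix so matched pairs outscore all mismatched ones. The embeddings below are random placeholders, and `temperature` is a learned scalar in the real model:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(len(logits))              # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random 512-d embeddings for a batch of 8 pairs.
loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Training both encoders against this single loss is what allows CLIP's image representations to transfer zero-shot via text prompts.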