vision transformer — papers

ICLR 2021 (9th International Conference on Learning Representations) · May 2021 · Open access

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov and Neil Houlsby

This paper introduced the Vision Transformer (ViT), applying a standard Transformer encoder directly to sequences of image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
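As a rough illustration of how image patches become tokens, the following minimal sketch splits an image into non-overlapping 16x16 patches and flattens each one. The function name `patchify`, the nested-list image representation, and the 224x224 RGB input size are assumptions made for this example, not details taken from the summary above. In the paper's model, each flattened patch is additionally passed through a learned linear projection and combined with position embeddings before entering the Transformer encoder; those steps are omitted here.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns a list of (H // P) * (W // P) patches in row-major patch order,
    each a flat list of P * P * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one P x P x C block into a single token vector.
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Toy 224 x 224 RGB image filled with zeros (an assumed input size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

With these assumed sizes, the image yields 14 x 14 = 196 patch tokens, each holding 16 x 16 x 3 = 768 raw values, which is the sense in which an image is "worth 16x16 words".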

vision transformer · image classification · transformers · computer vision