An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
This paper introduced the Vision Transformer (ViT), which applies a standard Transformer encoder directly to a sequence of fixed-size image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets (such as ImageNet-21k or JFT-300M) and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
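
The "patches as tokens" idea can be illustrated with a minimal NumPy sketch. It only shows how an image might be cut into non-overlapping 16x16 patches and flattened into a token sequence; the function name `patchify`, the 224x224x3 input size, and the channels-last layout are assumptions made for illustration, not details taken from the summary above. The full model also projects each flattened patch linearly and adds position embeddings, which this sketch omits.

```python
import numpy as np


def patchify(image, patch_size=16):
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Hypothetical helper for illustration; returns an array of shape
    (num_patches, patch_size * patch_size * C), ordered row by row.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Split each spatial axis into (grid index, offset within patch).
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Bring the two grid axes to the front: (gh, gw, p, p, c).
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into one vector.
    return patches.reshape(gh * gw, patch_size * patch_size * c)


# Assumed example input: one 224x224 RGB image, channels last.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, each 16*16*3 values
```

Each row of `tokens` plays the role that a word embedding input plays in a text Transformer, which is where the "16x16 words" in the title comes from.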