An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, ... Neil Houlsby (12 authors in total; Google Research, Brain Team)
Summary
This paper introduced the Vision Transformer (ViT), applying a standard Transformer encoder directly to sequences of image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
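To make "image patches treated as tokens" concrete, here is a minimal pure-Python sketch of the patch-splitting step only. It assumes a 224 x 224 RGB input (a size this summary does not state) and the 16 x 16 patch size from the title. `patchify` is a hypothetical helper name, not the authors' code, and the sketch leaves out everything else a full model would need, such as mapping each flattened patch to an embedding and encoding patch positions.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches.

    Returns patches in row-major order; each patch is a flat list of
    patch_size * patch_size * C values. Hypothetical helper for illustration.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    patch.extend(pixel)  # append this pixel's C channel values
            patches.append(patch)
    return patches


# Assumed input size: 224 x 224 RGB; pixel values are arbitrary placeholders.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # prints: 196 768
```

Under these assumptions the image becomes a sequence of (224 / 16)^2 = 196 tokens, each holding 16 * 16 * 3 = 768 raw values, which is the kind of sequence a standard Transformer encoder can consume.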
Key findings
- A pure Transformer applied to image patches achieves excellent image classification performance.
- Large-scale pre-training compensates for the lack of CNN-style inductive biases.
- ViT attained results competitive with or better than top CNNs at lower pre-training compute.
Cite this paper
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, ... Neil Houlsby (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021 (9th International Conference on Learning Representations). https://arxiv.org/abs/2010.11929
@inproceedings{dosovitskiy2021image,
author = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Neil Houlsby and others},
title = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
booktitle = {ICLR 2021 (9th International Conference on Learning Representations)},
year = {2021},
url = {https://arxiv.org/abs/2010.11929}
}