An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, ... Neil Houlsby (12 authors in total; the first three and the senior, final author are shown) · Google Research, Brain Team

Published May 2021 · ICLR 2021 (9th International Conference on Learning Representations) · Conference paper

Summary

This paper introduced the Vision Transformer (ViT), applying a standard Transformer encoder directly to sequences of image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
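The patch-as-token idea can be illustrated with a minimal sketch. It assumes a 224 x 224 RGB input and 16 x 16 patches (the patch size named in the title); the function name `patchify` and the nested-list image layout are illustrative choices, not the authors' implementation. The learned linear projection, class token, and position embeddings that a full ViT applies to each flattened patch are omitted.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels


def patchify(image: Image, patch_size: int) -> List[List[float]]:
    """Split an H x W x C image into flattened, non-overlapping square patches.

    Returns N = (H // P) * (W // P) patch vectors, each of length P * P * C,
    ordered left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one P x P x C block into a single vector (one "token").
            patch = [
                value
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for value in pixel
            ]
            patches.append(patch)
    return patches


# Assumed example input: a blank 224 x 224 RGB image, 16 x 16 patches.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, patch_size=16)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes a sequence of 196 tokens of dimension 768, which is the kind of input a standard Transformer encoder already accepts for text.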

Key findings

A pure Transformer applied to image patches achieves excellent image classification performance.
Large-scale pre-training compensates for the lack of CNN-style inductive biases.
ViT attained results competitive with or better than top CNNs at lower pre-training compute.

Subjects & keywords

Artificial Intelligence · vision transformer · image classification · transformers · computer vision · deep learning

Cite this paper

APA

Dosovitskiy, A., Beyer, L., Kolesnikov, A., ... Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021 (9th International Conference on Learning Representations). https://arxiv.org/abs/2010.11929

BibTeX

@inproceedings{dosovitskiy2021image,
  author    = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and others},
  note      = {12 authors in total (Google Research, Brain Team); senior, final author: Neil Houlsby},
  title     = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  booktitle = {ICLR 2021 (9th International Conference on Learning Representations)},
  year      = {2021},
  url       = {https://arxiv.org/abs/2010.11929}
}

Related in Artificial Intelligence

AI · 2023

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al.

This paper introduces the Segment Anything project: a promptable image segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset. SAM combines an image encoder, a flexible prompt encoder (points, boxes, masks, text), and a fast mask decoder to produce valid segmentation masks from arbitrary prompts. Trained on over 1 billion masks across 11 million images, SAM shows strong zero-shot transfer to many segmentation tasks without additional training.

IEEE/CVF International Conference on Computer Vision (ICCV) · Open access

AI · 2023

GPT-4 Technical Report

OpenAI

This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.

arXiv · Open access

AI · 2023

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al.

The paper presents LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available datasets. It argues that strong performance can be reached without proprietary data and at smaller parameter counts than prior models. LLaMA-13B outperforms the much larger GPT-3 175B on most benchmarks, and LLaMA-65B is competitive with the best contemporary models such as Chinchilla-70B and PaLM-540B.

arXiv preprint (arXiv:2302.13971) · Open access