Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

Published 26 February 2021 · Proceedings of the 38th International Conference on Machine Learning (ICML 2021) · Conference paper


Summary

The paper presents CLIP (Contrastive Language-Image Pre-training), which learns visual representations by contrastively matching images to their natural-language captions on a dataset of 400 million image-text pairs collected from the web. The pretrained model can be applied zero-shot to many downstream vision tasks by framing class labels as text prompts, without task-specific fine-tuning. Used this way, it matches the ImageNet accuracy of a supervised ResNet-50 without training on any ImageNet examples, and it transfers across a broad suite of benchmarks.
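The contrastive matching described above can be illustrated with a small NumPy sketch. It is not the authors' training code: the embeddings are random stand-ins for encoder outputs, the function names are hypothetical, and the temperature is held fixed at 0.07 for simplicity (an assumption made here; in the paper it is a learned parameter).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Row i of image_emb and row i of text_emb are treated as a matching pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # [n, n] scaled cosine similarities
    diag = np.arange(logits.shape[0])       # correct pairs lie on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data (hypothetical): 4 pairs of 8-dimensional embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))  # images near their captions
mismatched_images = text_emb[::-1]                           # every image paired with the wrong caption

aligned_loss = symmetric_contrastive_loss(aligned_images, text_emb)
mismatched_loss = symmetric_contrastive_loss(mismatched_images, text_emb)
```

In this toy setup the loss is low when each image embedding sits close to its own caption embedding and high when the pairs are scrambled, which is the signal that drives the two encoders toward a shared embedding space.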

Key findings

Large-scale image-text contrastive pretraining yields broadly transferable visual representations.
Zero-shot CLIP matches the ImageNet accuracy of a fully supervised ResNet-50 and generalizes across roughly 30 datasets.
Natural-language supervision improves robustness to distribution shift compared with standard supervised models.
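The zero-shot setup behind these findings can be sketched as follows, assuming the image and the class prompts have already been embedded. The vectors below are hand-written toy values rather than real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any released API.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_emb, class_names, temperature=0.07):
    # Cosine similarity between one image embedding and each class-prompt embedding.
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=1, keepdims=True)
    logits = txt @ img / temperature
    # Softmax over classes (shifted by the max for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(np.argmax(probs))], probs

class_names = ["dog", "cat", "car"]
# Class labels are turned into text prompts; a real text encoder would embed these.
prompts = [f"a photo of a {name}" for name in class_names]

# Hypothetical stand-ins for the encoder outputs (one row per prompt).
class_text_emb = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.1, 0.9, 0.2])  # toy image embedding closest to the "cat" prompt

label, probs = zero_shot_classify(image_emb, class_text_emb, class_names)
```

No classifier weights are trained for the new task in this sketch: swapping in a different list of class names changes the classifier, which is what allows the same pretrained model to be reused across datasets.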

Subjects & keywords

Artificial Intelligence · CLIP · vision-language · zero-shot learning · contrastive learning · multimodal

Cite this paper

APA

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021). https://arxiv.org/abs/2103.00020

BibTeX

@inproceedings{radford2021learning,
  author    = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  title     = {Learning Transferable Visual Models From Natural Language Supervision},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning (ICML 2021)},
  year      = {2021},
  url       = {https://arxiv.org/abs/2103.00020}
}

Related in Artificial Intelligence

AI · 2023

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al.

This paper introduces the Segment Anything project: a promptable image segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset. SAM combines an image encoder, a flexible prompt encoder (points, boxes, masks, text), and a fast mask decoder to produce valid segmentation masks from arbitrary prompts. Trained on over 1 billion masks across 11 million images, SAM shows strong zero-shot transfer to many segmentation tasks without additional training.

IEEE/CVF International Conference on Computer Vision (ICCV) · Open access

AI · 2023

GPT-4 Technical Report

OpenAI

This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.

arXiv · Open access

AI · 2023

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al.

The paper presents LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available datasets. It argues that strong performance can be reached without proprietary data and at smaller parameter counts than prior models. LLaMA-13B outperforms the much larger GPT-3 175B on most benchmarks, and LLaMA-65B is competitive with the best contemporary models such as Chinchilla-70B and PaLM-540B.

arXiv preprint (arXiv:2302.13971) · Open access