Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
Summary
The paper presents CLIP (Contrastive Language-Image Pre-training), which learns visual representations by contrastively matching images to their natural-language captions over a web dataset of 400 million image-text pairs. The pretrained model can be applied zero-shot to many downstream vision tasks by framing class labels as text prompts, without task-specific fine-tuning. Used zero-shot, it matches the ImageNet accuracy of a fully supervised ResNet-50 and transfers robustly across a broad benchmark suite.
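The contrastive matching described above can be sketched as a symmetric cross-entropy over the image-text similarity matrix of a batch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the embeddings are plain Python lists standing in for encoder outputs, and the fixed temperature of 0.07 is an illustrative choice.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch where image i is paired with text i.

    temperature is an assumed, fixed value chosen for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Each image should pick out its own caption (rows) and each caption
    # should pick out its own image (columns).
    loss_image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_text_to_image = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_image_to_text + loss_text_to_image) / 2


# Toy batch: two image embeddings and two caption embeddings.
toy_images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(toy_images, [[1.0, 0.0], [0.0, 1.0]])
loss_mismatched = contrastive_loss(toy_images, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is small when each image embedding lines up with its own caption and large when the captions are swapped, which is the signal that pushes matched pairs together during pretraining.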
Key findings
- Large-scale image-text contrastive pretraining yields broadly transferable visual representations.
- Zero-shot CLIP matches the ImageNet accuracy of a fully supervised ResNet-50 and transfers across more than 30 existing computer vision datasets.
- Natural-language supervision improves robustness to distribution shift compared with standard supervised models.
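Zero-shot classification with text prompts, as described in the findings above, can be sketched as follows: each class name is turned into a prompt, the prompt is embedded, and the class whose prompt embedding is most similar to the image embedding is chosen. This is a sketch under assumptions rather than the released model's API: `zero_shot_classify`, `toy_encode_text`, and the lookup table of embeddings are hypothetical stand-ins for a trained text encoder, and the prompt template is one illustrative choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, class_names, encode_text,
                       template="a photo of a {}."):
    """Return the best-matching class name and the per-class similarity scores.

    encode_text is any callable mapping a prompt string to an embedding;
    the template is an assumed prompt format used for illustration.
    """
    prompts = [template.format(name) for name in class_names]
    scores = [cosine_similarity(image_emb, encode_text(p)) for p in prompts]
    best = max(range(len(class_names)), key=scores.__getitem__)
    return class_names[best], scores


# Hypothetical stand-in for a trained text encoder: a fixed lookup table.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a cat.": [0.9, 0.1, 0.0],
    "a photo of a dog.": [0.1, 0.9, 0.0],
    "a photo of a car.": [0.0, 0.1, 0.9],
}


def toy_encode_text(prompt):
    return TOY_TEXT_EMBEDDINGS[prompt]


# A made-up image embedding that lies closest to the "cat" prompt.
label, scores = zero_shot_classify(
    [0.8, 0.2, 0.1], ["cat", "dog", "car"], toy_encode_text
)
```

Because the classifier is defined only by the list of class names, switching to a new task means changing that list, with no retraining of either encoder.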
Cite this paper
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, & Ilya Sutskever (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML 2021). https://arxiv.org/abs/2103.00020
@inproceedings{radford2021learning,
  author = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  title = {Learning Transferable Visual Models From Natural Language Supervision},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning (ICML 2021)},
  year = {2021},
  url = {https://arxiv.org/abs/2103.00020}
}