AI · Proceedings of the 38th International Conference on Machine Learning (ICML 2021) · Feb 2021 · Open access

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al.

The paper presents CLIP (Contrastive Language-Image Pre-training), which learns visual representations by contrastively matching images to their natural-language captions over a dataset of 400 million image-text pairs collected from the web. The pretrained model can be applied zero-shot to many downstream vision tasks by framing class labels as text prompts, without task-specific fine-tuning. In this zero-shot setting, it matches the ImageNet accuracy of a supervised ResNet-50 and transfers robustly across a broad benchmark suite.
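The contrastive matching and prompt-based zero-shot classification described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`contrastive_loss`, `zero_shot_predict`) are hypothetical, the embeddings stand in for the outputs of real image and text encoders, and the fixed `temperature=0.07` is an assumed illustrative value.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    every other combination in the batch is treated as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # shape (N, N)
    diag = np.arange(logits.shape[0])            # indices of the matched pairs
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def zero_shot_predict(image_emb, class_text_emb):
    """Pick, for each image, the class whose prompt embedding is most similar.

    class_text_emb holds one embedding per class prompt, e.g. the encoded
    text "a photo of a dog" (the prompt wording here is only an example).
    """
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(4, 8))                        # stand-in text embeddings
    images = text + 0.01 * rng.normal(size=text.shape)    # stand-in matched images
    print("loss, matched pairs:  ", contrastive_loss(images, text))
    print("loss, shuffled pairs: ", contrastive_loss(images, text[::-1]))
    print("zero-shot predictions:", zero_shot_predict(images, text))
```

Because classification reduces to a similarity lookup against encoded prompts, adding a new class in this sketch only requires appending a row to `class_text_emb`; no weights are updated.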

clip · vision-language · zero-shot learning · contrastive learning