multimodal — papers · Status Papers

AI · arXiv · Mar 2023 · Open access

GPT-4 Technical Report

OpenAI

This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.
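As a rough illustration of predicting a large model's performance from much smaller ones, the sketch below fits a power law to the losses of a few small training runs and extrapolates to a larger compute budget. It is a simplified stand-in, not the report's actual procedure: the functional form `loss = a * compute ** b` (with no irreducible-loss term), the function names `fit_power_law` and `predict_loss`, and all data points are assumptions made up for the example.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the exponent b.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic "small run" measurements (hypothetical, not from the report).
# Compute is expressed as a fraction of the large run's budget.
small_compute = [1e-4, 1e-3, 1e-2]
small_loss = [5.0 * c ** -0.1 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)
predicted = predict_loss(a, b, 1.0)  # extrapolate to the full budget
print(f"fitted exponent b = {b:.3f}, predicted loss at full compute = {predicted:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters. Real measurements are noisy, so the extrapolation would carry uncertainty that this sketch does not model.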

llm · multimodal · gpt-4 · foundation model

AI · Proceedings of the 38th International Conference on Machine Learning (ICML 2021) · Feb 2021 · Open access

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al.

The paper presents CLIP, which learns visual representations by contrastively matching images to their natural-language captions over a web dataset of 400 million image-text pairs. The pretrained model can be applied zero-shot to many downstream vision tasks by framing class labels as text prompts, without task-specific fine-tuning. Applied zero-shot, it matches the ImageNet accuracy of a supervised ResNet-50, and it transfers robustly across a broad benchmark suite.
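The two ideas in this summary, contrastive image-caption matching and zero-shot classification through text prompts, can be sketched in a few lines. This is a minimal illustration and not the paper's implementation: the embeddings are toy vectors standing in for the outputs of image and text encoders, the function names are hypothetical, and the fixed `temperature=0.07` is an assumed placeholder value.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image k should match caption k, and vice versa."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the index of the class prompt most similar to the image."""
    sims = similarity_matrix([image_emb], class_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings (hypothetical): one text embedding per class prompt,
# e.g. "a photo of a dog", "a photo of a cat", "a photo of a car".
class_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image = [0.1, 0.9, 0.2]
print(zero_shot_classify(image, class_prompts))
```

In this sketch no classifier head is trained for the downstream task: adding a new class only requires embedding one more text prompt.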

clip · vision-language · zero-shot learning · contrastive learning