Language Models are Few-Shot Learners
This paper presented GPT-3, an autoregressive language model with 175 billion parameters, and studied its ability to perform tasks from natural-language descriptions and a few examples without gradient updates (in-context learning). Scaling the model dramatically improved few-shot performance across many NLP benchmarks, sometimes approaching fine-tuned systems. The authors also examined limitations, data contamination, and broader societal impacts of large language models.