Training language models to follow instructions with human feedback
The paper (InstructGPT) shows how to align language models with user intent in three steps: supervised fine-tuning of GPT-3 on human-written demonstrations, training a reward model on human rankings of model outputs, and optimizing the fine-tuned model against that reward model with reinforcement learning from human feedback (RLHF). In human evaluations, outputs from the 1.3B-parameter InstructGPT model were preferred to outputs from the 175B-parameter GPT-3, even though the InstructGPT model has over 100x fewer parameters. The approach also improves truthfulness and reduces toxic generations, with only minimal performance regressions on public NLP datasets.
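
To make the "learned reward model" concrete: such a model is typically fit to human comparisons with a pairwise ranking loss, which pushes the scalar reward of the preferred response above that of the rejected one. The following is a minimal sketch of that loss only, assuming the rewards have already been computed by some model. The function name `pairwise_ranking_loss` and its arguments are illustrative, not taken from the paper's code.

```python
import math


def pairwise_ranking_loss(rewards_preferred, rewards_rejected):
    """Mean of -log(sigmoid(r_w - r_l)) over comparison pairs.

    rewards_preferred[i] is the scalar reward for the response a labeler
    preferred in pair i; rewards_rejected[i] is the reward for the other
    response. Both names are hypothetical, chosen for this sketch.
    """
    if len(rewards_preferred) != len(rewards_rejected):
        raise ValueError("both lists must have the same length")
    if not rewards_preferred:
        raise ValueError("at least one comparison pair is required")

    total = 0.0
    for r_w, r_l in zip(rewards_preferred, rewards_rejected):
        margin = r_w - r_l
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids
        # overflow when the margin is large and negative.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(rewards_preferred)


if __name__ == "__main__":
    # The loss is small when the preferred response scores higher,
    # and large when the ordering is wrong.
    print(pairwise_ranking_loss([2.0], [0.0]))
    print(pairwise_ranking_loss([0.0], [2.0]))
```

When both responses receive the same reward, the loss is log 2, the value for a model that cannot tell the pair apart. In a real training setup the rewards would come from a neural network and the loss would be minimized by gradient descent; that part is omitted here.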