Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
Summary
The paper introduces InstructGPT and shows how to align language models with user intent: GPT-3 is first fine-tuned on human-written demonstrations, then optimized against a reward model learned from human preference rankings, using reinforcement learning from human feedback (RLHF). Human evaluators preferred outputs from a 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3 model, even though the smaller model has over 100x fewer parameters. The approach improves truthfulness and reduces toxic generations while causing only minimal regressions on public NLP benchmarks.
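The reward model mentioned above is trained on human comparisons between pairs of model outputs. A minimal sketch of a pairwise ranking loss of that kind follows, assuming the reward model has already assigned a scalar score to each output. The function names and the plain-float interface are illustrative assumptions, not the authors' code; a real implementation would operate on batched tensors in a deep learning framework.

```python
import math


def pairwise_reward_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the preferred output scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); the two branches avoid
    # overflow in exp() for margins of large magnitude.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(score_pairs):
    """Average the loss over (preferred_score, rejected_score) pairs."""
    losses = [pairwise_reward_loss(p, r) for p, r in score_pairs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical scores for three human comparisons.
    comparisons = [(1.2, -0.3), (0.1, 0.4), (2.0, 1.9)]
    print(mean_pairwise_loss(comparisons))
```

Minimizing this loss pushes the score of the human-preferred output above that of the rejected one; the resulting scalar score is what the RLHF stage then optimizes against.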
Key findings
- Supervised fine-tuning plus RLHF aligns model outputs with human instructions and preferences.
- Outputs from a 1.3B-parameter InstructGPT model are preferred to those of the 175B-parameter GPT-3 model in human evaluations.
- Alignment improves truthfulness and reduces toxicity with only a small 'alignment tax' on public benchmarks.
Cite this paper
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, & Ryan Lowe (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). https://arxiv.org/abs/2203.02155
@inproceedings{ouyang2022training,
author = {Long Ouyang and Jeff Wu and Xu Jiang and Diogo Almeida and Carroll L. Wainwright and Pamela Mishkin and Chong Zhang and Sandhini Agarwal and Katarina Slama and Alex Ray and John Schulman and Jacob Hilton and Fraser Kelton and Luke Miller and Maddie Simens and Amanda Askell and Peter Welinder and Paul Christiano and Jan Leike and Ryan Lowe},
title = {Training language models to follow instructions with human feedback},
booktitle = {Advances in Neural Information Processing Systems 35 (NeurIPS 2022)},
year = {2022},
url = {https://arxiv.org/abs/2203.02155}
}