Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and 19 others (DeepMind)
Summary
This paper (the 'Chinchilla' paper) investigates the compute-optimal trade-off between model size and training-token count for large language models. By training over 400 models from 70M to 16B parameters on 5B to 500B tokens, the authors find that model size and training data should be scaled in roughly equal proportion, which implies that prior large models were significantly undertrained. Their 70B-parameter Chinchilla model, trained on 1.4T tokens under the same compute budget as Gopher (280B), outperformed much larger models.
Key findings
- For compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: each doubling of model size calls for a doubling of the training tokens.
- Many contemporary large models (e.g., GPT-3, Gopher) were substantially undertrained relative to their parameter count.
- Chinchilla (70B) trained on 1.4T tokens outperformed Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG across many downstream tasks.
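The equal-scaling finding can be illustrated with a minimal sketch. It rests on two assumptions that the summary above does not state: training compute is approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens, and the tokens-per-parameter ratio is held at the value implied by the quoted Chinchilla figures (1.4T tokens / 70B parameters, about 20). The function name compute_optimal and both constants are illustrative, not the authors' fitted scaling law.

```python
import math

# Assumption (not stated in the summary): training compute is approximated
# as C ≈ 6 * N * D FLOPs for N parameters and D training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0

# Ratio implied by the figures quoted above: 1.4T tokens / 70B parameters.
TOKENS_PER_PARAM = 1.4e12 / 70e9  # about 20


def compute_optimal(flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (params, tokens) at a fixed tokens/params ratio.

    Illustrative only: solves C = 6 * N * D with D = tokens_per_param * N.
    """
    if flops <= 0:
        raise ValueError("flops must be positive")
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    # Budget implied by a 70B-parameter model trained on 1.4T tokens.
    budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
    for scale in (1, 4, 100):
        n, d = compute_optimal(scale * budget)
        print(f"{scale:>4}x compute: {n / 1e9:8.1f}B params, {d / 1e12:6.2f}T tokens")
```

Under these assumptions, both quantities grow as the square root of compute, so quadrupling the budget doubles the parameter count and the token count together.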
Cite this paper
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, and 19 others (DeepMind) (2022). Training Compute-Optimal Large Language Models. Advances in Neural Information Processing Systems (NeurIPS). https://doi.org/10.48550/arXiv.2203.15556
@inproceedings{hoffmann2022training,
author = {Jordan Hoffmann and Sebastian Borgeaud and Arthur Mensch and others},
title = {Training Compute-Optimal Large Language Models},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2022},
doi = {10.48550/arXiv.2203.15556},
url = {https://arxiv.org/abs/2203.15556}
}