Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, and 7 others (OpenAI / Johns Hopkins)
Summary
This paper establishes empirical scaling laws showing that the cross-entropy loss of Transformer language models follows smooth power-law relationships with model size, dataset size, and the amount of training compute. The relationships hold across many orders of magnitude, while architectural details such as width and depth have comparatively minor effects within a wide range. The work provides a quantitative framework for predicting model performance and allocating compute budgets.
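As a rough sketch of what "power-law relationship" means here, each trend can be written in the form below. The symbols are notation introduced for this illustration: X stands for one of model size N, dataset size D, or training compute C; X_c is a fitted scale constant; and α_X is a fitted exponent. Each single-variable form is meant to apply only when the other two factors are not the limiting bottleneck.

```latex
L(X) = \left( \frac{X_c}{X} \right)^{\alpha_X}
\quad\Longleftrightarrow\quad
\log L(X) = \alpha_X \log X_c - \alpha_X \log X ,
\qquad X \in \{N, D, C\}
```

The second form shows why such trends appear as straight lines on a log-log plot: the slope is the negative of the exponent.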
Key findings
- Test loss scales as a power law in model parameters, dataset size, and training compute, with some trends spanning more than seven orders of magnitude.
- Within broad ranges, model shape (depth vs. width) matters far less than total parameter count.
- Larger models are more sample-efficient, so for a fixed compute budget the optimal strategy is to train a very large model on a relatively modest amount of data and stop well before convergence (a conclusion later refined by the Chinchilla study).
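The findings above can be made concrete with a minimal sketch of how a power-law trend might be fitted and then extrapolated. Everything numeric here is a hypothetical placeholder rather than a value from the paper: `TRUE_ALPHA`, `TRUE_X_C`, and the synthetic `sizes` are chosen only to show the mechanics, and the function names are illustrative. The fit is an ordinary least-squares line in log-log space, which is one simple approach and not necessarily the authors' procedure.

```python
import math


def power_law_loss(x, x_c, alpha):
    """Loss under the assumed form L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


def fit_power_law(xs, losses):
    """Fit L(x) = (x_c / x) ** alpha by least squares on (log x, log L).

    In log space the model is log L = alpha * log(x_c) - alpha * log(x),
    so the slope of the fitted line is -alpha.
    """
    log_x = [math.log(x) for x in xs]
    log_l = [math.log(loss) for loss in losses]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_l = sum(log_l) / n
    cov = sum((a - mean_x) * (b - mean_l) for a, b in zip(log_x, log_l))
    var = sum((a - mean_x) ** 2 for a in log_x)
    slope = cov / var
    intercept = mean_l - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)
    return x_c, alpha


# Hypothetical constants used only to generate noiseless synthetic data.
TRUE_ALPHA = 0.08
TRUE_X_C = 1e13

# Synthetic "model sizes" covering several orders of magnitude.
sizes = [10 ** k for k in range(5, 10)]
losses = [power_law_loss(n, TRUE_X_C, TRUE_ALPHA) for n in sizes]

x_c_hat, alpha_hat = fit_power_law(sizes, losses)

# Extrapolate the fitted trend one order of magnitude past the data.
predicted = power_law_loss(1e10, x_c_hat, alpha_hat)
print(f"alpha = {alpha_hat:.4f}, x_c = {x_c_hat:.3e}, L(1e10) = {predicted:.4f}")
```

With real measurements the points would be noisy and the fit would carry uncertainty, so extrapolations far beyond the observed range should be treated with caution.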
Cite this paper
Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. https://doi.org/10.48550/arXiv.2001.08361
@misc{kaplan2020scaling,
author = {Kaplan, Jared and McCandlish, Sam and Henighan, Tom and others},
title = {Scaling Laws for Neural Language Models},
eprint = {2001.08361},
archivePrefix = {arXiv},
year = {2020},
doi = {10.48550/arXiv.2001.08361},
url = {https://arxiv.org/abs/2001.08361}
}