scaling laws: papers

AI · Advances in Neural Information Processing Systems (NeurIPS) · Mar 2022 · Open access

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.

This paper (the 'Chinchilla' paper) investigates the compute-optimal trade-off between model size and number of training tokens for large language models under a fixed compute budget. By training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens, the authors find that model size and training tokens should be scaled in equal proportion: each doubling of model size calls for a doubling of training tokens. This implies that prior large models were significantly undertrained. Their 70B-parameter Chinchilla model, trained on about four times more data under the same compute budget as the 280B-parameter Gopher, outperformed Gopher and other much larger models.
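The equal-proportion result can be illustrated with a rough sketch. It is not the paper's fitting procedure. It assumes the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) and a fixed ratio of about 20 tokens per parameter, a rule of thumb consistent with Chinchilla's 70B parameters and roughly 1.4T training tokens. The function name and the default ratio are illustrative choices, not taken from the paper.

```python
import math


def compute_optimal_split(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumptions (not from the listing): C = 6 * N * D and D = r * N,
    with r = tokens_per_param. Solving gives N = sqrt(C / (6 * r)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget implied by a 70B-parameter model trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n, d = compute_optimal_split(budget)
print(f"params: {n:.3e}, tokens: {d:.3e}")

# Quadrupling compute doubles both parameters and tokens.
n4, d4 = compute_optimal_split(4.0 * budget)
print(f"ratio after 4x compute: params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")
```

Under these assumptions both N and D grow as the square root of compute, which is one way to read "scaled in equal proportion".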

llm · scaling laws · compute-optimal · chinchilla

AI · arXiv · Jan 2020 · Open access

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, et al.

This paper establishes empirical scaling laws showing that the cross-entropy loss of Transformer language models follows smooth power-law relationships with model size, dataset size, and the amount of training compute. Some of these trends span more than seven orders of magnitude, while architectural details such as width and depth have comparatively minor effects. The work provided a quantitative framework for predicting model performance and allocating compute budgets.
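A power-law relationship of this kind is a straight line in log-log space, which is how such exponents are typically estimated. The sketch below assumes the single-variable form L(N) = (N_c / N)^alpha and fits it by ordinary least squares on logarithms. The data are synthetic, and the constants `n_c = 1e14` and `alpha = 0.08` are made up for illustration; they are not the paper's fitted values.

```python
import math


def power_law_loss(n: float, n_c: float, alpha: float) -> float:
    """Loss under the assumed form L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha


def fit_power_law(sizes, losses):
    """Recover (n_c, alpha) from a least-squares line in log-log space.

    log L = alpha * log(n_c) - alpha * log(N), so the slope is -alpha
    and the intercept is alpha * log(n_c).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return n_c, alpha


# Synthetic, noise-free data generated from illustrative constants.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [power_law_loss(n, n_c=1e14, alpha=0.08) for n in sizes]

n_c_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"alpha = {alpha_hat:.4f}, n_c = {n_c_hat:.3e}")
```

With real training runs the points would be noisy and the fit only approximate; the same log-log approach applies to dataset size and compute under analogous assumed forms.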

llm · scaling laws · language modeling · compute