compute-optimal — papers

AI · Advances in Neural Information Processing Systems (NeurIPS) · Mar 2022 · Open access

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al.

This paper (the 'Chinchilla' paper) investigates the compute-optimal trade-off between model size and number of training tokens for large language models under a fixed compute budget. By training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens, the authors find that model size and training tokens should be scaled in equal proportion: each doubling of model size calls for a doubling of training tokens. This implies that earlier large models were significantly undertrained. Their 70B-parameter Chinchilla model, trained with the same compute budget as the 280B-parameter Gopher but on about four times more data, outperformed Gopher and other much larger models.
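The equal-proportion finding can be illustrated with a small back-of-the-envelope sketch. It is not the paper's fitting procedure. It assumes the common approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens) and a fixed tokens-per-parameter ratio of about 20, a rule of thumb consistent with Chinchilla's 70B parameters and roughly 1.4T tokens. The function name `compute_optimal_split` and the parameter `tokens_per_param` are hypothetical, chosen for illustration only.

```python
import math


def compute_optimal_split(flops, tokens_per_param=20.0):
    """Split a training budget into (parameters, tokens).

    Assumptions (not taken from the paper's fitted coefficients):
      * training compute C ~= 6 * N * D FLOPs
      * D = tokens_per_param * N, with ~20 as a rule-of-thumb ratio
    Under these assumptions N and D both grow as sqrt(C).
    """
    if flops <= 0 or tokens_per_param <= 0:
        raise ValueError("flops and tokens_per_param must be positive")
    # Solve C = 6 * N * (r * N) for N, where r = tokens_per_param.
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget roughly matching a 70B-parameter model on 1.4T tokens.
    budget = 6.0 * 70e9 * 1.4e12
    for c in (budget, 4 * budget):
        n, d = compute_optimal_split(c)
        print(f"C={c:.2e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the "scale in equal proportion" behaviour the summary describes.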

llm · scaling laws · compute-optimal · chinchilla