Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch · and 19 others (DeepMind)

Published 29 March 2022 · Advances in Neural Information Processing Systems (NeurIPS) · Conference paper

Summary

This paper (the 'Chinchilla' paper) investigates the compute-optimal trade-off between model size and training-token count for large language models. By training over 400 models ranging from 70M to over 16B parameters on 5B to 500B tokens, the authors find that model size and training data should be scaled in roughly equal proportion, which implies that prior large models were significantly undertrained. Their 70B-parameter Chinchilla model, trained on about four times more data than Gopher under the same compute budget, outperformed much larger models.

Key findings

For compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: each doubling of model size calls for a doubling of training tokens.
Many contemporary large models (e.g., GPT-3, Gopher) were substantially undertrained relative to their parameter count.
Chinchilla (70B), trained on 1.4T tokens, outperformed Gopher (280B), GPT-3 (175B), and Megatron-Turing NLG (530B) across many downstream tasks.
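
The equal-scaling finding can be illustrated with a rough sketch. It assumes the widely used approximation C ≈ 6·N·D for training FLOPs (N parameters, D tokens) and a fixed ratio of about 20 tokens per parameter, which is simply what the Chinchilla figures above imply (1.4T tokens / 70B parameters). Neither constant is stated on this page, and the function names are hypothetical, so treat the numbers as illustrative rather than as the authors' fitted results.

```python
import math

# Assumption: training compute is approximated as C ≈ 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal_allocation(compute_flops: float,
                               tokens_per_param: float = 20.0):
    """Split a compute budget into (parameters, tokens).

    Assumes a fixed tokens-per-parameter ratio (default 20, derived from
    1.4T tokens / 70B parameters), so both N and D grow as sqrt(C).
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Solve C = 6 * N * (r * N) for N, then D = r * N.
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    # Budget implied by Chinchilla's reported size and token count.
    budget = training_flops(70e9, 1.4e12)
    for scale in (1, 4, 16):
        n, d = compute_optimal_allocation(scale * budget)
        print(f"{scale:>2}x budget: {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is the "scale equally" rule in concrete form.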

Subjects & keywords

Artificial Intelligence · llm · scaling laws · compute-optimal · chinchilla

Cite this paper

APA

Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. Advances in Neural Information Processing Systems (NeurIPS). https://doi.org/10.48550/arXiv.2203.15556

BibTeX

@inproceedings{hoffmann2022training,
  author    = {Hoffmann, Jordan and Borgeaud, Sebastian and Mensch, Arthur and others},
  title     = {Training Compute-Optimal Large Language Models},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2022},
  doi       = {10.48550/arXiv.2203.15556},
  url       = {https://arxiv.org/abs/2203.15556}
}

Related in Artificial Intelligence

AI · 2023

Segment Anything

Alexander Kirillov, Eric Mintun and Nikhila Ravi

This paper introduces the Segment Anything project: a promptable image segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset. SAM combines an image encoder, a flexible prompt encoder (points, boxes, masks, text), and a fast mask decoder to produce valid segmentation masks from arbitrary prompts. Trained on over 1 billion masks across 11 million images, SAM shows strong zero-shot transfer to many segmentation tasks without additional training.

IEEE/CVF International Conference on Computer Vision (ICCV) · Open access

AI · 2023

GPT-4 Technical Report

OpenAI

This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.

arXiv · Open access

AI · 2023

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril and Gautier Izacard

The paper presents LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available datasets. It argues that strong performance can be reached without proprietary data and at smaller parameter counts than prior models. LLaMA-13B outperforms the much larger GPT-3 175B on most benchmarks, and LLaMA-65B is competitive with the best contemporary models such as Chinchilla-70B and PaLM-540B.

arXiv preprint (arXiv:2302.13971) · Open access