Machine learning and AI: the architectures, models, and methods behind modern deep learning, from vision and language to protein folding and game playing.
25 papers in this field
AI · IEEE/CVF International Conference on Computer Vision (ICCV) · Apr 2023 · Open access
This paper introduces the Segment Anything project: a promptable image segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset. SAM combines an image encoder, a flexible prompt encoder (points, boxes, masks, text), and a fast mask decoder to produce valid segmentation masks from arbitrary prompts. Trained on over 1 billion masks across 11 million images, SAM shows strong zero-shot transfer to many segmentation tasks without additional training.
This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.
The paper presents LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available datasets. It argues that strong performance can be reached without proprietary data and at smaller parameter counts than prior models. LLaMA-13B outperforms the much larger GPT-3 175B on most benchmarks, and LLaMA-65B is competitive with the best contemporary models such as Chinchilla-70B and PaLM-540B.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, et al.
The paper introduces LoRA, a parameter-efficient fine-tuning method that keeps the pretrained model weights frozen and instead learns small trainable low-rank decomposition matrices injected into the Transformer layers. This drastically cuts the number of trainable parameters and optimizer memory needed to adapt very large models to downstream tasks. The authors show LoRA matches or exceeds full fine-tuning quality across several models including GPT-3 175B while adding no extra inference latency.
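The low-rank update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`matvec`, `lora_forward`), the toy dimensions, and the `alpha` scaling value are assumptions chosen for the example. The frozen weight `w_frozen` is never modified; only the small matrices `a` and `b` would be trained.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(x, w_frozen, a, b, alpha):
    """Return W x + (alpha / r) * B (A x), where r is the rank of the update."""
    rank = len(a)
    base = matvec(w_frozen, x)          # frozen pretrained path
    update = matvec(b, matvec(a, x))    # trainable low-rank path
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]

random.seed(0)
d_in, d_out, rank = 8, 6, 2             # toy sizes (assumed for illustration)
w_frozen = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]  # zero init: the update starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

full_params = d_out * d_in              # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)     # parameters trained with the low-rank update
```

Because `b` starts at zero, the adapted layer initially reproduces the frozen layer exactly. After training, the product of `b` and `a` can be added into `w_frozen` once, which is why no extra inference latency is incurred.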
Jordan Hoffmann, Sebastian Borgeaud and Arthur Mensch
This paper (the 'Chinchilla' paper) investigates the compute-optimal trade-off between model size and training-token count for large language models. By training over 400 models from 70M to 16B parameters on 5B to 500B tokens, the authors find that model size and training data should be scaled in roughly equal proportion—implying that prior large models were significantly undertrained. Their 70B-parameter Chinchilla model, trained on far more data under the same compute budget as Gopher, outperformed much larger models.
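The equal-proportion scaling result can be turned into a back-of-the-envelope calculator. The sketch below rests on two assumptions that are not stated in the summary above: the common approximation that training compute is about 6 FLOPs per parameter per token, and a rule of thumb of roughly 20 training tokens per parameter. Both constants and the function name `compute_optimal` are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # assumption: C is approximately 6 * N * D
TOKENS_PER_PARAM = 20       # assumption: rule-of-thumb data-to-parameter ratio

def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) with both growing as sqrt(C)."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

params, tokens = compute_optimal(5.88e23)
params_4x, tokens_4x = compute_optimal(4 * 5.88e23)
```

Under these assumptions a budget of 5.88e23 FLOPs maps to about 7e10 parameters, and quadrupling the budget doubles both the parameter count and the token count.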
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al.
The paper (InstructGPT) shows how to align language models with user intent by fine-tuning GPT-3 on human-written demonstrations and then optimizing against a learned reward model with reinforcement learning from human feedback (RLHF). Human evaluators preferred outputs from a 1.3B-parameter InstructGPT model over the 175B GPT-3 model, despite the large size difference. The approach improves truthfulness and reduces toxic generations while causing only minimal regressions on standard NLP benchmarks.
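The learned reward model mentioned above is typically trained on pairs of responses ranked by human labelers. A minimal sketch of such a pairwise preference loss follows; the function name and the scalar inputs are assumptions for illustration, and a real system would compute the rewards with a neural network over full responses.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)) in a stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); rewrite to avoid overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

loss_correct_order = pairwise_reward_loss(2.0, 0.0)  # preferred response scored higher
loss_wrong_order = pairwise_reward_loss(0.0, 2.0)    # preferred response scored lower
```

The loss shrinks as the reward model scores the human-preferred response further above the rejected one, which is the signal the reinforcement learning stage then optimizes against.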
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, et al.
The paper shows that prompting a large language model with a few exemplars that include intermediate reasoning steps (a 'chain of thought') substantially improves its ability to solve multi-step reasoning problems. This reasoning ability emerges only in sufficiently large models and requires no fine-tuning. Across arithmetic, commonsense, and symbolic reasoning tasks, chain-of-thought prompting produces large gains, including a new state of the art on the GSM8K math word-problem benchmark.
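A chain-of-thought prompt is simply a few-shot prompt whose exemplar answers spell out the intermediate steps. The sketch below assembles one; the helper name `build_cot_prompt` and the "Q:/A:" layout are assumptions for illustration, and the exemplar is a simple arithmetic word problem of the kind used in such prompts.

```python
def build_cot_prompt(exemplars, question):
    """Join (question, reasoning, answer) exemplars and append the new question."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

exemplars = [(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11.",
    "11",
)]
prompt = build_cot_prompt(
    exemplars,
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?",
)
```

No model weights change; the only difference from standard few-shot prompting is the reasoning text placed before each exemplar's final answer.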
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser and Björn Ommer
The paper proposes latent diffusion models (LDMs), which apply the diffusion process in the compressed latent space of a pretrained autoencoder rather than directly in pixel space, greatly reducing compute. A cross-attention conditioning mechanism enables flexible inputs such as text and bounding boxes for tasks including text-to-image generation, inpainting, and super-resolution. LDMs achieve strong or state-of-the-art results across these tasks while being far more efficient to train and sample, and this architecture underlies Stable Diffusion.
John Jumper, Richard Evans, Alexander Pritzel, David Silver, Oriol Vinyals and Demis Hassabis
The paper introduces AlphaFold2, a deep-learning system that predicts three-dimensional protein structures directly from amino-acid sequence with near-experimental accuracy. It combines a novel attention-based Evoformer over multiple sequence alignments and pairwise representations with an end-to-end structure module that produces atomic coordinates. AlphaFold won the CASP14 assessment by a wide margin, delivering atomic-level accuracy for the majority of targets.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov and Neil Houlsby
This paper introduced the Vision Transformer (ViT), applying a standard Transformer encoder directly to sequences of image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
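Treating image patches as tokens amounts to cutting the image into a grid and flattening each cell. The sketch below shows that step only, using nested lists; the function name `patchify`, the 224x224 input, and the 16x16 patch size are assumptions for illustration, and the learned linear projection and position embeddings that follow are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches

# Assumed example input: a blank 224 x 224 RGB image.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image, 16)
```

With these sizes the image becomes a sequence of 196 tokens, each a vector of 768 raw pixel values, which a standard Transformer encoder can then process like a sentence.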
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, et al.
The paper presents CLIP, which learns visual representations by contrastively matching images to their natural-language captions over a 400-million-pair web dataset. The pretrained model can be applied zero-shot to many downstream vision tasks by framing class labels as text prompts, without task-specific fine-tuning. Used zero-shot, it matches the ImageNet accuracy of a supervised ResNet-50 and transfers robustly across a broad benchmark suite.
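The contrastive matching of images to captions can be sketched as a symmetric cross-entropy over a similarity matrix. This is a toy version under stated assumptions: embeddings are plain lists, the function names are hypothetical, and the temperature of 0.07 is an assumed fixed value rather than a learned parameter.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def cross_entropy(logits, target):
    """Softmax cross-entropy for one row of logits, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses; pair k matches pair k."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    n = len(logits)
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own caption and high when captions are mismatched, which is what pushes the two encoders toward a shared embedding space.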
This paper presented GPT-3, an autoregressive language model with 175 billion parameters, and studied its ability to perform tasks from natural-language descriptions and a few examples without gradient updates (in-context learning). Scaling the model dramatically improved few-shot performance across many NLP benchmarks, sometimes approaching fine-tuned systems. The authors also examined limitations, data contamination, and broader societal impacts of large language models.
The paper introduces denoising diffusion probabilistic models (DDPMs), a class of latent-variable generative models trained to reverse a fixed Gaussian noising process. It establishes a connection between diffusion models and denoising score matching with Langevin dynamics, and proposes a simplified, reweighted training objective. The resulting models produce high-quality image samples, reaching an Inception score of 9.46 and a then state-of-the-art FID of 3.17 on unconditional CIFAR-10.
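The fixed Gaussian noising process has a closed form that lets any noise level be sampled directly from a clean input. The sketch below shows that forward step; the function names, the 1000-step linear schedule, and its endpoints (1e-4 to 0.02) are assumptions for illustration, and the denoising network that would be trained to predict the returned noise is omitted.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal variance kept at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

alpha_bars = cumulative_alphas(linear_beta_schedule())
x_t, noise = q_sample([0.5, -0.25, 1.0], t=500, alpha_bars=alpha_bars, rng=random.Random(0))
```

A simplified training objective of the kind described above would then be the mean squared error between `noise` and a network's prediction of it given `x_t` and `t`.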
This paper establishes empirical scaling laws showing that the cross-entropy loss of Transformer language models follows smooth power-law relationships with model size, dataset size, and the amount of training compute. The relationships hold across many orders of magnitude, while architectural details such as width and depth have comparatively minor effects. The work provided a quantitative framework for predicting model performance and allocating compute budgets.
This paper introduces T5 (Text-to-Text Transfer Transformer), a framework that casts every NLP problem—translation, classification, question answering, summarization—as a text-to-text task with a unified model, objective, and decoding procedure. The authors conduct a large-scale empirical study comparing pre-training objectives, architectures, datasets, and transfer strategies, and release the C4 corpus. Scaling the model up to 11 billion parameters achieved state-of-the-art results on many benchmarks.
Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova
BERT is a language representation model pre-trained on large unlabeled corpora using masked language modeling and next-sentence prediction, yielding deeply bidirectional contextual representations. The pre-trained model can be fine-tuned with a single additional output layer to achieve strong performance across diverse downstream tasks. It set new state-of-the-art results on eleven NLP benchmarks at the time of publication.
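Masked language modeling corrupts a fraction of the input tokens and asks the model to recover the originals. The sketch below shows one way to build such training pairs; the 15% selection rate and the 80/10/10 split between a mask token, a random token, and the unchanged token are assumptions not stated in the summary above, and `mask_tokens` is a hypothetical helper.

```python
import random

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where nothing is predicted."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                      # position not selected for prediction
        labels[i] = token                 # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"          # most selected positions become a mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab) # some are replaced by a random token
        # otherwise the token is left unchanged but still predicted
    return inputs, labels

vocab = ["the", "cat", "sat", "on", "mat"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 50
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
```

Because the prediction at a masked position can draw on tokens to both its left and its right, this objective is what makes the resulting representations bidirectional.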
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever
This paper introduces GPT-2, a 1.5-billion-parameter Transformer language model trained on a large web-text corpus (WebText) with a simple next-token prediction objective. It demonstrates that a sufficiently large language model can perform many NLP tasks in a zero-shot setting, without task-specific training data or fine-tuning. The work argued that unsupervised language modeling at scale implicitly learns to perform downstream tasks from naturally occurring demonstrations.
David Silver, Julian Schrittwieser, Karen Simonyan and Demis Hassabis
This paper presented AlphaGo Zero, which learned to play Go solely through self-play reinforcement learning without any human game data or handcrafted features, using a single neural network and a simpler tree search. Starting from random play, it discovered Go knowledge and novel strategies on its own. AlphaGo Zero surpassed all previous versions of AlphaGo, including the one that beat Lee Sedol.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al.
The paper introduced the Transformer, a sequence-transduction architecture based entirely on attention mechanisms, dispensing with the recurrence and convolutions used by prior state-of-the-art models. By relying on multi-head self-attention, the model is more parallelizable and trains substantially faster, while achieving new state-of-the-art results on machine translation. The architecture became the foundation for subsequent large language models and much of modern deep learning.
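The core attention operation can be written compactly. The sketch below implements single-head scaled dot-product attention on plain lists; the function names are hypothetical, and multi-head attention would run several such heads in parallel on learned projections of the inputs and concatenate the results.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a softmax(q . k / sqrt(d_k))-weighted average of values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
uniform = attention([[0.0, 0.0]], keys, values)    # query with no preference
focused = attention([[10.0, 0.0]], keys, values)   # query aligned with the first key
```

Every query is processed independently of the others, which is why the computation parallelizes across sequence positions in a way recurrent models cannot.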
Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun
The authors introduced a residual learning framework that reformulates network layers to learn residual functions with reference to their inputs (via identity 'shortcut' connections), making very deep networks substantially easier to optimize. They showed that such residual networks gain accuracy from greatly increased depth, evaluating models up to 152 layers deep on ImageNet at lower complexity than VGG networks. The approach won first place in the ILSVRC 2015 classification task and yielded large improvements on detection and localization benchmarks.
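The identity shortcut can be illustrated with a small fully connected block. This is a simplified sketch: the original networks use convolutions and batch normalization, whereas the version below uses two plain matrix multiplications, and the names `residual_block`, `w1`, and `w2` are assumptions for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in vector]

def residual_block(x, w1, w2):
    """Return relu(F(x) + x), where F is a two-layer transformation of x."""
    hidden = relu(matvec(w1, x))
    residual = matvec(w2, hidden)                         # F(x): the learned residual
    return relu([r + s for r, s in zip(residual, x)])     # add the identity shortcut

x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_output = residual_block(x, zeros, zeros)
```

With all weights at zero the block passes a non-negative input through unchanged, so adding layers cannot easily make the network worse; each layer only has to learn a correction to the identity.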
David Silver, Aja Huang, Chris J. Maddison and Demis Hassabis
This paper introduced AlphaGo, a system combining deep convolutional neural networks (policy and value networks) trained by supervised learning from human games and reinforcement learning by self-play, integrated with Monte Carlo tree search. The networks reduce the breadth and depth of the search needed to evaluate Go positions. AlphaGo defeated other Go programs and became the first program to beat a professional human Go player (Fan Hui) on a full-size board.
The paper introduces U-Net, an encoder-decoder convolutional network with a contracting path to capture context and a symmetric expanding path with skip connections for precise localization. Combined with heavy data augmentation, the architecture trains end-to-end from very few annotated images. It won the ISBI cell-tracking and neuronal-structure segmentation challenges and segments a 512x512 image in under a second on a GPU.
This paper introduced batch normalization, a technique that normalizes layer inputs using mini-batch statistics to reduce internal covariate shift during training. It allows higher learning rates and less careful initialization, accelerates convergence, and acts as a regularizer. Applied to image classification networks, it dramatically reduced training steps and improved accuracy.
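The normalization step uses the mean and variance of each feature across the mini-batch, followed by a learned scale and shift. The sketch below covers the training-time computation only; the function name, the `eps` value, and the toy batch are assumptions for illustration, and the fixed population statistics used at inference time are omitted.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then apply scale gamma and shift beta."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(num_features)
    ]
    return [
        [
            gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
            for j in range(num_features)
        ]
        for row in batch
    ]

# Two features on very different scales end up on the same scale.
normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]], gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Because the output scale no longer depends on the scale of the incoming activations, later layers see more stable input distributions, which is what permits the higher learning rates mentioned above.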
This paper introduced Adam, a first-order gradient-based optimization algorithm for stochastic objective functions that computes adaptive per-parameter learning rates from estimates of the first and second moments of the gradients. The method is computationally efficient, has low memory requirements, and is well suited to large-scale and noisy/sparse-gradient problems. It became one of the most widely used optimizers in deep learning.
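The update rule can be sketched directly from the description of first- and second-moment estimates. The function name `adam_step`, the list-based parameters, and the toy objective are assumptions for illustration; the default hyperparameters shown are the commonly used ones.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step count used for bias correction."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # first-moment (mean) estimate
        v_i = beta2 * v_i + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m_i / (1 - beta1 ** t)             # correct the bias toward zero
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
params, m, v = [1.0], [0.0], [0.0]
for t in range(1, 4):
    grads = [2.0 * p for p in params]
    params, m, v = adam_step(params, grads, m, v, t)
```

Dividing by the root of the second-moment estimate makes the step size roughly independent of the gradient's scale, so each parameter moves by about `lr` per step early in training.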
Volodymyr Mnih, Koray Kavukcuoglu and David Silver
The paper introduced the Deep Q-Network (DQN), which combines Q-learning with deep convolutional networks and stabilizing techniques such as experience replay and a target network. Trained end-to-end from raw pixels and game scores, a single architecture and hyperparameter set learned to play 49 Atari 2600 games. Across this set it performed at a level comparable to that of a professional human games tester, demonstrating a general agent learning directly from high-dimensional sensory input.
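The two stabilizing techniques named above can be sketched without a neural network. In this minimal illustration the class `ReplayBuffer`, the function `td_target`, and the discount factor of 0.99 are assumptions; `next_q_values` stands in for the output of a separate, periodically updated target network.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), batch_size)

def td_target(reward, next_q_values, done, gamma=0.99):
    """Q-learning target: reward plus the discounted best next-state value."""
    if done:
        return reward                          # no future value after a terminal state
    return reward + gamma * max(next_q_values)

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((f"state{step}", 0, 1.0, f"state{step + 1}", False))
batch = buffer.sample(2, random.Random(0))
target = td_target(reward=1.0, next_q_values=[0.5, 2.0], done=False, gamma=0.9)
```

Sampling from the buffer breaks the correlation between consecutive frames, and computing targets from a slowly changing network keeps the regression target from shifting with every gradient step.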