This paper introduces Enformer, a transformer-based deep learning model that predicts gene expression and chromatin states directly from DNA sequence by integrating regulatory information from up to ~100 kb away. By using self-attention to capture long-range interactions, it substantially improves prediction accuracy over prior convolutional models. The approach also improves prediction of the effects of non-coding genetic variants on expression.
John Jumper, Richard Evans, Alexander Pritzel, David Silver, Oriol Vinyals and Demis Hassabis
The paper introduces AlphaFold2, a deep-learning system that predicts three-dimensional protein structures directly from amino-acid sequence with near-experimental accuracy. It combines a novel attention-based Evoformer over multiple sequence alignments and pairwise representations with an end-to-end structure module that produces atomic coordinates. AlphaFold won the CASP14 assessment by a wide margin, delivering atomic-level accuracy for the majority of targets.
This paper presented RoseTTAFold, a three-track neural network that simultaneously processes one-dimensional sequence, two-dimensional residue-pair distances, and three-dimensional atomic coordinate information, with information flowing between the tracks. The method achieved protein structure prediction accuracy approaching that of AlphaFold2 while being more computationally efficient. It also demonstrated rapid generation of accurate models for protein-protein complexes.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov and Neil Houlsby
This paper introduced the Vision Transformer (ViT), applying a standard Transformer encoder directly to sequences of image patches treated as tokens, with minimal vision-specific inductive biases. When pre-trained on large datasets and transferred to downstream tasks, ViT matched or exceeded state-of-the-art convolutional networks while requiring fewer computational resources to train. It demonstrated that convolutions are not necessary for strong image recognition at scale.
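The patch-as-token idea can be sketched in a few lines. This is a minimal illustration only, assuming an image stored as nested lists of shape height x width x channels; the function name `image_to_patches` is hypothetical, and the learned linear projection, class token, and position embeddings that ViT applies afterwards are omitted.

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened square patches.

    Patches are returned in raster order (left to right, top to bottom).
    Each patch is a flat list of patch_size * patch_size * C values, which
    plays the role of one input token before any learned projection.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    patch.extend(image[row][col])  # append all channels of this pixel
            patches.append(patch)
    return patches
```

With this layout, a 224x224 image cut into 16x16 patches gives 14 x 14 = 196 tokens, so sequence length is set by image resolution and patch size rather than by pixel count.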
The paper introduces denoising diffusion probabilistic models (DDPMs), a class of latent-variable generative models trained to reverse a fixed Gaussian noising process. It establishes a connection between diffusion models and denoising score matching with Langevin dynamics, and proposes a simplified, reweighted training objective. The resulting models produce high-quality image samples, achieving an Inception score of 9.46 and a state-of-the-art FID of 3.17 on unconditional CIFAR-10, although their log-likelihoods are not competitive with those of other likelihood-based models.
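The fixed Gaussian noising process has a closed form, so a noisy sample at any step can be drawn directly from the clean input. The sketch below assumes the linear variance schedule used in the paper's experiments (1000 steps, from 1e-4 to 0.02); the function names are hypothetical, and the denoising network itself is left out.

```python
import math
import random

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1 .. beta_T."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I).

    Returns (x_t, noise). Under the simplified objective, a network would
    be trained to predict `noise` from x_t and t; no network is defined here.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return x_t, noise
```

Because `alpha_bars` decays toward zero, the final step is close to pure Gaussian noise, which is what lets sampling start from a standard normal draw.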
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al.
The paper introduced the Transformer, a sequence-transduction architecture based entirely on attention mechanisms, dispensing with the recurrence and convolutions used by prior state-of-the-art models. By relying on multi-head self-attention, the model is more parallelizable and trains substantially faster, while achieving new state-of-the-art results on machine translation. The architecture became the foundation for subsequent large language models and much of modern deep learning.
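The core operation, scaled dot-product attention, computes softmax(Q K^T / sqrt(d_k)) V. A minimal single-head sketch in plain Python follows; it assumes queries, keys, and values are given as lists of vectors, and it leaves out the learned projections, multiple heads, and masking of the full architecture.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention over lists of vectors.

    queries: n vectors of length d_k
    keys:    m vectors of length d_k
    values:  m vectors of length d_v
    Returns (outputs, weights): n output vectors of length d_v and the
    n x m attention weights, where each row of weights sums to 1.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w_j * v[c] for w_j, v in zip(w, values)) for c in range(d_v)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights
```

Each query's output depends only on the shared keys and values, not on other queries' outputs, so all positions can be computed in parallel; this is the property the summary contrasts with recurrence.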
Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun
The authors introduced a residual learning framework that reformulates network layers to learn residual functions with reference to their inputs (via identity 'shortcut' connections), making very deep networks substantially easier to optimize. They showed that such residual networks gain accuracy from greatly increased depth, evaluating models up to 152 layers deep on ImageNet at lower complexity than VGG networks. The approach won first place in the ILSVRC 2015 classification task and yielded large improvements on detection and localization benchmarks.
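The residual reformulation amounts to computing F(x) + x instead of F(x). The sketch below uses small dense layers purely for illustration (the paper's blocks use convolutions and batch normalization), and all function names are hypothetical.

```python
def relu(vector):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in vector]

def linear(vector, weight, bias):
    """Dense layer; `weight` is a list of rows, one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weight, bias)]

def residual_block(x, w1, b1, w2, b2):
    """Return relu(F(x) + x), where F is linear -> relu -> linear.

    The identity shortcut requires F(x) to have the same length as x.
    """
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    return relu([fi + xi for fi, xi in zip(f, x)])
```

When all weights and biases in F are zero, the block passes a non-negative input through unchanged. An identity mapping is therefore trivial to represent, which is one intuition for why adding such blocks should not make a deep network harder to optimize.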
David Silver, Aja Huang, Chris J. Maddison and Demis Hassabis
This paper introduced AlphaGo, a system combining deep convolutional neural networks (policy and value networks) trained by supervised learning from human games and reinforcement learning by self-play, integrated with Monte Carlo tree search. The networks reduce the breadth and depth of the search needed to evaluate Go positions. AlphaGo defeated other Go programs and became the first program to beat a professional human Go player (Fan Hui) on a full-size board.
The paper introduces U-Net, an encoder-decoder convolutional network with a contracting path to capture context and a symmetric expanding path with skip connections for precise localization. Combined with heavy data augmentation, the architecture trains end-to-end from very few annotated images. It outperformed the prior best method on the ISBI challenge for segmentation of neuronal structures, won the ISBI 2015 cell-tracking challenge in the transmitted-light microscopy categories, and segments a 512x512 image in under a second on a GPU.
This paper introduced batch normalization, a technique that normalizes layer inputs using mini-batch statistics to reduce internal covariate shift during training. It allows higher learning rates and less careful initialization, accelerates convergence, and acts as a regularizer. Applied to a state-of-the-art image classification network, it reached the same accuracy with 14 times fewer training steps and, with further training, surpassed the original model.
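The training-time transform can be written compactly: each feature is standardized with the mini-batch mean and variance, then rescaled by a learned `gamma` and shifted by a learned `beta`. The sketch below covers only this training-mode forward pass on a batch given as a list of feature vectors; the running statistics used at inference and the backward pass are omitted, and the function name is hypothetical.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    batch: list of examples, each a list of features
    gamma, beta: per-feature scale and shift (learned in a real network)
    eps: small constant added to the variance for numerical stability
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased mini-batch variance
        for i, x in enumerate(column):
            out[i][j] = gamma[j] * (x - mean) / math.sqrt(var + eps) + beta[j]
    return out
```

With `gamma` set to 1 and `beta` to 0, every output feature has approximately zero mean and unit variance across the batch, regardless of the scale of the incoming activations.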
This paper introduced Adam, a first-order gradient-based optimization algorithm for stochastic objective functions that computes adaptive per-parameter learning rates from estimates of the first and second moments of the gradients. The method is computationally efficient, has low memory requirements, and is well suited to large-scale and noisy/sparse-gradient problems. It became one of the most widely used optimizers in deep learning.
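The update rule is short enough to state in full. The sketch below follows the published algorithm with its default hyperparameters (step size 0.001, decay rates 0.9 and 0.999, epsilon 1e-8); the wrapper `adam_minimize` and its `grad_fn` argument are hypothetical conveniences for a self-contained example.

```python
import math

def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates and return the final parameters.

    grad_fn: maps a parameter list to a gradient list of the same length
    theta:   initial parameters (not modified in place)
    """
    theta = list(theta)
    m = [0.0] * len(theta)  # first-moment (mean) estimate
    v = [0.0] * len(theta)  # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        grad = grad_fn(theta)
        for i, g in enumerate(grad):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** t)  # correct bias toward zero
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

Because the step divides the first-moment estimate by the square root of the second, the very first update has magnitude close to `lr` whatever the scale of the gradient, which is part of why the method is comparatively insensitive to gradient rescaling.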