High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer

Published 20 December 2021 · IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022 · Conference paper

Summary

The paper proposes latent diffusion models (LDMs), which apply the diffusion process in the compressed latent space of a pretrained autoencoder rather than directly in pixel space, greatly reducing compute. A cross-attention conditioning mechanism enables flexible inputs such as text and bounding boxes for tasks including text-to-image generation, inpainting, and super-resolution. LDMs achieve strong or state-of-the-art results across these tasks while being far more efficient to train and sample, and this architecture underlies Stable Diffusion.
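To give a rough sense of why working in a compressed latent space saves compute, the sketch below counts how many values the diffusion model has to process per image. The downsampling factor `f=8` and the 4 latent channels are illustrative assumptions, not figures stated in this summary, and the function names are hypothetical. The ratio is only a proxy for cost, since real savings depend on the network architecture.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Shape (h, w, c) of the latent for a height x width image.

    f and latent_channels are assumed values chosen for illustration.
    """
    if height % f or width % f:
        raise ValueError("image sides must be divisible by the downsampling factor")
    return height // f, width // f, latent_channels


def compression_ratio(height, width, f=8, latent_channels=4, image_channels=3):
    """How many times fewer values the latent holds than the pixel image."""
    h, w, c = latent_shape(height, width, f, latent_channels)
    return (height * width * image_channels) / (h * w * c)


if __name__ == "__main__":
    print(latent_shape(512, 512))       # (64, 64, 4)
    print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions a 512×512 RGB image is represented by 48 times fewer values, so every denoising step operates on a much smaller tensor.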

Key findings

Running diffusion in a learned latent space drastically lowers training and inference cost while preserving image fidelity.
A cross-attention conditioning module turns the model into a general, controllable image generator for text and other modalities.
LDMs set competitive or new state-of-the-art results on tasks such as inpainting, class-conditional ImageNet generation, and text-to-image synthesis.
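The cross-attention conditioning named in the findings above can be illustrated with a minimal, single-head sketch in plain Python. Queries come from the latent image tokens, while keys and values come from the conditioning tokens (for example, encoded text). All names, shapes, and the identity projection matrices here are assumptions for illustration; this is not the authors' implementation, which uses learned projections inside a U-Net.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head attention: latent tokens attend to conditioning tokens."""
    q = matmul(latent_tokens, w_q)  # (n_latent, d)
    k = matmul(cond_tokens, w_k)    # (n_cond, d)
    v = matmul(cond_tokens, w_v)    # (n_cond, d)
    d = len(q[0])
    # Scaled dot-product score between every latent token and every condition token.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections
    latents = [[1.0, 0.0], [0.0, 1.0]]   # two latent tokens
    cond = [[1.0, 2.0], [3.0, 0.5], [0.0, 1.0]]  # three conditioning tokens
    out, weights = cross_attention(latents, cond, identity, identity, identity)
    print(len(out), len(out[0]))  # 2 2
```

The output has one row per latent token, so the conditioning can have any length or modality as long as an encoder maps it to a sequence of vectors.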

Subjects & keywords

Artificial Intelligence · latent diffusion · image synthesis · generative models · computer vision · stable diffusion

Cite this paper

APA

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. https://doi.org/10.1109/CVPR52688.2022.01042

BibTeX

@inproceedings{rombach2022highresolution,
  author    = {Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
  title     = {High-Resolution Image Synthesis with Latent Diffusion Models},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022},
  year      = {2022},
  doi       = {10.1109/CVPR52688.2022.01042},
  url       = {https://arxiv.org/abs/2112.10752}
}

Related in Artificial Intelligence

AI · 2023

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al.

This paper introduces the Segment Anything project: a promptable image segmentation task, the Segment Anything Model (SAM), and the SA-1B dataset. SAM combines an image encoder, a flexible prompt encoder (points, boxes, masks, text), and a fast mask decoder to produce valid segmentation masks from arbitrary prompts. Trained on over 1 billion masks across 11 million images, SAM shows strong zero-shot transfer to many segmentation tasks without additional training.

IEEE/CVF International Conference on Computer Vision (ICCV) · Open access

AI · 2023

GPT-4 Technical Report

OpenAI

This technical report describes GPT-4, a large-scale multimodal Transformer model that accepts image and text inputs and produces text outputs. The report emphasizes that GPT-4 achieves human-level performance on a range of professional and academic benchmarks, and details infrastructure and optimization methods that allowed performance to be predicted from much smaller models. For competitive and safety reasons, the report withholds architecture, dataset, and training details.

arXiv · Open access

AI · 2023

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al.

The paper presents LLaMA, a family of foundation language models ranging from 7B to 65B parameters trained exclusively on publicly available datasets. It argues that strong performance can be reached without proprietary data and at smaller parameter counts than prior models. LLaMA-13B outperforms the much larger GPT-3 175B on most benchmarks, and LLaMA-65B is competitive with the best contemporary models such as Chinchilla-70B and PaLM-540B.

arXiv preprint (arXiv:2302.13971) · Open access