High-Resolution Image Synthesis with Latent Diffusion Models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
Summary
The paper proposes latent diffusion models (LDMs), which apply the diffusion process in the compressed latent space of a pretrained autoencoder rather than directly in pixel space, greatly reducing compute. A cross-attention conditioning mechanism enables flexible inputs such as text and bounding boxes for tasks including text-to-image generation, inpainting, and super-resolution. LDMs achieve strong or state-of-the-art results across these tasks while being far more efficient to train and sample, and this architecture underlies Stable Diffusion.
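To make the compute argument concrete, here is a minimal sketch of how much smaller a latent is than the image it encodes, followed by one standard forward-noising step applied to that latent. The downsampling factor `f=8`, the 4 latent channels, the 256x256 input, and the helper names `latent_shape` and `forward_noise` are illustrative assumptions, not values taken from this summary. The random `z0` stands in for the output of a pretrained encoder.

```python
import numpy as np

def latent_shape(height, width, f, latent_channels):
    """Shape of a latent whose spatial sides are downsampled by factor f (assumed setup)."""
    return (height // f, width // f, latent_channels)

def forward_noise(z0, alpha_bar_t, rng):
    """Draw z_t from N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I), the usual
    closed-form forward diffusion step, applied here to a latent instead of pixels."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps

rng = np.random.default_rng(0)

pixel_shape = (256, 256, 3)                                  # illustrative RGB image size
z_shape = latent_shape(256, 256, f=8, latent_channels=4)     # hypothetical f and channel count
ratio = np.prod(pixel_shape) / np.prod(z_shape)              # values per image vs. per latent

z0 = rng.standard_normal(z_shape)                            # stand-in for an encoder output
z_t, eps = forward_noise(z0, alpha_bar_t=0.5, rng=rng)

print(z_shape, ratio)
```

Under these assumed settings the denoising network operates on 48 times fewer values per sample than a pixel-space model would, which is the source of the savings described above.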
Key findings
- Running diffusion in a learned latent space drastically lowers training and inference cost while preserving image fidelity.
- A cross-attention conditioning module turns the model into a general, controllable image generator for text and other modalities.
- LDMs achieve competitive or new state-of-the-art results on tasks such as inpainting, class-conditional ImageNet generation, and text-to-image synthesis.
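The cross-attention conditioning in the second finding can be sketched as scaled dot-product attention in which queries come from the flattened latent features and keys and values come from the encoded conditioning input (for example, text token embeddings). This is a single-head NumPy illustration under assumed dimensions. The random projection matrices and the names `cross_attention`, `latent_tokens`, and `cond_tokens` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latent positions attend over conditioning tokens."""
    q = latent_tokens @ w_q                      # (n_latent, d_attn)
    k = cond_tokens @ w_k                        # (n_cond, d_attn)
    v = cond_tokens @ w_v                        # (n_cond, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (n_latent, n_cond)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_latent, d_latent = 16, 8     # e.g. a flattened 4x4 feature map (illustrative)
n_cond, d_cond = 5, 12         # e.g. 5 conditioning token embeddings (illustrative)
d_attn = 8

latent_tokens = rng.standard_normal((n_latent, d_latent))
cond_tokens = rng.standard_normal((n_cond, d_cond))
w_q = rng.standard_normal((d_latent, d_attn))
w_k = rng.standard_normal((d_cond, d_attn))
w_v = rng.standard_normal((d_cond, d_attn))

out, weights = cross_attention(latent_tokens, cond_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Because only the keys and values depend on the conditioning input, the same layer can accept any modality for which an encoder produces a sequence of embeddings.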
Cite this paper
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, & Björn Ommer (2022). High-Resolution Image Synthesis with Latent Diffusion Models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022. https://doi.org/10.1109/CVPR52688.2022.01042
@inproceedings{rombach2022highresolution,
author = {Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
title = {High-Resolution Image Synthesis with Latent Diffusion Models},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022},
year = {2022},
doi = {10.1109/CVPR52688.2022.01042},
url = {https://arxiv.org/abs/2112.10752}
}