High-Resolution Image Synthesis with Latent Diffusion Models
The paper proposes latent diffusion models (LDMs), which apply the diffusion process in the compressed latent space of a pretrained autoencoder rather than directly in pixel space, greatly reducing compute. A cross-attention conditioning mechanism enables flexible inputs such as text and bounding boxes for tasks including text-to-image generation, inpainting, and super-resolution. LDMs achieve strong or state-of-the-art results across these tasks while being far more efficient to train and sample, and this architecture underlies Stable Diffusion.