Controllable image synthesis with user scribbles has gained widespread interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we find that prior works suffer from an intrinsic domain shift problem, wherein the generated outputs often lack detail and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, a practical approximation can be obtained with just a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user can also control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
We propose a diffusion-based guided image synthesis framework which models the output image as the solution
of a constrained optimization problem. Given a reference painting y, the constrained optimization (a)
is posed so as to find a solution x satisfying two constraints: 1) upon painting x with an autonomous painting
function f, we should recover a painting f(x) which is similar to the reference painting y; and 2) the output x should lie in the target
data subspace defined by the text prompt.
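In notation, a minimal sketch of this formulation is given below; the painting distance d and the text-conditioned subspace S(c) are assumed placeholders for illustration and need not match the paper's exact objective.

\begin{equation}
    x^{\ast} \;=\; \arg\min_{x} \; d\bigl(f(x),\, y\bigr)
    \quad \text{subject to} \quad x \in \mathcal{S}(c),
\end{equation}

where f is the autonomous painting function, y is the reference painting, d measures the distance between paintings (e.g., an L2 loss), and \mathcal{S}(c) denotes the subspace of realistic outputs consistent with the text prompt c.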
Comparison with prior works. Compared with prior works, our method provides a more practical approach for improving output realism (with respect to the target domain) while still maintaining faithfulness to the reference painting.
More Results. Our approach allows the user to easily generate realistic image outputs across a range of data modalities.
While performing guided image synthesis with coarse user-scribbles,
the semantics of different painted regions are inferred in an implicit manner. For instance, in the following figure, we note that for different outputs,
the blue region can be inferred as a river, a waterfall, or a valley. Also note that some painted regions might be omitted entirely (e.g., the brown strokes for the hut)
if the model does not recognize the corresponding strokes as a distinct semantic entity.
We show that by simply defining a cross-attention based correspondence between the input text tokens and the reference painting, the user can control the semantics of different painting regions without needing any additional training or finetuning.
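To make this concrete, the sketch below shows one way such a correspondence could be injected into a cross-attention layer: a user-supplied mask for each stroke region biases the attention logits of the pixels in that region towards the text token assigned to it. The tensor shapes, the additive-bias mechanism, and the strength hyperparameter are illustrative assumptions, not the exact released implementation.

import torch
import torch.nn.functional as F

def biased_cross_attention(q, k, v, region_masks, token_ids, strength=5.0):
    # q: (B, N_pixels, d) image-latent queries; k, v: (B, N_tokens, d) text keys/values.
    # region_masks: list of boolean (N_pixels,) masks, one per painted stroke region.
    # token_ids: token_ids[i] is the index of the text token describing region_masks[i].
    # strength: additive logit bias (assumed hyperparameter controlling semantic guidance).
    d = q.shape[-1]
    logits = torch.einsum("bnd,bmd->bnm", q, k) / d ** 0.5  # (B, N_pixels, N_tokens)

    # Encourage pixels inside each painted region to attend to their designated token.
    bias = torch.zeros_like(logits)
    for mask, tok in zip(region_masks, token_ids):
        bias[:, mask, tok] += strength

    attn = F.softmax(logits + bias, dim=-1)
    return torch.einsum("bnm,bmd->bnd", attn, v)

In practice, a function of this form would stand in for the standard cross-attention computation inside the denoising UNet, so that each painted region is steered towards the semantics of its assigned text token during generation.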
As shown in the figure above, we find that the proposed approach provides a high level of control over the output image attributes (both color composition and fine-grained semantics), while still maintaining realism with respect to the target domain. Thus a natural question arises: can we use the proposed approach to generate realistic photos with out-of-distribution text prompts?
As shown below, we observe both success and failure cases for out-of-distribution prompts. For instance, while the model is able to generate "realistic photos of cats with six legs"
(note that for the same inputs, prior works either
generate faithful but cartoon-like outputs or simply generate regular cats),
it performs poorly when generating "a photo of a rat chasing a lion".
A key component of the proposed method is to obtain an approximate solution for the constrained problem formulation (discussed above)
using simple gradient descent. In the following figure, we visualize the variation in generated outputs as the number of gradient descent steps used for
performing the optimization is increased.
As shown above, we find that for N=0, the generated outputs are sampled randomly from the subspace of outputs conditioned only on the text.
As the number of gradient-descent steps increases, the outputs converge to a subset of solutions
within the target subspace (conditioned on the text prompt) which exhibit higher faithfulness to
the provided reference painting. Note that this behaviour is in contrast with prior works such as SDEdit,
wherein an increase in faithfulness to the reference comes at the cost of a decrease in the realism of
the generated outputs.
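A hedged sketch of this approximation in PyTorch is given below; the differentiable painting function paint (standing in for f) is a hypothetical placeholder, and the refined latent is assumed to subsequently seed a single text-conditioned reverse diffusion pass.

import torch

def refine_start_latent(z_init, y_ref, paint, n_steps=50, lr=0.05):
    # z_init: randomly sampled starting latent for the reverse diffusion pass.
    # y_ref:  reference stroke painting (image tensor).
    # paint:  differentiable painting function mapping latents to paintings (placeholder for f).
    # With n_steps = 0 the latent is left untouched, so sampling stays purely
    # text-conditioned; larger n_steps increases faithfulness to y_ref.
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(paint(z), y_ref)  # approximates d(f(x), y)
        loss.backward()
        opt.step()
    return z.detach()  # seeds one text-conditioned reverse-diffusion pass

This mirrors the behaviour described above: with n_steps = 0 the output is a random sample from the text-conditioned subspace, and increasing n_steps trades off towards higher faithfulness to the reference without leaving that subspace.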
@inproceedings{singh2023high,
title={High-Fidelity Guided Image Synthesis With Latent Diffusion Models},
author={Singh, Jaskirat and Gould, Stephen and Zheng, Liang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5997--6006},
year={2023}
}