We show randomly selected generated samples from our method (2nd column), DreamBooth (3rd column), and Textual Inversion (4th column). DreamBooth fine-tunes all parameters of the diffusion model, keeping the text transformer frozen, and uses generated images as the regularization dataset. Textual Inversion optimizes only a new word embedding token for each concept.
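For concreteness, below is a minimal PyTorch/diffusers sketch of the parameter selection described above, i.e., which weights each baseline optimizes. The checkpoint name and the `<new-concept>` placeholder token are illustrative assumptions, not the exact training code of either method.

```python
# Sketch (not the released training code): which parameters each baseline trains.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# DreamBooth: fine-tune all diffusion-model (UNet) parameters; text transformer stays frozen.
pipe.text_encoder.requires_grad_(False)
dreambooth_params = list(pipe.unet.parameters())

# Textual Inversion: optimize only the embedding of one newly added placeholder token
# (in a separate run; shown here side by side for comparison).
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
tokenizer.add_tokens(["<new-concept>"])                        # hypothetical placeholder token
text_encoder.resize_token_embeddings(len(tokenizer))
new_token_id = tokenizer.convert_tokens_to_ids("<new-concept>")
embedding_weight = text_encoder.get_input_embeddings().weight  # only the new row is trained
textual_inversion_params = [embedding_weight]                  # gradients of other rows masked in practice

# Each baseline then passes its parameter list to an optimizer, e.g. for DreamBooth:
optimizer = torch.optim.AdamW(dreambooth_params, lr=1e-6)
```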
We show randomly selected generated samples from our joint training approach (2nd column), our optimization-based approach (3rd column), and DreamBooth (4th column) for multi-concept composition.
We show a qualitative comparison of our method (2nd column) with Ours (compressed model) (3rd column) and Ours (w/ Gen), i.e., using generated images as regularization, on 5 prompts per dataset. The compressed model requires only 15 MB of storage compared to 75 MB for the original fine-tuned models, while maintaining similar sample quality. Our method can also be used with generated images as regularization. This leads to similar sample quality on the target images but some overfitting and worse samples on related concepts, as shown in our paper.
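The caption above does not specify how the compressed model is obtained; as one plausible sketch (an assumption, not necessarily the paper's exact procedure), the storage savings can come from keeping only a low-rank approximation of the difference between fine-tuned and pretrained weights:

```python
# Assumed compression scheme: store a low-rank (SVD) approximation of the
# weight *difference* from the pretrained model instead of the full fine-tuned layer.
import torch

def compress_delta(pretrained_w: torch.Tensor, finetuned_w: torch.Tensor, rank: int = 4):
    """Return low-rank factors (U_r, V_r) so that pretrained_w + U_r @ V_r ~= finetuned_w."""
    delta = finetuned_w - pretrained_w
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]          # fold singular values into the left factor
    V_r = Vh[:rank, :]
    return U_r, V_r                       # store these instead of the full delta

def decompress(pretrained_w: torch.Tensor, U_r: torch.Tensor, V_r: torch.Tensor):
    """Reconstruct an approximate fine-tuned weight at load time."""
    return pretrained_w + U_r @ V_r

# Example: a 320x768 projection stored at rank 4 needs (320 + 768) * 4 floats
# instead of 320 * 768, roughly a 56x reduction for that layer.
```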
We compare our Custom Diffusion (joint training) with the proposed baselines: fine-tuning all weights in our method and sequential training on the multiple concepts. Fine-tuning all weights performs worse than our method on multi-concept fine-tuning (as also shown quantitatively in the main paper). Sequential training yields subpar results on the first concept, which is most visible in the Moongate + Dog setting.
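To make the joint vs. sequential distinction concrete, below is a minimal sketch of the two training schedules; the generic `train_steps` loop and the `loss_fn` argument are hypothetical stand-ins for the usual diffusion fine-tuning objective, not the released training code.

```python
# Sketch of the two multi-concept training schedules compared above.
import itertools
import torch
from torch.utils.data import ConcatDataset, DataLoader

def train_steps(model, dataset, steps, loss_fn, lr=1e-5):
    """Generic fine-tuning loop: run `steps` optimizer updates on `dataset`."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for batch in itertools.islice(itertools.cycle(loader), steps):
        opt.zero_grad()
        loss_fn(model, batch).backward()
        opt.step()

def joint_training(model, concept_a_ds, concept_b_ds, steps, loss_fn):
    # Our setting: optimize on both concepts at once from a merged dataset.
    train_steps(model, ConcatDataset([concept_a_ds, concept_b_ds]), steps, loss_fn)

def sequential_training(model, concept_a_ds, concept_b_ds, steps, loss_fn):
    # Baseline: fine-tune on the first concept, then continue on the second;
    # the second stage can degrade the first concept (see Moongate + Dog).
    train_steps(model, concept_a_ds, steps, loss_fn)
    train_steps(model, concept_b_ds, steps, loss_fn)
```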