Sample images from our generated dataset (SynCD)
Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality synthetic training dataset, SynCD, consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on a shared attention mechanism that better incorporates fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues by normalizing the text and image guidance vectors. Through extensive experiments, we show that our final model, trained on the synthetic dataset and combined with the proposed inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
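To illustrate the idea behind the guidance normalization, the following is a minimal sketch (not the exact formulation from the paper): the text and image guidance directions are combined, and the combined update is rescaled so its norm matches that of the text-only guidance, preventing the two guidance terms from compounding into overexposed outputs. The function name, weights, and rescaling target are illustrative assumptions.

```python
import torch


def normalized_guidance(eps_uncond, eps_text, eps_image, w_text=7.5, w_image=3.5):
    """Hypothetical sketch of guidance-vector normalization.

    eps_* are the denoiser outputs for the unconditional, text-conditioned,
    and image(reference)-conditioned branches, shaped (B, C, H, W).
    """
    g_text = eps_text - eps_uncond    # text guidance direction
    g_image = eps_image - eps_uncond  # image (reference) guidance direction
    combined = w_text * g_text + w_image * g_image

    # Rescale so the combined guidance has the same norm as text-only guidance
    # (assumed target; the paper's exact normalization may differ).
    target_norm = (w_text * g_text).flatten(1).norm(dim=1, keepdim=True)
    cur_norm = combined.flatten(1).norm(dim=1, keepdim=True).clamp(min=1e-8)
    combined = combined * (target_norm / cur_norm).view(-1, 1, 1, 1)
    return eps_uncond + combined
```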
Results with a single reference image as input. Our method successfully incorporates the text prompt while preserving object identity on par with or better than the baseline methods. We pick the best of 4 generated images for all methods.
Results with three reference images as input. Our method successfully incorporates the text prompt while preserving object identity on par with or better than the baseline methods. We pick the best of 4 generated images for all methods.
Our dataset generation pipeline is tailored to (a) deformable categories, where we use descriptive prompts and Masked Shared Attention (MSA) among the foreground object regions of the images to promote visual consistency, and (b) rigid object categories, where we additionally employ depth and cross-view warping using existing Objaverse assets to ensure 3D multiview consistency. We further use DINOv2 and aesthetic scores to filter out low-quality images and create our final training dataset; a sketch of this filtering step is shown below.
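A minimal sketch of the filtering pass under stated assumptions: each generated set is kept only if all image pairs share high DINOv2 feature similarity (same object identity) and every image clears an aesthetic-score threshold. The callables `dino_embed` and `aesthetic_score`, as well as the thresholds, are hypothetical placeholders, not the exact values used for SynCD.

```python
import torch
import torch.nn.functional as F


def filter_generated_sets(image_sets, dino_embed, aesthetic_score,
                          sim_thresh=0.7, aes_thresh=5.5):
    """Keep an image set only if all pairs are DINOv2-similar and all images
    pass the aesthetic-score threshold. Thresholds are illustrative."""
    kept = []
    for images in image_sets:
        # Hypothetical dino_embed(img) -> (D,) feature vector per image
        feats = torch.stack([dino_embed(img) for img in images])
        feats = F.normalize(feats, dim=-1)
        sims = feats @ feats.T  # pairwise cosine similarity, (N, N)

        n = len(images)
        off_diag = sims[~torch.eye(n, dtype=torch.bool)]
        identity_ok = off_diag.min() >= sim_thresh
        aesthetic_ok = all(aesthetic_score(img) >= aes_thresh for img in images)

        if identity_ok and aesthetic_ok:
            kept.append(images)
    return kept
```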
We finetune a pre-trained IP-Adapter based model (global feature injection) on our generated dataset (SynCD). During training, we additionally employ Masked Shared Attention (MSA) between target and reference image features (fine-grained feature injection), as sketched below. This helps the model incorporate more fine-grained features from multiple reference images during inference.
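The following is a minimal sketch of the masked shared attention idea, assuming a standard multi-head attention layout: target-image queries attend jointly to their own keys/values and to reference-image keys/values, while reference tokens outside the foreground object mask are masked out. Tensor shapes and the masking scheme are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F


def masked_shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, ref_fg_mask):
    """Hypothetical sketch of Masked Shared Attention (MSA).

    q_tgt, k_tgt, v_tgt: (B, H, N_tgt, D) target-image projections.
    k_ref, v_ref:        (B, H, N_ref, D) reference-image projections
                         (multiple references concatenated along N_ref).
    ref_fg_mask:         (B, N_ref) boolean, True for foreground object tokens.
    """
    # Target queries see both target and (foreground) reference tokens.
    k = torch.cat([k_tgt, k_ref], dim=2)
    v = torch.cat([v_tgt, v_ref], dim=2)

    # Additive attention bias: 0 for allowed tokens, -inf for background
    # reference tokens, so only the foreground object is shared.
    n_tgt, n_ref = k_tgt.shape[2], k_ref.shape[2]
    bias = q_tgt.new_zeros(q_tgt.shape[0], 1, 1, n_tgt + n_ref)
    bias[..., n_tgt:] = torch.where(
        ref_fg_mask[:, None, None, :], 0.0, float("-inf")
    )
    return F.scaled_dot_product_attention(q_tgt, k, v, attn_mask=bias)
```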
@article{kumari2025syncd,
title={Generating Multi-Image Synthetic Data for Text-to-Image Customization},
author={Kumari, Nupur and Yin, Xi and Zhu, Jun-Yan and Misra, Ishan and Azadi, Samaneh},
journal={},
year={2025}
}