Sample generated dataset
Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
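Below is a minimal sketch of the kind of guidance normalization mentioned above, assuming a dual (text + image) classifier-free guidance setup. The rescaling rule, guidance scales, and function name are illustrative assumptions, not necessarily the exact formulation used in the paper.

import torch

def normalized_dual_guidance(eps_uncond, eps_text, eps_img,
                             s_text=7.5, s_img=3.0):
    """Classifier-free guidance with text and image conditions, where the
    guidance vectors are rescaled to curb overexposure.

    Hypothetical sketch: the exact normalization used by SynCD may differ.
    """
    # Guidance directions for each condition
    g_text = eps_text - eps_uncond
    g_img = eps_img - eps_uncond

    # Illustrative normalization: shrink each guidance vector so its
    # per-sample norm does not exceed that of the unconditional prediction.
    dims = tuple(range(1, eps_uncond.ndim))
    base_norm = eps_uncond.norm(dim=dims, keepdim=True)
    g_text = g_text * (base_norm / (g_text.norm(dim=dims, keepdim=True) + 1e-8)).clamp(max=1.0)
    g_img = g_img * (base_norm / (g_img.norm(dim=dims, keepdim=True) + 1e-8)).clamp(max=1.0)

    return eps_uncond + s_text * g_text + s_img * g_img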
Results with one (top) and three reference images (bottom) as input. Our method successfully incorporates the text prompt while preserving object identity as well as or better than the baseline methods. We pick the best of four sampled images for all methods.
Our dataset generation pipeline is tailored for (a) deformable categories, where we use descriptive prompts and Masked Shared Attention (MSA) among the foreground object regions of the images to promote consistent object identity, and (b) rigid object categories, where we additionally employ depth conditioning and cross-view feature warping using existing Objaverse assets to ensure 3D multiview consistency. We further use DINOv2 features and an aesthetic score to filter out low-quality images and create our final training dataset. The Masked Shared Attention (MSA) and feature-warping mechanisms are shown in the figure below.
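A hypothetical sketch of the filtering step described above, assuming pairwise DINOv2 feature similarity across the images of a set plus a per-image aesthetic predictor. The thresholds and the aesthetic_score callable are placeholders, not the exact criteria used to build SynCD.

import itertools
import torch

# DINOv2 backbone from the official torch.hub entry point
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def dino_similarity(images):
    """Mean pairwise cosine similarity of DINOv2 features.
    images: (N, 3, 224, 224), ImageNet-normalized crops of the shared object."""
    feats = dinov2(images)                                   # (N, D) CLS features
    feats = torch.nn.functional.normalize(feats, dim=-1)
    pairs = itertools.combinations(range(len(feats)), 2)
    sims = [feats[i] @ feats[j] for i, j in pairs]
    return torch.stack(sims).mean().item()

def keep_image_set(images, aesthetic_score, sim_thresh=0.5, aes_thresh=5.0):
    """Keep a generated image set only if the object looks consistent across
    views and every image passes the (placeholder) aesthetic threshold."""
    if dino_similarity(images) < sim_thresh:
        return False
    return all(aesthetic_score(img) >= aes_thresh for img in images)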
For rigid objects, we first warp corresponding features from the first image to the others. Then, each image's features attend to themselves and to the foreground object features of the other images. We show an example mask, M1, used to enforce this for the first image when generating two images of the same object.
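A minimal single-head sketch of the Masked Shared Attention described above. Tensor shapes, the mask layout, and the omission of feature warping are simplifying assumptions, not the exact implementation inside the diffusion model.

import torch

def masked_shared_attention(q, k, v, fg_masks):
    """Masked Shared Attention across jointly generated images.

    q, k, v: (B, N, L, D) -- B image sets, N images per set, L tokens each.
    fg_masks: (B, N, L) boolean, True for foreground-object tokens.
    Each image's queries attend to all of its own tokens and to the
    foreground tokens of the other images in the set.
    """
    B, N, L, D = q.shape
    q = q.reshape(B, N * L, D)
    k = k.reshape(B, N * L, D)
    v = v.reshape(B, N * L, D)

    # Allow query i -> key j if both tokens come from the same image, or if
    # the key token is a foreground token of another image.
    same_image = torch.eye(N, dtype=torch.bool)
    same_image = same_image.repeat_interleave(L, 0).repeat_interleave(L, 1)
    fg_keys = fg_masks.reshape(B, 1, N * L).expand(B, N * L, N * L)
    allow = same_image.unsqueeze(0) | fg_keys

    attn = (q @ k.transpose(-2, -1)) / D ** 0.5
    attn = attn.masked_fill(~allow, float("-inf")).softmax(dim=-1)
    return (attn @ v).reshape(B, N, L, D)

The block-diagonal part of the mask keeps full self-attention within each image, while the off-diagonal entries are restricted to foreground tokens, which is what ties the object identity together across the jointly generated views.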
We finetune a pre-trained text-to-image model on our generated dataset. During training, we employ Shared Attention between the target and reference image features, similar to the dataset generation pipeline. This helps the model incorporate fine-grained features from multiple reference images.
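A hedged sketch of shared attention during fine-tuning, assuming the target image's queries attend to its own keys and values concatenated with those of the reference images. The function name, shapes, and interface are illustrative assumptions rather than the model's actual code.

import torch
import torch.nn.functional as F

def shared_attention_with_references(q_tgt, kv_tgt, kv_refs):
    """Shared attention for fine-tuning: target queries attend to the
    target's own tokens plus tokens from every reference image.

    q_tgt:   (B, L, D) target queries
    kv_tgt:  tuple (k, v), each (B, L, D), from the target image
    kv_refs: list of (k, v) tuples, one per reference image, each (B, L, D)
    """
    k_tgt, v_tgt = kv_tgt
    k = torch.cat([k_tgt] + [k for k, _ in kv_refs], dim=1)
    v = torch.cat([v_tgt] + [v for _, v in kv_refs], dim=1)
    # In practice this would replace self-attention inside the denoiser's
    # transformer blocks; here it is a standalone single-head example.
    return F.scaled_dot_product_attention(q_tgt, k, v)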
@inproceedings{kumari2025syncd,
title={Generating Multi-Image Synthetic Data for Text-to-Image Customization},
author={Kumari, Nupur and Yin, Xi and Zhu, Jun-Yan and Misra, Ishan and Azadi, Samaneh},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}