Generating Multi-Image Synthetic Data for Text-to-Image Customization

1 CMU, 2 Meta
ArXiv 2025

Sample generated dataset

We propose a data generation pipeline for image customization that produces multiple images of the same object in different contexts. Our pipeline promotes consistent object identity using either explicit 3D object assets or, more implicitly, masked shared attention across different views. Given this training data, we train a new encoder-based model for the task, which can successfully generate new compositions of a reference object using text prompts.

Abstract

Customization of text-to-image models enables users to insert custom concepts and generate them in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality synthetic training dataset, SynCD, consisting of multiple images of the same object under different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporates fine-grained visual details from the input images. Finally, we propose a new inference technique that mitigates overexposure during inference by normalizing the text and image guidance vectors. Through extensive experiments, we show that our final model, trained on the synthetic dataset and combined with the proposed inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
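The guidance normalization at inference time can be illustrated with a short sketch. The code below shows one plausible reading of dual classifier-free guidance in which the image-guidance vector is rescaled to the norm of the text-guidance vector before being applied; the function name, guidance weights, and exact normalization are illustrative assumptions rather than the paper's verbatim algorithm.

import torch

def guided_noise_prediction(eps_uncond, eps_text, eps_both, w_text=7.5, w_image=3.5):
    # eps_uncond: model prediction with no conditioning.
    # eps_text:   prediction with text conditioning only.
    # eps_both:   prediction with text + reference-image conditioning.
    g_text = eps_text - eps_uncond        # text guidance vector
    g_image = eps_both - eps_text         # image guidance vector

    def per_sample_norm(x):
        # L2 norm over all non-batch dimensions, kept broadcastable.
        return x.flatten(1).norm(dim=1).clamp(min=1e-8).view(-1, *([1] * (x.dim() - 1)))

    # Rescale the image-guidance vector to the text-guidance norm, which
    # limits the overexposed, washed-out look of strong image guidance.
    g_image = g_image * (per_sample_norm(g_text) / per_sample_norm(g_image))
    return eps_uncond + w_text * g_text + w_image * g_image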

Results

Qualitative comparison


Results with a single reference image as input. Our method successfully incorporates the text prompt while preserving object identity as well as or better than the baseline methods. We pick the best of 4 generated images for all methods.



Results with three reference images as input. Our method successfully incorporates the text prompt while preserving object identity as well as or better than the baseline methods. We pick the best of 4 generated images for all methods.

Synthetic Customization Data (SynCD) Overview


Our dataset generation pipeline is tailored to (a) deformable categories, where we use descriptive prompts and Masked Shared Attention (MSA) among the foreground object regions of the images to promote visual consistency, and (b) rigid object categories, where we additionally employ depth conditioning and cross-view warping using existing Objaverse assets to ensure 3D multi-view consistency. We further use DINOv2 features and an aesthetic score to filter out low-quality images when creating our final training dataset.
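As a concrete illustration of the filtering step, the sketch below scores a candidate set of generated views with DINOv2 features from torch.hub and an aesthetic predictor supplied by the caller. The keep_image_set helper, the thresholds, and the aesthetic_score callable are hypothetical placeholders; the paper's exact filtering criteria may differ.

import torch
import torchvision.transforms as T

# DINOv2 backbone from torch.hub, used to score identity similarity
# between the jointly generated views of one object.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def keep_image_set(images, aesthetic_score, sim_thresh=0.6, aesthetic_thresh=5.0):
    # images: list of PIL images showing the same object in different contexts.
    feats = dinov2(torch.stack([preprocess(im) for im in images]))   # (N, D) global features
    feats = torch.nn.functional.normalize(feats, dim=-1)
    sims = feats @ feats.T                                           # pairwise cosine similarity
    off_diag = sims[~torch.eye(len(images), dtype=torch.bool)]
    if off_diag.mean() < sim_thresh:
        return False                                                 # inconsistent object identity
    if min(aesthetic_score(im) for im in images) < aesthetic_thresh:
        return False                                                 # low visual quality
    return True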

Method Overview


We finetune a pre-trained IP-Adapter-based model (global feature injection) on our generated dataset (SynCD). During training, we additionally employ Masked Shared Attention (MSA) between target and reference image features (fine-grained feature injection). This helps the model incorporate more fine-grained features from multiple reference images during inference.
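A minimal sketch of such masked shared attention is given below, assuming target queries attend to their own tokens plus only the foreground tokens of the reference image(s) inside a self-attention layer. The function name, shapes, and layer placement are illustrative assumptions, not the exact implementation.

import torch
import torch.nn.functional as F

def masked_shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, ref_fg_mask):
    # q_tgt, k_tgt, v_tgt: (B, T, D) target-image tokens in a self-attention layer.
    # k_ref, v_ref:        (B, R, D) reference-image tokens (one or more references).
    # ref_fg_mask:         (B, R) bool, True on foreground (object) tokens.
    k = torch.cat([k_tgt, k_ref], dim=1)                 # (B, T+R, D)
    v = torch.cat([v_tgt, v_ref], dim=1)
    # Target tokens are always attendable; reference tokens only if foreground,
    # so fine-grained object details are shared without copying backgrounds.
    tgt_allow = torch.ones(k_tgt.shape[:2], dtype=torch.bool, device=q_tgt.device)
    allow = torch.cat([tgt_allow, ref_fg_mask], dim=1)   # (B, T+R)
    return F.scaled_dot_product_attention(q_tgt, k, v, attn_mask=allow[:, None, :])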

BibTeX

@article{kumari2025syncd,
  title={Generating Multi-Image Synthetic Data for Text-to-Image Customization},
  author={Kumari, Nupur and Yin, Xi and Zhu, Jun-Yan and Misra, Ishan and Azadi, Samaneh},
  journal={},
  year={2025}
}

Acknowledgements

We thank Kangle Deng, Gaurav Parmar, and Maxwell Jones for their helpful comments and discussions, and Ruihan Gao and Ava Pun for proofreading the draft. This work was partly done by Nupur Kumari during her internship at Meta. The project was partly supported by the Packard Fellowship, the National AI Research Lab (South Korea), NSF IIS-2239076, and NSF ISS-2403303.