Sample generated dataset
Customization of text-to-image models enables users to insert new concepts or objects and generate them in unseen settings. Existing methods either rely on comparatively expensive test-time optimization or train encoders on single-image datasets without multi-image supervision, which can limit image quality. We propose a simple approach to address these challenges. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. Using this dataset, we train an encoder-based model that incorporates fine-grained visual details from reference images via a shared attention mechanism. Finally, we propose an inference technique that normalizes text and image guidance vectors to mitigate overexposure issues in sampled images. Through extensive experiments, we show that our encoder-based model, trained on SynCD, and with the proposed inference algorithm, improves upon existing encoder-based methods on standard customization benchmarks.
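Below is a minimal sketch of the kind of guidance normalization mentioned above, assuming a dual (text + image) classifier-free guidance setup. The rescaling rule, guidance scales, and function name are illustrative assumptions, not necessarily the exact formulation used in the paper.

import torch

def normalized_dual_guidance(eps_uncond, eps_text, eps_img,
                             s_text=7.5, s_img=3.0):
    """Classifier-free guidance with text and image conditions, where the
    guidance vectors are rescaled to curb overexposure.

    Hypothetical sketch: the exact normalization used by SynCD may differ.
    """
    # Guidance directions for each condition
    g_text = eps_text - eps_uncond
    g_img = eps_img - eps_uncond

    # Illustrative normalization: shrink each guidance vector so its
    # per-sample norm does not exceed that of the unconditional prediction.
    dims = tuple(range(1, eps_uncond.ndim))
    base_norm = eps_uncond.norm(dim=dims, keepdim=True)
    g_text = g_text * (base_norm / (g_text.norm(dim=dims, keepdim=True) + 1e-8)).clamp(max=1.0)
    g_img = g_img * (base_norm / (g_img.norm(dim=dims, keepdim=True) + 1e-8)).clamp(max=1.0)

    return eps_uncond + s_text * g_text + s_img * g_img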
Results with one (top) and three reference images (bottom) as input. Our method successfully incorporates the text prompt while preserving object identity as well as or better than the baseline methods. We pick the best of four sampled images for all methods.
Our dataset generation pipeline is tailored for (a) deformable categories, where we use descriptive prompts and Masked Shared Attention (MSA) among the foreground object regions of the images to promote consistent object identity, and (b) rigid object categories, where we additionally employ depth conditioning and cross-view feature warping using existing Objaverse assets to ensure 3D multiview consistency. We further use DINOv2 features and an aesthetic score to filter out low-quality images and create our final training dataset. The Masked Shared Attention (MSA) and feature-warping mechanisms are shown in the figure below.
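A hypothetical sketch of the filtering step described above, assuming pairwise DINOv2 feature similarity across the images of a set plus a per-image aesthetic predictor. The thresholds and the aesthetic_score callable are placeholders, not the exact criteria used to build SynCD.

import itertools
import torch

# DINOv2 backbone from the official torch.hub entry point
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

@torch.no_grad()
def dino_similarity(images):
    """Mean pairwise cosine similarity of DINOv2 features.
    images: (N, 3, 224, 224), ImageNet-normalized crops of the shared object."""
    feats = dinov2(images)                                   # (N, D) CLS features
    feats = torch.nn.functional.normalize(feats, dim=-1)
    pairs = itertools.combinations(range(len(feats)), 2)
    sims = [feats[i] @ feats[j] for i, j in pairs]
    return torch.stack(sims).mean().item()

def keep_image_set(images, aesthetic_score, sim_thresh=0.5, aes_thresh=5.0):
    """Keep a generated image set only if the object looks consistent across
    views and every image passes the (placeholder) aesthetic threshold."""
    if dino_similarity(images) < sim_thresh:
        return False
    return all(aesthetic_score(img) >= aes_thresh for img in images)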
For rigid objects, we first warp corresponding features from the first image to the others. Then, each image's features attend to themselves and to the foreground object features of the other images. We show an example mask, M1, used to enforce this for the first image when generating two images of the same object.
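A minimal single-head sketch of the Masked Shared Attention described above. Tensor shapes, the mask layout, and the omission of feature warping are simplifying assumptions, not the exact implementation inside the diffusion model.

import torch

def masked_shared_attention(q, k, v, fg_masks):
    """Masked Shared Attention across jointly generated images.

    q, k, v: (B, N, L, D) -- B image sets, N images per set, L tokens each.
    fg_masks: (B, N, L) boolean, True for foreground-object tokens.
    Each image's queries attend to all of its own tokens and to the
    foreground tokens of the other images in the set.
    """
    B, N, L, D = q.shape
    q = q.reshape(B, N * L, D)
    k = k.reshape(B, N * L, D)
    v = v.reshape(B, N * L, D)

    # Allow query i -> key j if both tokens come from the same image, or if
    # the key token is a foreground token of another image.
    same_image = torch.eye(N, dtype=torch.bool)
    same_image = same_image.repeat_interleave(L, 0).repeat_interleave(L, 1)
    fg_keys = fg_masks.reshape(B, 1, N * L).expand(B, N * L, N * L)
    allow = same_image.unsqueeze(0) | fg_keys

    attn = (q @ k.transpose(-2, -1)) / D ** 0.5
    attn = attn.masked_fill(~allow, float("-inf")).softmax(dim=-1)
    return (attn @ v).reshape(B, N, L, D)

The block-diagonal part of the mask keeps full self-attention within each image, while the off-diagonal entries are restricted to foreground tokens, which is what ties the object identity together across the jointly generated views.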
We finetune a pre-trained text-to-image model on our generated dataset. During training, we employ Shared Attention between the target and reference image features, similar to the dataset generation pipeline. This helps the model incorporate fine-grained features from multiple reference images.
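A hedged sketch of shared attention during fine-tuning, assuming the target image's queries attend to its own keys and values concatenated with those of the reference images. The function name, shapes, and interface are illustrative assumptions rather than the model's actual code.

import torch
import torch.nn.functional as F

def shared_attention_with_references(q_tgt, kv_tgt, kv_refs):
    """Shared attention for fine-tuning: target queries attend to the
    target's own tokens plus tokens from every reference image.

    q_tgt:   (B, L, D) target queries
    kv_tgt:  tuple (k, v), each (B, L, D), from the target image
    kv_refs: list of (k, v) tuples, one per reference image, each (B, L, D)
    """
    k_tgt, v_tgt = kv_tgt
    k = torch.cat([k_tgt] + [k for k, _ in kv_refs], dim=1)
    v = torch.cat([v_tgt] + [v for _, v in kv_refs], dim=1)
    # In practice this would replace self-attention inside the denoiser's
    # transformer blocks; here it is a standalone single-head example.
    return F.scaled_dot_product_attention(q_tgt, k, v)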
@inproceedings{kumari2025syncd,
title={Generating Multi-Image Synthetic Data for Text-to-Image Customization},
author={Kumari, Nupur and Yin, Xi and Zhu, Jun-Yan and Misra, Ishan and Azadi, Samaneh},
booktitle={International Conference on Computer Vision (ICCV)},
year={2025}
}