|
|
|
|
|
|
|
|
|
|
|
![]() Training the proposed concept composition task on our RetriBooru dataset. The concepts to retrieve are specified in text and passed to the retrieval encoder, which learns to compose the output from characteristic information only, guiding generation alongside text prompts in the U-Net.
Diffusion-based methods have demonstrated remarkable capabilities in generating diverse, high-quality images, sparking interest in styled avatars, virtual try-on, and more. Previous methods use the same reference image as the target. An overlooked consequence is that the target's spatial information, style, and other attributes leak from the reference, harming the diversity of generated outputs and encouraging shortcut learning. This practice nevertheless persists because widely available datasets usually consist of single images not grouped by identity, and recollecting large-scale same-identity data is expensive. Moreover, existing metrics evaluate text alignment and identity preservation separately, and thus fail to distinguish balanced outputs from those that overfit to one aspect. In this paper, we propose RetriBooru, a multi-level same-identity dataset that groups anime characters by both face and cloth identities. RetriBooru enables using reference images with the same character and outfit as the target while keeping gestures and actions flexible. We benchmark previous methods on our dataset and demonstrate the effectiveness of training with a reference image that differs from the target but shares its identity. We introduce a new concept composition task, in which the conditioning encoder learns to retrieve different concepts from several reference images, and modify a baseline network, RetriNet, for this task. Finally, we introduce a novel class of metrics named Similarity Weighted Diversity (SWD) to measure the overlooked diversity and better evaluate the alignment between similarity and diversity.
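To make the idea behind similarity-weighted diversity concrete, here is a minimal sketch of one plausible instance of such a metric: pairwise diversity among generated samples, weighted by each sample's identity similarity to the reference, so that diverse-but-off-identity outputs score low. The function name, the cosine-similarity weighting, and the pairwise-distance choice are all illustrative assumptions, not the exact SWD formulation from the paper.

```python
import numpy as np

def similarity_weighted_diversity(gen_feats, ref_feat):
    """Illustrative SWD-style score (assumption: NOT the paper's exact formula).

    gen_feats: (N, D) embeddings of N generated images.
    ref_feat:  (D,) embedding of the reference identity.
    Diversity is mean pairwise distance among generations; each pair's
    contribution is weighted by both samples' cosine similarity to the
    reference, so diverse-but-off-identity outputs are penalized.
    """
    gen = gen_feats / np.linalg.norm(gen_feats, axis=1, keepdims=True)
    ref = ref_feat / np.linalg.norm(ref_feat)
    sim = gen @ ref                       # (N,) identity similarity in [-1, 1]
    w = np.clip(sim, 0.0, None)           # ignore off-identity samples
    diff = gen[:, None, :] - gen[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (N, N) pairwise distances
    pair_w = np.outer(w, w)
    np.fill_diagonal(pair_w, 0.0)         # exclude self-pairs
    return float((pair_w * dist).sum() / max(pair_w.sum(), 1e-8))
```

Identical generations collapse the score to zero regardless of similarity, while diverse generations only score highly when they also stay close to the reference identity.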
![]() Details of the RetriBooru dataset. Left: annotations of an individual sample. Right: distributions of the lengths of the "similar" lists, the top-15 characters by number of appearances, and the top-30 cloth tags. To train a referenced generation model for a given concept, we need a concept-labeled dataset: training with the same image as both reference and target risks leaking extra information such as style and geometric bias, leading to "shortcut" convergence. Existing datasets are limited in size and identity types, single-tasked, and may be harder to iterate on than anime (faces, for example), as shown in the table below. We use InstructBLIP-Vicuna-7B with simple yet strict heuristics to aid our annotation of clothing identities, building on the Danbooru 2019 Figures dataset. Please refer to our paper for detailed dataset processing and construction; we will release RetriBooru soon on HuggingFace🤗.
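The multi-level identity grouping is what lets training draw a reference image that differs from the target while sharing its identity. A minimal sketch of that pairing logic is below; the field names (`id`, `character`, `cloth`) are hypothetical and do not reflect RetriBooru's actual schema.

```python
import random
from collections import defaultdict

def build_identity_index(samples):
    """Group sample ids by (character, cloth) identity.

    `samples` is a list of dicts with hypothetical keys 'id',
    'character', and 'cloth' (assumed field names, not the
    dataset's real schema)."""
    index = defaultdict(list)
    for s in samples:
        index[(s["character"], s["cloth"])].append(s["id"])
    return index

def sample_pair(index, rng=random):
    """Draw a (reference, target) pair sharing an identity but using
    two different images; singleton groups are skipped."""
    groups = [ids for ids in index.values() if len(ids) >= 2]
    ref, tgt = rng.sample(rng.choice(groups), 2)
    return ref, tgt
```

Because the reference and target are distinct images of the same character and outfit, the model cannot copy the target's pose or style wholesale and must rely on identity information alone.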
|
![]() An illustration of the concept composition task on RetriBooru using RetriNet. We pass reference concepts into the retrieval encoder so that it encodes only the relevant information. We pass the noisy target image and a prompt made from Danbooru tags to a copy of the denoising U-Net. Reference images, texts, and timesteps are processed by corresponding frozen encoders, and the encoder and middle blocks of SD are frozen. We designate the embeddings derived from the target image and prompt as Query (Q), while the embeddings from the reference images and texts serve as Key and Value (KV). Each cross-attention layer is followed by a zero-convolution layer.
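The Q/KV wiring with a trailing zero convolution can be sketched as follows: queries come from the target branch, keys and values from the reference branch, and a zero-initialized projection gates the reference signal so it contributes nothing at the start of training. The shapes, single-head attention, and initialization details are illustrative assumptions, not RetriNet's exact layer configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def retrieval_cross_attention(target_tokens, ref_tokens, d_k, zero_init=True):
    """Q from the target/prompt branch, K and V from the reference
    branch; a zero-initialized 1x1 projection ('zero conv') gates the
    reference pathway.  Single-head, shapes illustrative."""
    rng = np.random.default_rng(0)
    d = target_tokens.shape[-1]
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wz = np.zeros((d_k, d)) if zero_init else rng.standard_normal((d_k, d))
    Q = target_tokens @ Wq
    K, V = ref_tokens @ Wk, ref_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d_k)) @ V
    # Residual add: the zero conv makes this an identity at initialization,
    # so the reference branch is injected gradually during training.
    return target_tokens + attn @ Wz
```

At initialization the layer passes the target features through unchanged, which stabilizes fine-tuning when the new reference pathway is attached to a pretrained U-Net.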
|