Retrieving Conditions from Reference Images for Diffusion Models
Haoran Tang12†
Xin Zhou1†
Jieren Deng1
Zhihong Pan1
Hao Tian1
Pratik Chaudhari2
1Baidu USA
2University of Pennsylvania
†Equal Contribution

[Arxiv]
[GitHub]
[Bibtex]


An illustration of the proposed task of concept composition. Face identity information is retrieved and composed with clothing information.

Abstract

Recently developed diffusion-based techniques have shown a remarkable ability to produce a wide range of high-quality images, sparking considerable interest in various applications. A prevalent scenario is generating new images conditioned on a subject taken from reference images: the subject could be a face identity for styled avatars, a body and clothing for virtual try-on, and so on. Meeting this requirement is evolving into a field called Subject-Driven Generation. In this paper, we consider Subject-Driven Generation as a unified retrieval problem with diffusion models. We introduce a novel diffusion model architecture, named RetriNet, designed to solve these problems by precisely retrieving subject attributes from reference images while filtering out irrelevant information. RetriNet demonstrates impressive performance compared to existing state-of-the-art approaches on face generation. We further propose a research- and iteration-friendly dataset, RetriBooru, to study a more difficult problem, concept composition. Finally, to better evaluate the alignment between similarity and diversity, and to measure diversity itself, both previously unaccounted for, we introduce a novel class of metrics named Similarity Weighted Diversity (SWD).

RetriNet Architecture



RetriNet architecture. At each timestep, we pass the reference images and reference concepts to the retrieval encoder, which encodes only the relevant information. We pass the noisy target image and a prompt composed directly from Danbooru tags to a copy of the standard Stable Diffusion UNet. Reference images, text, and timesteps are processed by their corresponding frozen encoders, and the encoder and middle blocks of the SD UNet are frozen. The layer outputs of the retrieval encoder are connected to the SD decoder through our conjunction network, which consists of cross-attentions and zero-convolutions.
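The sketch below illustrates the kind of bridge the conjunction network describes: a cross-attention layer that lets SD decoder features attend to retrieval-encoder features, followed by a zero-initialized 1x1 convolution so the frozen decoder is unchanged at the start of training. It is a minimal PyTorch sketch with hypothetical module and argument names, not the exact RetriNet implementation.

```python
# Minimal sketch of one conjunction block (hypothetical names and shapes;
# the real RetriNet layers follow Stable Diffusion's UNet feature sizes).
import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, ControlNet-style."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ConjunctionBlock(nn.Module):
    """Injects retrieval-encoder features into one SD decoder level via
    cross-attention followed by a zero-initialized convolution."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        # channels must be divisible by num_heads
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(channels)
        self.norm_kv = nn.LayerNorm(channels)
        self.zero_out = zero_conv(channels)

    def forward(self, decoder_feat: torch.Tensor, retrieval_feat: torch.Tensor):
        # decoder_feat:   (B, C, H, W)  features on the SD decoder path
        # retrieval_feat: (B, C, H', W') features from the retrieval encoder
        b, c, h, w = decoder_feat.shape
        q = self.norm_q(decoder_feat.flatten(2).transpose(1, 2))      # (B, HW, C)
        kv = self.norm_kv(retrieval_feat.flatten(2).transpose(1, 2))  # (B, H'W', C)
        attended, _ = self.attn(q, kv, kv)                            # (B, HW, C)
        attended = attended.transpose(1, 2).reshape(b, c, h, w)
        # The zero-conv makes the bridge a no-op at initialization, so training
        # starts from the unmodified Stable Diffusion decoder.
        return decoder_feat + self.zero_out(attended)


# Usage: at initialization the output equals the decoder features exactly.
block = ConjunctionBlock(channels=320)
dec = torch.randn(2, 320, 32, 32)
ret = torch.randn(2, 320, 16, 16)
out = block(dec, ret)  # (2, 320, 32, 32)
```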

RetriBooru Dataset

To train a referenced generation model for a given concept, we need a concept-labeled dataset: training with the same image as both the reference and the target risks leaking extra information such as style. As the table below shows, existing datasets are limited in size and in the types of identities they cover, support only a single task, and can be harder to iterate on than anime data (faces, for example). We use InstructBLIP-Vicuna-7B with simple yet strict heuristics to assist our annotation of clothing identities, building on the Danbooru 2019 Figures dataset; a sketch of this VQA-assisted annotation step follows the table. Please refer to our paper for the detailed dataset processing and construction. We will release RetriBooru soon on HuggingFace🤗.
Comparison with datasets used in other literature.

| Dataset          | Size  | Category          | Multi-images per identity | Concept composition | Data source      |
|------------------|-------|-------------------|---------------------------|---------------------|------------------|
| DreamBooth       | ≤ 180 | Objects           |                           |                     | Web              |
| BLIP-Diffusion   | 129M  | Objects           |                           |                     | Datasets Mixture |
| FastComposer     | 70K   | Human Faces       |                           |                     | FFHQ-wild        |
| CustomConcept101 | 642   | Objects and Faces |                           |                     | Web              |
| Ours             | 116K  | Anime Figures     | ✓                         | ✓                   | Danbooru         |
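Below is a hedged sketch of the VQA-assisted annotation step mentioned above, assuming the public HuggingFace InstructBLIP-Vicuna-7B checkpoint. The question wording and the downstream matching heuristics are illustrative placeholders; the exact procedure is described in the paper.

```python
# Sketch of VQA-assisted clothing annotation (illustrative, not the exact pipeline).
import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")
model.eval()


def describe_clothing(image_path: str) -> str:
    """Ask the VQA model what a figure is wearing; downstream heuristics
    (e.g. keyword matching against Danbooru clothing tags) then decide
    whether two figures of the same character share a clothing identity."""
    image = Image.open(image_path).convert("RGB")
    question = "What is the character wearing? Answer with a short phrase."
    inputs = processor(images=image, text=question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=32)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```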

Similarity Weighted Diversity

We provide a class of metrics that combines similarity and diversity metrics, and we argue that this combination is crucial for referenced generation. In short, we encourage generations to have maximal diversity in facial and body details while preserving identity; see our paper for the detailed formulations. In practice, we evaluate diversity by asking a VQA model (InstructBLIP-Vicuna-7B) about image details, and we evaluate similarity with CLIP scores, adding DeepFace-L2 when human faces are evaluated. Our metric will also benefit from continued improvements in the precision of similarity and diversity measurements.
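As an illustration of the idea (not the exact formulation from the paper), the sketch below weights each generated image's diversity score by its similarity to the reference, so that diverse generations only count when identity is preserved. The score names and the aggregation rule are assumptions for this example.

```python
# Illustrative similarity-weighted aggregation; the precise SWD definition is in the paper.
import numpy as np


def similarity_weighted_diversity(similarity: np.ndarray, diversity: np.ndarray) -> float:
    """similarity[i]: reference-alignment score for generation i (e.g. CLIP or DeepFace).
    diversity[i]: per-image diversity score (e.g. fraction of distinct VQA answers
    about facial and body details). Diversity is weighted by similarity, so
    off-identity samples contribute little."""
    similarity = np.clip(similarity, 0.0, 1.0)
    return float(np.sum(similarity * diversity) / (np.sum(similarity) + 1e-8))


# Example: the third sample is diverse but off-identity, so it barely helps.
sim = np.array([0.9, 0.85, 0.2])
div = np.array([0.6, 0.7, 0.9])
print(similarity_weighted_diversity(sim, div))  # ≈ 0.67
```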

Example generations given prompts and reference images.




This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.