ITI-GEN: Inclusive Text-to-Image Generation

Text-to-image generative models often reflect the biases of the training data, leading to unequal representations of underrepresented groups. This study investigates inclusive text-to-image generative models that generate images based on human-written prompts and ensure the resulting images are uniformly distributed across attributes of interest. Unfortunately, directly expressing the desired attributes in the prompt often leads to sub-optimal results due to linguistic ambiguity or model misrepresentation.

Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We show that, for some attributes, images can represent concepts more expressively than text. For instance, categories of skin tones are typically hard to specify by text but can be easily represented by example images. Building upon these insights, we propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration. The key idea is learning a set of prompt embeddings to generate images that can effectively represent all desired attribute categories. More importantly, ITI-GEN requires no model fine-tuning, making it computationally efficient to augment existing text-to-image models. Extensive experiments demonstrate that ITI-GEN largely improves over state-of-the-art models to generate inclusive images from a prompt.

Framework of ITI-GEN

Illustration of Inclusive Text-to-Image Generation with the example of two binary attributes: perceived gender and skin tone. (a) Given an input prompt, (b) ITI-GEN learns discriminative token embeddings to represent each category of every target attribute. (c) By injecting the learned tokens after the original input prompt, ITI-GEN synthesizes an inclusive prompt set that can be used to (d) sample equal (or controllable) numbers of images for any category combination. Further, our framework can be easily extended to multi-category multi-attribute scenarios of inclusive text-to-image generation. Note that, in practice, multi-category skin tones beyond {“light”, “dark”} as in this example may be challenging to specify with language.
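As a rough illustration of steps (c) and (d), the inclusive prompt set can be viewed as the Cartesian product of the categories of each attribute, with an image budget split evenly across the combinations. The sketch below is not the authors' implementation. The attribute names, the category labels, and the helpers `build_inclusive_prompt_set` and `allocate_uniform` are hypothetical, and the string labels are only placeholders for the learned token embeddings, which are continuous vectors rather than words.

```python
from itertools import product


def build_inclusive_prompt_set(prompt, attributes):
    """Pair the original prompt with every category combination.

    `attributes` maps an attribute name to its list of category labels.
    Each label stands in for a learned token embedding (assumption).
    """
    names = list(attributes)
    combos = product(*(attributes[name] for name in names))
    return [
        {"prompt": prompt, "categories": dict(zip(names, combo))}
        for combo in combos
    ]


def allocate_uniform(num_images, num_combinations):
    """Split an image budget as evenly as possible across combinations."""
    base, extra = divmod(num_images, num_combinations)
    return [base + (1 if i < extra else 0) for i in range(num_combinations)]


# Two binary attributes, as in the figure; labels are placeholders.
attributes = {
    "perceived_gender": ["category_0", "category_1"],
    "skin_tone": ["category_0", "category_1"],
}
prompt_set = build_inclusive_prompt_set("a headshot of a person", attributes)
counts = allocate_uniform(10, len(prompt_set))

for entry, count in zip(prompt_set, counts):
    print(count, entry["categories"])
```

Replacing `allocate_uniform` with any other allocation would give the controllable (non-uniform) sampling mentioned in the caption.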

How to Learn Inclusive Tokens with Image Guidance

Translating visual differences into text embedding differences. Given reference images of a multi-category attribute (e.g., skin tone), we learn the inclusive tokens by direction alignment between images and prompts, ensuring that the visual difference matches the learned language description. In addition, we propose a semantic consistency loss to address language drift.
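The two objectives described above can be sketched on toy vectors. This is a minimal illustration under stated assumptions, not the paper's exact formulation. It assumes image and prompt embeddings live in a shared space (e.g., from a pretrained image-text encoder), takes the image direction between two categories as the difference of their mean reference-image embeddings, and uses a hinge on cosine similarity for semantic consistency. The function names and the `margin` value are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    return dot(normalize(a), normalize(b))


def mean_vec(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def direction_alignment_loss(image_embs_by_category, prompt_embs):
    """Sum of (1 - cosine) between image and prompt directions.

    image_embs_by_category[i] holds reference-image embeddings of category i;
    prompt_embs[i] is the embedding of the inclusive prompt for category i.
    """
    means = [mean_vec(embs) for embs in image_embs_by_category]
    loss = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            image_dir = [a - b for a, b in zip(means[i], means[j])]
            prompt_dir = [a - b for a, b in zip(prompt_embs[i], prompt_embs[j])]
            loss += 1.0 - cosine(image_dir, prompt_dir)
    return loss


def semantic_consistency_loss(prompt_emb, original_emb, margin=0.8):
    """Penalize an inclusive prompt that drifts away from the original prompt."""
    return max(0.0, margin - cosine(prompt_emb, original_emb))


# Toy 2-D embeddings for one binary attribute.
images = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
aligned = direction_alignment_loss(images, [[2.0, 1.0], [1.0, 2.0]])
flipped = direction_alignment_loss(images, [[1.0, 2.0], [2.0, 1.0]])
print(aligned, flipped)
```

In an actual training loop the prompt embeddings would depend on the learnable tokens, and both terms would be minimized with a gradient-based optimizer while the text-to-image model stays frozen.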

Other Domains: Scene Images

Besides human faces, we apply ITI-GEN to another domain: scene images. We argue that inclusive text-to-image generation should account for attributes not only of humans but also of scenes, objects, and even environmental factors.

BibTeX

@inproceedings{zhang2023inclusive,
    title={{ITI-GEN}: Inclusive Text-to-Image Generation},
    author={Zhang, Cheng and Chen, Xuanbai and Chai, Siqi and Wu, Henry Chen and Lagun, Dmitry and 
            Beeler, Thabo and De la Torre, Fernando},
    booktitle={ICCV},
    year={2023},
}

ITI-GEN: Inclusive Text-to-Image Generation

ICCV 2023

Oral Presentation, Best Paper Finalist

Given a human-written prompt ("a headshot of a person"), existing text-to-image models struggle to synthesize pictures representing minority groups (e.g., people with eyeglasses in this example).

We address this problem by using reference images for inclusive text-to-image generation.

Abstract

Framework of ITI-GEN

How to Learn Inclusive Tokens with Image Guidance

Inclusive Generation with Single or Multiple Attributes

Single Attribute

Multiple Attributes

Quantitative Results

Generalizability to Other Models

Compatibility with ControlNet

Compatibility with InstructPix2Pix

Other Domains: Scene Images

BibTeX
