ITI-GEN: Inclusive Text-to-Image Generation

¹Carnegie Mellon University          ²Google

ICCV 2023

Oral Presentation, Best Paper Finalist

Given a human-written prompt ("a headshot of a person"), existing text-to-image models can hardly synthesize pictures representing minority groups (e.g., people with eyeglasses in this example).

We address this problem by utilizing some reference images for inclusive text-to-image generation.


Text-to-image generative models often reflect the biases of the training data, leading to unequal representations of underrepresented groups. This study investigates inclusive text-to-image generative models that generate images based on human-written prompts and ensure the resulting images are uniformly distributed across attributes of interest. Unfortunately, directly expressing the desired attributes in the prompt often leads to sub-optimal results due to linguistic ambiguity or model misrepresentation.

Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We show that, for some attributes, images can represent concepts more expressively than text. For instance, categories of skin tones are typically hard to specify by text but can be easily represented by example images. Building upon these insights, we propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration. The key idea is learning a set of prompt embeddings to generate images that can effectively represent all desired attribute categories. More importantly, ITI-GEN requires no model fine-tuning, making it computationally efficient to augment existing text-to-image models. Extensive experiments demonstrate that ITI-GEN substantially improves over state-of-the-art models in generating inclusive images from a prompt.

Framework of ITI-GEN

Illustration of Inclusive Text-to-Image Generation with the example of two binary attributes: perceived gender and skin tone. (a) Given an input prompt, (b) ITI-GEN learns discriminative token embeddings to represent each category of every target attribute. (c) By injecting the learned tokens after the original input prompt, ITI-GEN synthesizes an inclusive prompt set that can be used to (d) sample equal (or controllable) numbers of images for any category combination. Further, our framework can be easily extended to multi-category multi-attribute scenarios of inclusive text-to-image generation. Note that, in practice, multi-category skin tones beyond {“light”, “dark”} as in this example may be challenging to specify with language.
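As a rough illustration of steps (c) and (d), the sketch below enumerates every category combination and splits an image budget evenly across them. It is not the authors' implementation: the names `build_inclusive_prompt_set` and `allocate_uniform`, the attribute keys, and the placeholder category labels are all hypothetical, and the real method works with learned token embeddings rather than the symbolic labels used here.

```python
from itertools import product


def build_inclusive_prompt_set(attributes):
    """Return one dict per category combination (Cartesian product).

    `attributes` maps a hypothetical attribute name to its list of
    category labels. In the real method each label would stand for a
    learned token embedding appended after the input prompt.
    """
    names = list(attributes)
    combos = product(*(attributes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def allocate_uniform(num_combinations, num_images):
    """Split `num_images` as evenly as possible across combinations."""
    base, extra = divmod(num_images, num_combinations)
    # The first `extra` combinations receive one additional image.
    return [base + (1 if i < extra else 0) for i in range(num_combinations)]


# Two binary attributes, as in the illustration (labels are placeholders).
attributes = {
    "perceived_gender": ["category_0", "category_1"],
    "skin_tone": ["category_0", "category_1"],
}
prompt_set = build_inclusive_prompt_set(attributes)
counts = allocate_uniform(len(prompt_set), 10)
```

With two binary attributes the prompt set has four entries. Adding a third attribute with six categories would give 24, which is why the extension to multi-category, multi-attribute settings follows directly from the product.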

How to Learn Inclusive Tokens with Image Guidance

Translating visual differences into text embedding differences. Given reference images of a multi-category attribute (e.g., skin tone), we learn the inclusive tokens by direction alignment between images and prompts, ensuring that the visual difference matches the learned language description. In addition, we propose a semantic consistency loss to address language drift.
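One way to picture direction alignment is as a cosine-based penalty between two difference vectors: the difference of mean image embeddings for two categories, and the difference of the corresponding prompt embeddings. The sketch below assumes that form; this page does not give the exact loss, so the formula, the function names, and the toy two-dimensional vectors are all assumptions for illustration only.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def direction_alignment_loss(image_embs_a, image_embs_b, prompt_emb_a, prompt_emb_b):
    """Assumed form: 1 - cos(image direction, prompt direction).

    The image direction is the difference between the mean reference-image
    embeddings of categories a and b; the prompt direction is the
    difference between the two prompt embeddings.
    """
    image_direction = subtract(mean_vector(image_embs_a), mean_vector(image_embs_b))
    prompt_direction = subtract(prompt_emb_a, prompt_emb_b)
    return 1.0 - cosine_similarity(image_direction, prompt_direction)


# Toy embeddings: the prompt difference points the same way as the image difference.
aligned = direction_alignment_loss(
    [[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]], [2.0, 0.0], [0.0, 2.0]
)
# Swapping the prompt embeddings reverses the prompt direction.
opposed = direction_alignment_loss(
    [[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]], [0.0, 2.0], [2.0, 0.0]
)
```

Under this assumed form the loss is near zero when the two directions agree and grows as they diverge, so minimizing it pushes the prompt embeddings apart in the same direction the reference images differ.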

Inclusive Generation with Single or Multiple Attributes

Single Attribute

Given a human-written prompt and the target attribute, ITI-GEN is capable of controlling the distribution of the generated data.

Multiple Attributes

Results of the combination of four binary attributes. The input prompt is “a headshot of a person”. ITI-GEN can inclusively generate images with all attribute combinations. Images across each tuple are sampled using the same random seed.

Quantitative Results

Comparison with baseline methods with (a) single attribute and (b) multiple attributes. ITI-GEN achieves competitive results for both settings. Here, Distribution Discrepancy (KL) is computed against the uniform distribution.
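For readers unfamiliar with the metric, the following is a minimal sketch of a KL divergence between an empirical category distribution and the uniform distribution. It assumes the direction KL(p || uniform) and the natural logarithm, neither of which is stated on this page, and the function name `kl_to_uniform` and the example counts are hypothetical.

```python
import math


def kl_to_uniform(counts):
    """KL(p || u), where p is the empirical distribution from `counts`
    and u is uniform over the same categories (natural log).

    Categories with zero count contribute nothing, following the
    convention 0 * log(0) = 0.
    """
    total = sum(counts)
    num_categories = len(counts)
    kl = 0.0
    for count in counts:
        if count == 0:
            continue
        p = count / total
        # p / (1 / K) simplifies to p * K.
        kl += p * math.log(p * num_categories)
    return kl


balanced = kl_to_uniform([50, 50])   # perfectly uniform over two categories
collapsed = kl_to_uniform([100, 0])  # every image falls in one category
```

A perfectly balanced set of generated images scores 0, and the value rises toward log K as generations collapse onto a single one of K categories, so lower is better.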

Generalizability to Other Models

Unlike personalization methods that train the embeddings for a specific model (because they use diffusion losses), the tokens learned by ITI-GEN are transferable between different models.

Compatibility with ControlNet

ITI-GEN promotes inclusiveness of ControlNet by using the inclusive tokens trained with “a headshot of a person”.

Compatibility with InstructPix2Pix

Given an image and a written instruction, InstructPix2Pix (IP2P) follows the instruction to edit the image. ITI-GEN enables inclusive instruction-based image editing.

Other Domains: Scene Images

Besides human faces, we apply ITI-GEN to another domain: scene images. We argue that inclusive text-to-image generation should account for attributes not only of humans but also of scenes, objects, and even environmental factors.


    title={{ITI-GEN}: Inclusive Text-to-Image Generation},
    author={Zhang, Cheng and Chen, Xuanbai and Chai, Siqi and Wu, Henry Chen and Lagun, Dmitry and 
            Beeler, Thabo and De la Torre, Fernando},