Deep Floyd IF

Overview

Download prerequisites

  1. Miniconda
  2. Git
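
With both installed, a quick sanity check from a fresh terminal (just version prints, nothing IF-specific) confirms they're on your PATH:

conda --version
git --version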

Setup Environment

Clone the git repo

git clone https://github.com/deep-floyd/IF.git

cd to the repo folder

In my case:

cd C:\Users\trima\Documents\GitHub\IF

Create the conda environment

conda create --name IF python=3.10.10

Activate the environment

conda activate IF

Install requirements

pip install -r requirements.txt --upgrade
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
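
Before moving on, it's worth checking that the nightly build actually sees the GPU. Something like the one-liner below (just a sanity check, the exact version string will differ) should print a cu118 build and True:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"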

Setup Program

Download the model weights from Hugging Face

WARNING

IF-I-XL-v1.0 is ~262 GB

git clone https://huggingface.co/DeepFloyd/IF-I-XL-v1.0.git

WARNING

IF-II-L-v1.0 is ~182 GB

git clone https://huggingface.co/DeepFloyd/IF-II-L-v1.0.git

WARNING

stable-diffusion-x4-upscaler is ~26.1 GB

git clone https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler.git
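
These Hugging Face repos store the checkpoints with Git LFS, so if the clones finish suspiciously fast and the weight files are only a few bytes, LFS probably isn't enabled. Depending on how Git was installed, you may need to turn it on once before cloning:

git lfs install

Note too that the DeepFloyd models are gated on Hugging Face, so you may need to accept the license on the model pages and authenticate with your Hugging Face account before the clones will work.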

Run Deep Floyd IF

Put the code below in a file called run.py, then run it from the Anaconda Prompt with python run.py (with the IF environment still activated).

import gc
import time

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil

# Cap this process's caching allocator at half of the card's VRAM
torch.cuda.set_per_process_memory_fraction(0.5)

def flush():
    # Drop Python references and return cached VRAM to the driver
    gc.collect()
    torch.cuda.empty_cache()

# stage 1: 64x64 base model (also hosts the T5 text encoder used for prompt embeddings)
stage_1 = DiffusionPipeline.from_pretrained("./IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16, safety_checker=None)

# stage 2: 64x64 -> 256x256 super-resolution; text_encoder=None because it reuses stage 1's embeddings
stage_2 = DiffusionPipeline.from_pretrained("./IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, safety_checker=None)

# stage 3: 256x256 -> 1024x1024 Stable Diffusion x4 upscaler
stage_3 = DiffusionPipeline.from_pretrained("./stable-diffusion-x4-upscaler", torch_dtype=torch.float16, safety_checker=None)

# Memory management: keep weights in CPU RAM and move them to the GPU only when needed
stage_1.enable_sequential_cpu_offload()
stage_2.enable_model_cpu_offload()
stage_3.enable_model_cpu_offload()

# prompt
prompt = 'an anime girl wearing a shirt that says "hello world"'

# text embeds, computed once by stage 1's T5 encoder and reused by stage 2
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# seed settings: seed from the clock so every run produces a different image
time_seed = int(time.time())
generator = torch.manual_seed(time_seed)

# stage 1: generate the 64x64 image
image = stage_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt").images
pt_to_pil(image)[0].save("./if_stage_I.png")

del stage_1
flush()

# stage 2: upscale to 256x256
image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

del stage_2
flush()

# stage 3: upscale to 1024x1024
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")

Conclusion

My takeaways from Deep Floyd IF:

  • The 16GB of VRAM in my RTX 4080 isn't enough to run the third stage, so the largest output this implementation can make is 256x256
  • Deep Floyd IF has extremely slow inference times, upwards of two minutes per 256x256 image. I've played around a bit with memory management (see the sketch after this list) but don't know enough about PyTorch to get VRAM usage under 16GB; I only got stage 3 working in CPU mode, which sent inference times soaring past 40 minutes per 1024x1024 image.
  • Community adoption has been slow, probably because of slow inference times
  • Not really seeing an advantage of this over Stable Diffusion + ControlNet
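
For anyone who wants to keep poking at the stage 3 memory problem, here is a rough sketch of the memory-saving switches diffusers exposes. I haven't verified that this gets the upscaler under 16GB on my setup, so treat it as a starting point rather than a fix:

# Untested sketch: more aggressive memory management for the x4 upscaler (stage 3)
import torch
from diffusers import DiffusionPipeline

stage_3 = DiffusionPipeline.from_pretrained("./stable-diffusion-x4-upscaler", torch_dtype=torch.float16)

# Stream weights to the GPU one submodule at a time instead of one whole model at a time
# (slower than enable_model_cpu_offload, but with a lower peak VRAM footprint)
stage_3.enable_sequential_cpu_offload()

# Compute attention in slices to cut peak memory during the 1024x1024 upscale
stage_3.enable_attention_slicing()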