Deep Floyd IF
Overview
Deep Floyd IF is a text-to-image diffusion model from the DeepFloyd lab at Stability AI. It generates images as a cascade: a 64x64 base stage, a 256x256 super-resolution stage, and a 1024x1024 upscaling stage. This post covers getting it running locally on Windows.
Download prerequisites
Before starting, make sure you have:
- Git, plus Git LFS (the model weights on Hugging Face are stored with LFS)
- Anaconda or Miniconda
- An NVIDIA GPU with up-to-date drivers (the PyTorch install below targets CUDA 11.8)
Setup Environment
Clone the git repo
```
git clone https://github.com/deep-floyd/IF.git
```
cd to the repo folder
In my case:
```
cd C:\Users\trima\Documents\GitHub\IF
```
Create the conda environment
```
conda create --name IF python=3.10.10
```
Activate the environment
```
conda activate IF
```
Install requirements
```
pip install -r requirements.txt --upgrade
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
```
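Before downloading hundreds of gigabytes of weights, it's worth confirming that pip actually installed a CUDA-enabled build. A minimal sanity check (this script is my own addition, not part of the IF repo):

```python
# check_torch.py - confirm the nightly PyTorch build can see the GPU
import torch

print(torch.__version__)          # should include a CUDA tag, e.g. "+cu118"
print(torch.cuda.is_available())  # must be True, or everything below runs on CPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```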
Setup Program
Download the model weights from Hugging Face. These repos store their weights in Git LFS, so install and initialize git-lfs before cloning.
WARNING: IF-I-XL-v1.0 is ~262 GB
```
git clone https://huggingface.co/DeepFloyd/IF-I-XL-v1.0.git
```
WARNING: IF-II-L-v1.0 is ~182 GB
```
git clone https://huggingface.co/DeepFloyd/IF-II-L-v1.0.git
```
WARNING: stable-diffusion-x4-upscaler is ~26.1 GB
```
git clone https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler.git
```
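As an alternative to git clone (which also pulls the full git history and roughly doubles the disk footprint), the huggingface_hub library can fetch just the working files. A sketch, assuming the local folder names that run.py expects below; note the DeepFloyd repos may require a logged-in Hugging Face account that has accepted their license:

```python
# download_weights.py - alternative to git clone; skips the .git history
from huggingface_hub import snapshot_download

repos = [
    "DeepFloyd/IF-I-XL-v1.0",
    "DeepFloyd/IF-II-L-v1.0",
    "stabilityai/stable-diffusion-x4-upscaler",
]
for repo in repos:
    snapshot_download(
        repo_id=repo,
        local_dir=f"./{repo.split('/')[-1]}",  # matches the paths used in run.py
    )
```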
Run Deep Floyd IF
Put the code below in a file called run.py, then run it from the Anaconda Prompt with `python run.py`.
```python
import gc
import time

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil

# cap this process at half of the card's VRAM
torch.cuda.set_per_process_memory_fraction(0.5)

def flush():
    """Release Python and CUDA memory between stages."""
    gc.collect()
    torch.cuda.empty_cache()

# stage 1: 64x64 base model
stage_1 = DiffusionPipeline.from_pretrained(
    "./IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16, safety_checker=None
)

# stage 2: 64x64 -> 256x256 super-resolution (it reuses stage 1's text
# embeddings, so its own text encoder can be dropped)
stage_2 = DiffusionPipeline.from_pretrained(
    "./IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, safety_checker=None
)

# stage 3: 256x256 -> 1024x1024 upscaler
stage_3 = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-x4-upscaler", torch_dtype=torch.float16, safety_checker=None
)

# memory management: offload weights to CPU when they aren't in use
stage_1.enable_sequential_cpu_offload()
stage_2.enable_model_cpu_offload()
stage_3.enable_model_cpu_offload()

# prompt
prompt = 'an anime girl wearing a shirt that says "hello world"'

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# seed from the clock so each run is different
time_seed = int(time.time())
generator = torch.manual_seed(time_seed)

# stage 1
image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

del stage_1
flush()

# stage 2
image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

del stage_2
flush()

# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")
```
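If all three stages finish, you should have three images at the cascade's three target resolutions: 64x64 from stage 1, 256x256 from stage 2, and 1024x1024 from the upscaler. A quick check (my own addition):

```python
# check_outputs.py - print the resolution of each saved stage output
from PIL import Image

for name in ["if_stage_I.png", "if_stage_II.png", "if_stage_III.png"]:
    with Image.open(f"./{name}") as im:
        print(name, im.size)  # expect (64, 64), (256, 256), (1024, 1024)
```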
Conclusion
My takeaways from Deep Floyd IF:
- The 16GB of VRAM in my RTX 4080 isn't enough to run the third stage, so the largest output this implementation can make is 256x256
- Deep Floyd IF has extremely slow inference times, upwards of two minutes per 256x256 image. I've played around a bit with memory management but don't know enough about PyTorch to get VRAM usage under 16GB; the levers I'd try next are sketched after this list. I did get stage 3 working in CPU-only mode, which sent inference times soaring past 40 minutes per 1024x1024 image.
- Community adoption has been slow, probably because of slow inference times
- Not really seeing an advantage of this over Stable Diffusion + ControlNet
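For completeness, here are the memory levers mentioned above. This is a sketch of hypothetical additions to run.py, not something I've verified gets stage 3 under 16GB:

```python
# memory_tweaks.py - hypothetical additions to run.py, unverified on 16 GB
import torch

# attention slicing computes attention in chunks, trading speed for VRAM
stage_3.enable_attention_slicing()

# sequential (per-module) offload, in place of enable_model_cpu_offload(),
# keeps only one module on the GPU at a time: much slower, much smaller peak
stage_3.enable_sequential_cpu_offload()

# watch peak usage between stages to see where the budget actually goes
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
torch.cuda.reset_peak_memory_stats()
```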