My Musings Through Stable Diffusion

Diffusing the Musings

Diffusion
Analyzing Models
Exploring the various knobs and dials of stable diffusion.
Author

Salman Naqvi

Published

Thursday, 13 April 2023

An antique 18th century painting of a gorilla eating a plate of chips.

I recently began fastai Course Part 2: a course where one dives into the deeper workings of deep learning by fully implementing Stable Diffusion.

In the first lesson, we play around with diffusers using the Hugging Face Diffusers library. Below are things I have noticed; my musings.
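
Throughout, I worked with pipelines along these lines. Here's a minimal sketch of the setup; the checkpoint id is an assumption on my part, not necessarily the one the lesson uses.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion pipeline from the Hugging Face Hub.
# The checkpoint id is an assumption; any compatible checkpoint works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "An antique 18th century painting of a gorilla eating a plate of chips."
image = pipe(prompt).images[0]
image.save("gorilla.png")
```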

Steps

Diffusion is simply a process whereby noise is progressively removed from a noisy image. A single step can be thought of as a single portion of noise being removed.
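
In the Diffusers library, the step count is controlled by the pipeline's `num_inference_steps` argument. Here's a minimal sketch that varies the total step count across runs (rather than capturing the intermediates of a single run, as the gif below does); the seed is an arbitrary choice of mine.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The misspelling of "intertwined" is in the original prompt.
prompt = (
    "A depiction of a ring comprised of interwined serpents, "
    "topped with a single jewel of emerald."
)

# Re-seed each run so only the step count differs between images.
# Fewer steps leave more residual noise; more steps refine the image further.
for steps in (4, 12, 24, 48):
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
    image.save(f"ring_{steps:02d}_steps.png")
```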

A depiction of a ring comprised of interwined serpents, topped with a single jewel of emerald.

Below is the evolution of the image above over 48 steps. Each new image contains less and less of what the diffuser thinks is noise.

A gif showing how the diffuser came to generating the image 'A depiction of a ring comprised of interwined serpents, topped with a single jewel of emerald.'

The gif itself has artefacts due to compression…

This image displays all images in the gif above in a grid.

It still managed to generate a pretty good image despite the misspelling of “intertwined”!

When It Doesn’t Work Well

I’ve found that a diffuser doesn’t work well when one prompts it for things that, I assume, it hasn’t “seen”, that is, hasn’t been trained on. It sounds obvious, but it’s really interesting to see the result.

A grasshopper riding a bunny.

A grasshopper riding a bunny.

A grasshopper riding a bunny.

A quick Google search also doesn’t return any images matching the prompt in the top results.

A screenshot of Google Image Search results showing no picture of a grasshopper riding a bunny.

CFG (Classifier-Free Guidance)

Also known simply as guidance, CFG is a value that tells the diffuser how closely it should stick to the prompt.

A lower guidance leads to more varied and random images that are loosely related to the prompt. A higher guidance produces more relevant images.

I’ve found that too high a guidance leads to images with too much contrast.
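
A grid like the one below can be generated with a loop along these lines; `guidance_scale` is the relevant Diffusers argument, and the fixed seed (again an arbitrary choice of mine) keeps everything except the guidance constant.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "An antique 18th century painting of a gorilla eating a plate of chips."

# Re-seed each run so only the guidance scale varies between rows.
for guidance in (1, 2.5, 5, 7.5, 10, 25, 50):
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, guidance_scale=guidance, generator=generator).images[0]
    image.save(f"gorilla_cfg_{guidance}.png")
```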

A grid of images generated from the prompt, 'An antique 18th century painting of a gorilla eating a plate of chips.' Each row shows images generated with increasing guidance.

An antique 18th century painting of a gorilla eating a plate of chips.

The image above shows rows with increasing levels of guidance (1, 2.5, 5, 7.5, 10, 25, 50). 7.5 is the sweet spot.

Negative Prompts

The best way to think about a negative prompt is as something that guides the diffuser away from generating a certain entity.

Take the image below as an example.

Another variation generated from the prompt, 'An antique 18th century painting of a gorilla eating a plate of chips.' There is a yellow circle around the gorilla.

An antique 18th century painting of a gorilla eating a plate of chips.

I generated the image again using the exact same seed and prompt, but this time with the negative prompt “yellow circle”.

The same prompt used again, but the negative prompt now removes the yellow circle and adds a rectangular border around the gorilla.

Prompt: An antique 18th century painting of a gorilla eating a plate of chips. | Negative Prompt: yellow circle
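
In Diffusers, this corresponds to the `negative_prompt` argument. A minimal sketch, with the seed once again an arbitrary stand-in for whichever one produced the original image:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "An antique 18th century painting of a gorilla eating a plate of chips."

# Same seed and prompt as before; the negative prompt steers the
# diffuser away from drawing a yellow circle.
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(prompt, negative_prompt="yellow circle", generator=generator).images[0]
```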

Image to Image

Instead of starting from pure noise, one can make a diffuser begin from an existing image. The diffuser uses the image as a guide and doesn’t match it one-to-one.
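
In Diffusers this is handled by the img2img pipeline. A minimal sketch, where the starting image's file name is a placeholder:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder file name for the mocked-up starting image.
init_image = Image.open("bench_mockup.png").convert("RGB").resize((512, 512))

# strength controls how far the diffuser may stray from the starting image:
# near 0 it barely changes, near 1 it is almost ignored.
image = pipe(
    prompt="A bench under a tree in a park",
    image=init_image,
    strength=0.75,
).images[0]
```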

I quickly mocked up the following image.

Clip art of a bench and a tree behind a sky and on top of grass. There is the sun and a couple of clouds in the sky.

I fed it to a diffuser along with a prompt, and it output the following.

A bench under a tree in a park

I then generated another image from that output.

A low poly 3D render of a bench under a tree in a park

Further Adapting a Diffuser

There are two ways one can further customize a diffuser to produce desired images: textual inversion and Dreambooth.

Textual Inversion

A diffuser contains a text encoder. This encoder is responsible for parsing the prompt and giving it a mathematical representation.

A text encoder can only parse according to its vocabulary. If it encounters words not in its vocabulary, the diffuser will be unable to produce an image relevant to the prompt.

In a nutshell, textual inversion adds new words to the vocabulary of the text encoder so it can parse prompts with those new words.

I managed to generate the image below by adding the word “Mr Doodle” to the vocabulary of the diffuser’s text encoder.
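
With Diffusers, a learned embedding can be registered on the pipeline roughly like this; the embedding path and token below are placeholders for a trained Mr Doodle concept, not real artifacts.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder path/token for a textual-inversion embedding trained on
# Mr Doodle's artwork; the new token then becomes usable in prompts.
pipe.load_textual_inversion("./mr-doodle-embedding", token="<mr-doodle>")

image = pipe(
    "An antique 18th century painting of a gorilla eating a plate of chips "
    "in the style of <mr-doodle>"
).images[0]
```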

An antique 18th century painting of a gorilla eating a plate of chips in the style of Mr Doodle

Dreambooth

Dreambooth is more akin to traditional fine-tuning methods. A diffuser is further trained on images one supplies to it.
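
Once a model has been fine-tuned this way, loading it is no different from loading a stock checkpoint. A minimal sketch, where the output directory is a hypothetical path from a prior training run:

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical local directory produced by a Dreambooth training run.
pipe = StableDiffusionPipeline.from_pretrained(
    "./dreambooth-output", torch_dtype=torch.float16
).to("cuda")

# "sks" is the rare placeholder token conventionally used in Dreambooth
# examples to refer to the subject the model was fine-tuned on.
image = pipe("A photo of sks dog in a bucket").images[0]
```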

So End My Musings

If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them in the comment section below!
