My Musings Through Stable Diffusion
Diffusing the Musings
I recently began fastai Course Part 2: a course where one dives into the deeper workings of deep learning by fully implementing Stable Diffusion.
In the first lesson, we play around with diffusers using the Hugging Face Diffusers library. Below are things I have noticed; my musings.
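For context, the basic loop looks something like this. It's a minimal sketch; the checkpoint id and prompt are placeholders of my own, not necessarily the ones from the course:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint onto the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Generate an image from a text prompt.
image = pipe("an astronaut riding a horse").images[0]
image.save("astronaut.png")
```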
Steps
Diffusion is simply a process whereby noise is progressively removed from a noisy image. A single step can be thought of as a single portion of noise being removed.
Below is the evolution of the image above over 48 steps. Each successive image contains less and less noise (or rather, what the diffuser thinks is noise).
It still managed to generate a pretty good image despite the misspelling of “intertwined”!
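One way to reproduce a grid like this is to rerun the pipeline with an increasing number of steps from a fixed seed. A rough sketch, reusing the pipeline from above (the library also supports callbacks for capturing intermediate latents, which is another route):

```python
import torch

# Fix the seed so every run starts from the same noise; only the step count varies.
for steps in (1, 8, 24, 48):
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(
        "two trees with intertwined branches",  # placeholder prompt
        num_inference_steps=steps,
        generator=generator,
    ).images[0]
    image.save(f"steps_{steps}.png")
```

Fewer steps leave more residual noise; more steps give a cleaner image but take longer.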
When It Doesn’t Work Well
I’ve found that a diffuser doesn’t work well when one prompts it for things that, I assume, it hasn’t “seen” or been trained on before. It sounds obvious, but it’s really interesting to see the result of it firsthand.
A quick Google search also doesn’t return any images matching the prompt in the top results.
CFG (Classifier Free Guidance)
Also known simply as guidance, CFG is a value that tells the diffuser how closely it should stick to the prompt.
A lower guidance leads to more varied and random images that are loosely related to the prompt. A higher guidance produces more relevant images.
I’ve found that too high a guidance leads to images with too much contrast.
The image above shows rows with increasing levels of guidance (1, 2.5, 5, 7.5, 10, 25, 50). 7.5 is the sweet spot.
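In code, guidance is just the guidance_scale argument. A sketch of the sweep, with the seed pinned so only the guidance changes between images (the prompt is a placeholder):

```python
import torch

for g in (1, 2.5, 5, 7.5, 10, 25, 50):
    # Same seed each time, so the only difference is the guidance.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(
        "a watercolor painting of a fox",  # placeholder prompt
        guidance_scale=g,
        generator=generator,
    ).images[0]
    image.save(f"guidance_{g}.png")
```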
Negative Prompts
The best way to think about a negative prompt is that it guides the diffuser away from generating a certain entity.
Take the image below as an example.
I generated the image again using the exact same seed and prompt, but also used the following negative prompt, “yellow circle”.
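In Diffusers that's the negative_prompt argument. A sketch, with a placeholder prompt and seed, where pinning the seed ensures only the negative prompt differs between the two runs:

```python
import torch

# pipe is the StableDiffusionPipeline set up earlier.
generator = torch.Generator("cuda").manual_seed(1024)  # same seed as the original image
image = pipe(
    "colorful geometric shapes on a white background",  # placeholder prompt
    negative_prompt="yellow circle",  # steer the diffuser away from yellow circles
    generator=generator,
).images[0]
image.save("no_yellow_circle.png")
```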
Image to Image
Instead of starting from noise, one can make a diffuser begin from an existing image. The diffuser uses the image as a guide rather than matching it one to one.
I quickly mocked up the following image.
I input it to a diffuser with a prompt, and it output the following.
I then further generated another image from this one.
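A sketch of the image-to-image flow (the checkpoint, file names, prompt, and strength values are my own assumptions; in older versions of the library the image argument was called init_image):

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe_i2i = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Start from the mocked-up image instead of pure noise.
init_image = Image.open("mockup.png").convert("RGB").resize((512, 512))

# strength controls how far the result may drift from the input image.
image = pipe_i2i(
    prompt="an oil painting of a lakeside cabin",  # placeholder prompt
    image=init_image,
    strength=0.8,
).images[0]
image.save("img2img_pass1.png")

# Feed the result back in for a second pass.
image2 = pipe_i2i(
    prompt="an oil painting of a lakeside cabin",
    image=image,
    strength=0.6,
).images[0]
image2.save("img2img_pass2.png")
```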
Further Adapting a Diffuser
There are two ways one can further customize a diffuser to produce desired images: textual inversion and Dreambooth.
Textual Inversion
A diffuser contains a text encoder. This encoder is responsible for parsing the prompt and giving it a mathematical representation.
A text encoder can only parse according to its vocabulary. If it encounters words not in its vocabulary, the diffuser will be unable to produce an image relevant to the prompt.
In a nutshell, textual inversion adds new words to the vocabulary of the text encoder so it can parse prompts with those new words.
I managed to generate the image below by adding the word “Mr Doodle” to the vocabulary of the diffuser’s text encoder.
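Recent versions of Diffusers expose this via load_textual_inversion. A sketch, where the embedding file path and the token name are hypothetical placeholders:

```python
# Load a learned embedding and register it under a new token.
# The file path and token name are hypothetical placeholders.
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<mr-doodle>")

# The new token can now appear in prompts like any other word.
image = pipe("a busy city street in the style of <mr-doodle>").images[0]
image.save("mr_doodle_street.png")
```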
Dreambooth
Dreambooth is more akin to traditional fine-tuning methods. A diffuser is further trained on images one supplies to it.
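For reference, the Diffusers repo ships an example Dreambooth training script that is invoked along these lines. The paths, instance prompt, and hyperparameters below are placeholders, and flag names can vary between versions:

```bash
accelerate launch examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --instance_data_dir="./my_images" \
  --instance_prompt="a photo of sks dog" \
  --resolution=512 \
  --train_batch_size=1 \
  --learning_rate=5e-6 \
  --max_train_steps=400 \
  --output_dir="./dreambooth-model"
```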
So End My Musings
If you have any comments, questions, suggestions, feedback, criticisms, or corrections, please do post them in the comments section below!