Notes on running Stable Diffusion locally

Recently, I started trying to do some MLOps, not for a career, just out of interest. I wanted to find the easiest way in, and was surprised at how painless the whole process was. I used Easy Diffusion, which is super easy to run (download, extract, run ./start.sh), and from there it'll download the core Stable Diffusion .ckpt, which is a checkpoint file: a snapshot of the model's weights saved during (or at the end of) a long training run, conventionally stored as a Python pickle. Safetensors is another file format I encountered; it's a safer alternative to pickling that serialises only the raw tensors plus a bit of metadata, so loading a file can't execute arbitrary code the way unpickling can.
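
For what it's worth, here's a minimal sketch of loading the two formats from Python. The file paths are my own placeholders, not anything Easy Diffusion guarantees:

```python
import torch
from safetensors.torch import load_file

# .ckpt files are (usually) torch pickles: a dict of tensors plus whatever
# else the trainer stashed. Unpickling can run arbitrary code, hence the
# weights_only flag on recent PyTorch versions.
ckpt = torch.load(
    "models/stable-diffusion/sd-v1-5.ckpt",  # placeholder path
    map_location="cpu",
    weights_only=True,
)

# .safetensors files are a flat mapping of names -> tensors, nothing else,
# so loading them never executes code.
state_dict = load_file(
    "models/stable-diffusion/sd-v1-5.safetensors",  # placeholder path
    device="cpu",
)
```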

💡
Apple's M-series chips don't have dedicated VRAM; the GPU shares the SoC's unified memory with the CPU, so tensor computation runs against the same memory pool rather than a separate graphics card.
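
A quick way to confirm PyTorch can actually see that on-SoC GPU is the MPS backend check (a small sketch, assuming a reasonably recent PyTorch build):

```python
import torch

# On Apple Silicon, the Metal Performance Shaders (MPS) backend is how
# PyTorch reaches the on-SoC GPU; there is no separate CUDA-style device.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4, 4, device=device)
print(x.device)  # mps:0 on an M-series Mac, cpu elsewhere
```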
  • Seed: The initial value fed to the random number generator. Here it's set to 0, so runs with the same parameters will produce the same image every time.
  • Number of Images: The program will generate 1 image in total.
  • Model: The model used for image generation is "cyberrealistic_v32".
  • Custom VAE: No custom Variational Autoencoder (VAE) is used in this setup.
  • Sampler: The program uses the "Euler Ancestral" sampling technique.
  • Image Size: The generated images will be 512 × 512 pixels.
  • Inference Steps: The generation process runs for 25 inference steps.
  • Guidance Scale: How strictly the generated image follows the prompt (the CFG scale); higher values stick closer to the prompt at the cost of variety.
  • Hypernetwork: No hypernetwork is used in this setup.
  • Output Format: The generated images will be saved in JPEG format.
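
These settings map fairly directly onto Hugging Face's diffusers library, if you'd rather drive this from Python than through Easy Diffusion's UI. A rough sketch; the checkpoint path, the prompt, and the guidance value of 7.5 are all placeholders (the notes above don't record them):

```python
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

# Load a single-file checkpoint like the one Easy Diffusion downloads.
# The filename is an assumption; point it at your actual model file.
pipe = StableDiffusionPipeline.from_single_file(
    "models/stable-diffusion/cyberrealistic_v32.safetensors",
    torch_dtype=torch.float16,
)
# Swap in the "Euler Ancestral" sampler from the settings above.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("mps")  # or "cuda" / "cpu"

# Seed 0: same parameters -> same image, every run.
generator = torch.Generator(device="cpu").manual_seed(0)

image = pipe(
    "a photo of a cat",      # placeholder prompt
    width=512,
    height=512,
    num_inference_steps=25,
    guidance_scale=7.5,      # CFG "strictness"; 7.5 is a common default, not from the notes
    generator=generator,
).images[0]

image.save("out.jpg")  # JPEG output, matching the setting above
```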

Two classic families of deep generative artificial neural networks are GANs and VAEs. Let's focus on the VAE for now, since that's the one Stable Diffusion needs.

Variational Autoencoders (VAEs) are probabilistic models that learn a low-dimensional representation (latent space) of the input data. They consist of an encoder network, which maps the input data to the latent space, and a decoder network, which reconstructs the data from the latent space back to the original data space. VAEs are trained by maximizing the evidence lower bound (ELBO), which is a combination of reconstruction loss (measuring how well the decoder reconstructs the input) and a regularization term (measuring how close the latent space distribution is to a prior distribution, usually Gaussian). VAEs can generate new samples by sampling random points from the latent space and decoding them using the decoder network.
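
To make the ELBO concrete, here's a minimal VAE sketch in PyTorch. The layer sizes are arbitrary illustration (MNIST-shaped inputs), not anything Stable Diffusion's VAE actually uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: encoder -> (mu, logvar) -> sample z -> decoder."""

    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterisation trick: z = mu + sigma * eps, so gradients
        # can flow back through the sampling step.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def neg_elbo(x, x_recon, mu, logvar):
    # Reconstruction term: how well the decoder reproduces the input.
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction="sum")
    # Regularisation term: KL divergence between q(z|x) and the N(0, I) prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # maximising the ELBO == minimising this

model = TinyVAE()
x = torch.rand(8, 784)  # fake batch for illustration
x_recon, mu, logvar = model(x)
neg_elbo(x, x_recon, mu, logvar).backward()

# Generation: sample from the prior and decode, no encoder involved.
with torch.no_grad():
    samples = torch.sigmoid(model.dec(torch.randn(4, 16)))
```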

Both GANs and VAEs have their strengths and weaknesses. GANs often produce sharper and more realistic samples, but they can be challenging to train and may suffer from mode collapse (where the generator only produces a limited variety of samples). VAEs, on the other hand, have a more interpretable latent space and are better at handling missing data, but the generated samples may be less sharp compared to GANs.

Deep generative artificial neural networks have found applications in various fields, including image generation, style transfer, text-to-image synthesis, data augmentation, drug discovery, and more. They have significantly advanced the field of generative modeling and opened up new opportunities for creative and data-driven applications.