
I'm trying to understand how diffusion models, like Stable Diffusion, work.

From what I understand, there is an autoencoder split across the start and end, with a denoiser network in the middle. After encoding, the latent representation of the image is 64x64, and the text prompt is encoded, for example with CLIP, into 77x768 word embeddings.
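To make the shape mismatch concrete, here is a minimal sketch of the two tensors (PyTorch and a batch dimension are my assumptions; the 4 latent channels match the usual Stable Diffusion setup but are not stated above):

```python
import torch

# Assumed shapes: batch size 1, 4-channel 64x64 image latent from the VAE
# encoder, and 77 CLIP token embeddings of dimension 768 for the prompt.
image_latent = torch.randn(1, 4, 64, 64)   # VAE-encoded image latent
text_embedding = torch.randn(1, 77, 768)   # CLIP token embeddings

print(image_latent.shape)    # torch.Size([1, 4, 64, 64])
print(text_embedding.shape)  # torch.Size([1, 77, 768])

# A plain torch.cat along any single axis fails (or only works after
# flattening away the spatial / token structure), which is exactly the
# mismatch the question below is about.
```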

How are these differently shaped arrays concatenated in these networks?

Nyxeria

0 Answers