
I'm trying to understand how diffusion models, like Stable Diffusion, work.

From what I understand, there is an autoencoder split across the start and end, with a denoiser network in the middle. After encoding, the latent representation of the image is 64x64, and the text prompt is encoded, for example with CLIP, into 77x768 word embeddings.
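To make the shape mismatch concrete, here is a minimal sketch of the two tensors (PyTorch and a batch dimension are my assumptions; the 4 latent channels match the usual Stable Diffusion setup but are not stated above):

```python
import torch

# Assumed shapes: batch size 1, 4-channel 64x64 image latent from the VAE
# encoder, and 77 CLIP token embeddings of dimension 768 for the prompt.
image_latent = torch.randn(1, 4, 64, 64)   # VAE-encoded image latent
text_embedding = torch.randn(1, 77, 768)   # CLIP token embeddings

print(image_latent.shape)    # torch.Size([1, 4, 64, 64])
print(text_embedding.shape)  # torch.Size([1, 77, 768])

# A plain torch.cat along any single axis fails (or only works after
# flattening away the spatial / token structure), which is exactly the
# mismatch the question below is about.
```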

How are these differently shaped arrays concatenated in these networks?

Nyxeria

0 Answers