How do I fine-tune the text encoder when training a Stable Diffusion model with DreamBooth?

Question

I am trying to train a model on a specific art style using only a few images, and Stable Diffusion 1.5 gives poor results for my prompts; mostly, the faces are messed up.

I tried another checkpoint from Hugging Face that is built on the same Stable Diffusion 1.5 model, but its author also fine-tuned the text encoder, and it produces much better results than base Stable Diffusion 1.5 for the same set of prompts.

I searched online and couldn't find more than the following about text encoders:

- Use the --train_text_encoder argument to train the text encoder
- A minimum of 24 GB of VRAM is needed
- The best results come from combining DreamBooth with text encoder fine-tuning
- Use text files containing the prompts for the dataset
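For reference, if the trainer is the DreamBooth example script from the Hugging Face diffusers repository (an assumption; other trainers use different flags), --train_text_encoder is one extra flag on the usual `accelerate launch train_dreambooth.py` command. The sketch below only assembles that command line. The helper name, model path, folders, the "sks" instance prompt, and the hyperparameter values are hypothetical placeholders, not tuned recommendations. As far as I know, that particular script applies the single --instance_prompt to every training image rather than reading a per-image .txt file.

```python
import shlex


def build_dreambooth_command(
    model: str,
    instance_dir: str,
    instance_prompt: str,
    output_dir: str,
    train_text_encoder: bool = True,
    max_train_steps: int = 800,
    learning_rate: float = 2e-6,
) -> list[str]:
    """Hypothetical helper: build argv for diffusers' examples/dreambooth/train_dreambooth.py."""
    cmd = [
        "accelerate", "launch", "train_dreambooth.py",
        f"--pretrained_model_name_or_path={model}",
        f"--instance_data_dir={instance_dir}",   # folder of training images
        f"--instance_prompt={instance_prompt}",  # one prompt shared by all images
        f"--output_dir={output_dir}",
        "--resolution=512",
        "--train_batch_size=1",
        f"--learning_rate={learning_rate}",      # placeholder value
        f"--max_train_steps={max_train_steps}",  # placeholder value
    ]
    if train_text_encoder:
        # Also update the text encoder's weights, not only the UNet's.
        cmd.append("--train_text_encoder")
    return cmd


if __name__ == "__main__":
    cmd = build_dreambooth_command(
        model="path/to/stable-diffusion-v1-5",  # placeholder path
        instance_dir="./my_art",                # placeholder folder
        instance_prompt="a painting in sks style",
        output_dir="./dreambooth-out",
    )
    # Print a shell-quoted command line that could be pasted into a terminal.
    print(shlex.join(cmd))
```

Keeping the arguments as a list (rather than one string) means the prompt with spaces stays a single argument if the list is later passed to `subprocess.run`.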

That's all I could find! There is no example of what my dataset should look like. Should it be like LoRA training, where there is a 0001.png and a 0001.txt? If so, what should the .txt file contain?
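Trainers that do read per-image captions (kohya-ss sd-scripts, for one) generally expect the LoRA-style layout described above: every image has a .txt file with the same stem, holding the prompt for that image, for example 0001.txt containing a line such as `sks style, portrait of a woman, oil painting` (a hypothetical caption). The helper below is a hypothetical sketch that pairs images with captions under that assumption and fails loudly when a caption is missing; it is not taken from any particular trainer.

```python
from pathlib import Path

# Image extensions to treat as training images (assumed set).
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def load_caption_pairs(folder: str) -> list[tuple[Path, str]]:
    """Pair each image in `folder` with the text of the same-stem .txt caption file."""
    pairs = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue  # skip the .txt files themselves and anything else
        caption_file = image.with_suffix(".txt")  # 0001.png -> 0001.txt
        if not caption_file.exists():
            raise FileNotFoundError(f"missing caption for {image.name}")
        caption = caption_file.read_text(encoding="utf-8").strip()
        pairs.append((image, caption))
    return pairs
```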

What else do I need to do? Will it work if I just add the --train_text_encoder argument and switch to a dataset laid out as described above?

What types of text encoders are available, and how do I fine-tune a text encoder for DreamBooth?
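The text encoder being fine-tuned is fixed by the base model rather than chosen separately: SD 1.x uses OpenAI's CLIP ViT-L/14, SD 2.x uses OpenCLIP ViT-H/14, and SDXL uses both CLIP ViT-L/14 and OpenCLIP ViT-bigG/14. In diffusers-format checkpoints they are stored in the `text_encoder` subfolder (plus `text_encoder_2` for SDXL). The lookup below just records that; the short model keys and the helper name are hypothetical labels.

```python
# Text encoders used by common Stable Diffusion base models, keyed by a
# hypothetical short label, paired with the diffusers checkpoint subfolder.
TEXT_ENCODERS: dict[str, list[tuple[str, str]]] = {
    "sd-1.x": [("text_encoder", "OpenAI CLIP ViT-L/14")],
    "sd-2.x": [("text_encoder", "OpenCLIP ViT-H/14")],
    "sdxl": [
        ("text_encoder", "OpenAI CLIP ViT-L/14"),
        ("text_encoder_2", "OpenCLIP ViT-bigG/14"),
    ],
}


def text_encoder_subfolders(base_model: str) -> list[str]:
    """Return the checkpoint subfolder(s) holding the text encoder(s) for a base model."""
    if base_model not in TEXT_ENCODERS:
        raise ValueError(
            f"unknown base model {base_model!r}; expected one of {sorted(TEXT_ENCODERS)}"
        )
    return [subfolder for subfolder, _ in TEXT_ENCODERS[base_model]]


if __name__ == "__main__":
    for name, encoders in TEXT_ENCODERS.items():
        print(name, "->", ", ".join(f"{sub} ({desc})" for sub, desc in encoders))
```

Since the question is about a model built on Stable Diffusion 1.5, only the single CLIP ViT-L/14 encoder in `text_encoder` would be involved.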
