I want to install "Stable Diffusion" on a Paperspace virtual machine (OS: Ubuntu).
I have customized the installation to use "xformers".
- Creation of the ldm environment (via the "environment.yaml" file):
name: ldm
channels:
- pytorch
- defaults
dependencies:
- python=3.10.9
- pip=23.0.1
- numpy=1.23.1
- Installation of dependencies:
conda activate ldm
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install xformers -c xformers
pip install transformers diffusers invisible-watermark
pip install omegaconf einops pytorch_lightning open_clip_torch
pip install -e .
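A quick sanity check I run after the install, to confirm that torch, CUDA and xformers are importable (this is my own snippet, not part of the Stable Diffusion instructions; the versions in the comments are only what I expect conda to have resolved):

import torch
import xformers

print("torch:", torch.__version__)                    # expecting a 2.x build with cu117
print("CUDA available:", torch.cuda.is_available())   # should be True on the Paperspace GPU
print("xformers:", xformers.__version__)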
Problem: there seems to be a dtype mismatch in the file "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/functional.py": the attention mask passed to scaled_dot_product_attention is float while the query tensor is bfloat16.
Logs:
python scripts/txt2img.py --prompt "a professional photograph of an astronaut riding a horse" --ckpt model/768-v-ema.ckpt --config configs/stable-diffusion/v2-inference-v.yaml --H 768 --W 768
Global seed set to 42
Loading model from model/768-v-ema.ckpt
Global Step: 140000
LatentDiffusion: Running in v-prediction mode
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is None and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 1280, context_dim is 1024 and using 20 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
DiffusionWrapper has 865.91 M params.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Creating invisible watermark encoder (see https://github.com/ShieldMnt/invisible-watermark)...
Sampling: 0%| | 0/3 [00:00<?, ?it/s---------/////// | 0/1 [00:00<?, ?it/s]
*********
<class 'torch.Tensor'>
*********
tensor([[[[0., -inf, -inf, ..., -inf, -inf, -inf],
[0., 0., -inf, ..., -inf, -inf, -inf],
[0., 0., 0., ..., -inf, -inf, -inf],
...,
[0., 0., 0., ..., 0., -inf, -inf],
[0., 0., 0., ..., 0., 0., -inf],
[0., 0., 0., ..., 0., 0., 0.]]]])
*********
data: 0%| | 0/1 [00:00<?, ?it/s]
Sampling: 0%| | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/paperspace/latent-diffusion/stablediffusion/scripts/txt2img.py", line 388, in <module>
main(opt)
File "/home/paperspace/latent-diffusion/stablediffusion/scripts/txt2img.py", line 342, in main
uc = model.get_learned_conditioning(batch_size * [""])
File "/home/paperspace/latent-diffusion/stablediffusion/ldm/models/diffusion/ddpm.py", line 665, in get_learned_conditioning
c = self.cond_stage_model.encode(c)
File "/home/paperspace/latent-diffusion/stablediffusion/ldm/modules/encoders/modules.py", line 193, in encode
return self(text)
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/paperspace/latent-diffusion/stablediffusion/ldm/modules/encoders/modules.py", line 170, in forward
z = self.encode_with_transformer(tokens.to(self.device))
File "/home/paperspace/latent-diffusion/stablediffusion/ldm/modules/encoders/modules.py", line 177, in encode_with_transformer
x = self.text_transformer_forward(x, attn_mask=self.model.attn_mask)
File "/home/paperspace/latent-diffusion/stablediffusion/ldm/modules/encoders/modules.py", line 189, in text_transformer_forward
x = r(x, attn_mask=attn_mask)
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/open_clip/transformer.py", line 242, in forward
x = q_x + self.ls_1(self.attention(q_x=self.ln_1(q_x), k_x=k_x, v_x=v_x, attn_mask=attn_mask))
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/open_clip/transformer.py", line 228, in attention
return self.attn(
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1189, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/functional.py", line 5340, in multi_head_attention_forward
attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
RuntimeError: Expected attn_mask dtype to be bool or to match query dtype, but got attn_mask.dtype: float and query.dtype: c10::BFloat16 instead.
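As far as I can tell, the failure is not specific to Stable Diffusion: scaled_dot_product_attention rejects a floating-point mask whose dtype does not match the query. A minimal snippet that I believe reproduces the same RuntimeError (my own test, assuming PyTorch 2.x):

import torch
import torch.nn.functional as F

# bfloat16 query/key/value, like the ones produced by the text encoder above
q = torch.randn(1, 1, 8, 16, dtype=torch.bfloat16)
k = torch.randn(1, 1, 8, 16, dtype=torch.bfloat16)
v = torch.randn(1, 1, 8, 16, dtype=torch.bfloat16)

# float32 causal mask, shaped like the tensor printed above (0 on/below the diagonal, -inf above)
mask = torch.zeros(8, 8).masked_fill(torch.triu(torch.ones(8, 8, dtype=torch.bool), 1), float("-inf"))

# I expect this call to raise:
# RuntimeError: Expected attn_mask dtype to be bool or to match query dtype ...
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)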
Modification of the file "/home/paperspace/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/functional.py" (excerpt around the failing call, with the debug prints I added to inspect the mask):
if attn_mask is not None:
    if attn_mask.size(0) == 1 and attn_mask.dim() == 3:
        attn_mask = attn_mask.unsqueeze(0)
        print("---------///////")  # debug print I added
    else:
        attn_mask = attn_mask.view(bsz, num_heads, -1, src_len)
        print("+++++++////////")  # debug print I added
q = q.view(bsz, num_heads, tgt_len, head_dim)
k = k.view(bsz, num_heads, src_len, head_dim)
v = v.view(bsz, num_heads, src_len, head_dim)
print("*********")  # debug prints I added to inspect the mask just before the failing call
print(type(attn_mask))
print("*********")
print(attn_mask)
print("*********")