
I'm trying to train a DreamBooth model. The first attempt with this dataset went fine.

I'm now getting the following error. The only thing I changed was increasing the steps from 2180 to 3400.

It uploads and saves all the files, then...
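In the notebook's launch cell that change is just the max-steps argument, roughly like this (the flag name is taken from the train_dreambooth.py script my notebook uses, so treat the exact spelling as approximate):

!accelerate launch train_dreambooth.py \
    --max_train_steps=3400 \
    ... (all other arguments exactly as in the run that worked at 2180)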


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Install xformers
2023-04-20 20:28:52.979827: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.9/dist-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of  Accelerate. Use `project_dir` instead.
  warnings.warn(

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link
================================================================================
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:105: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  warn(
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
  warn(
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8013'), PosixPath('http'), PosixPath('//172.28.0.1')}
  warn(
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-t4-s-2iz70up4e0mon --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true'), PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https')}
  warn(
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  warn(
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/paths.py:27: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  warn(
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/usr/local/lib/python3.9/dist-packages/diffusers/configuration_utils.py:203: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Caching latents: 100% 50/50 [00:12<00:00,  4.02it/s]
04/20/2023 20:29:29 - INFO - __main__ - ***** Running training *****
04/20/2023 20:29:29 - INFO - __main__ -   Num examples = 50
04/20/2023 20:29:29 - INFO - __main__ -   Num batches each epoch = 50
04/20/2023 20:29:29 - INFO - __main__ -   Num Epochs = 68
04/20/2023 20:29:29 - INFO - __main__ -   Instantaneous batch size per device = 1
04/20/2023 20:29:29 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
04/20/2023 20:29:29 - INFO - __main__ -   Gradient Accumulation steps = 1
04/20/2023 20:29:29 - INFO - __main__ -   Total optimization steps = 3400
Steps:   0% 0/3400 [00:00<?, ?it/s]/usr/local/lib/python3.9/dist-packages/xformers/ops/fmha/flash.py:338: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/train_dreambooth.py:869 in <module>                                 │
│                                                                              │
│   866                                                                        │
│   867 if __name__ == "__main__":                                             │
│   868 │   args = parse_args()                                                │
│ ❱ 869 │   main(args)                                                         │
│   870                                                                        │
│                                                                              │
│ /content/train_dreambooth.py:841 in main                                     │
│                                                                              │
│   838 │   │   │   │   #         else unet.parameters()                       │
│   839 │   │   │   │   #     )                                                │
│   840 │   │   │   │   #     accelerator.clip_grad_norm_(params_to_clip, args │
│ ❱ 841 │   │   │   │   optimizer.step()                                       │
│   842 │   │   │   │   lr_scheduler.step()                                    │
│   843 │   │   │   │   optimizer.zero_grad(set_to_none=True)                  │
│   844 │   │   │   │   loss_avg.update(loss.detach_(), bsz)                   │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/accelerate/optimizer.py:134 in step   │
│                                                                              │
│   131 │   │   │   │   xm.optimizer_step(self.optimizer, optimizer_args=optim │
│   132 │   │   │   elif self.scaler is not None:                              │
│   133 │   │   │   │   scale_before = self.scaler.get_scale()                 │
│ ❱ 134 │   │   │   │   self.scaler.step(self.optimizer, closure)              │
│   135 │   │   │   │   self.scaler.update()                                   │
│   136 │   │   │   │   scale_after = self.scaler.get_scale()                  │
│   137 │   │   │   │   # If we reduced the loss scale, it means the optimizer │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:370 in  │
│ step                                                                         │
│                                                                              │
│   367 │   │                                                                  │
│   368 │   │   assert len(optimizer_state["found_inf_per_device"]) > 0, "No i │
│   369 │   │                                                                  │
│ ❱ 370 │   │   retval = self._maybe_opt_step(optimizer, optimizer_state, *arg │
│   371 │   │                                                                  │
│   372 │   │   optimizer_state["stage"] = OptState.STEPPED                    │
│   373                                                                        │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/cuda/amp/grad_scaler.py:290 in  │
│ _maybe_opt_step                                                              │
│                                                                              │
│   287 │   def _maybe_opt_step(self, optimizer, optimizer_state, *args, **kwa │
│   288 │   │   retval = None                                                  │
│   289 │   │   if not sum(v.item() for v in optimizer_state["found_inf_per_de │
│ ❱ 290 │   │   │   retval = optimizer.step(*args, **kwargs)                   │
│   291 │   │   return retval                                                  │
│   292 │                                                                      │
│   293 │   def step(self, optimizer, *args, **kwargs):                        │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/optim/lr_scheduler.py:69 in     │
│ wrapper                                                                      │
│                                                                              │
│     66 │   │   │   │   instance = instance_ref()                             │
│     67 │   │   │   │   instance._step_count += 1                             │
│     68 │   │   │   │   wrapped = func.__get__(instance, cls)                 │
│ ❱   69 │   │   │   │   return wrapped(*args, **kwargs)                       │
│     70 │   │   │                                                             │
│     71 │   │   │   # Note that the returned function here is no longer a bou │
│     72 │   │   │   # so attributes like `__func__` and `__self__` no longer  │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/optim/optimizer.py:280 in       │
│ wrapper                                                                      │
│                                                                              │
│   277 │   │   │   │   │   │   │   raise RuntimeError(f"{func} must return No │
│   278 │   │   │   │   │   │   │   │   │   │   │      f"but got {result}.")   │
│   279 │   │   │   │                                                          │
│ ❱ 280 │   │   │   │   out = func(*args, **kwargs)                            │
│   281 │   │   │   │   self._optimizer_step_code()                            │
│   282 │   │   │   │                                                          │
│   283 │   │   │   │   # call optimizer step post hooks                       │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py:115 in     │
│ decorate_context                                                             │
│                                                                              │
│   112 │   @functools.wraps(func)                                             │
│   113 │   def decorate_context(*args, **kwargs):                             │
│   114 │   │   with ctx_factory():                                            │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                               │
│   116 │                                                                      │
│   117 │   return decorate_context                                            │
│   118                                                                        │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/bitsandbytes/optim/optimizer.py:263   │
│ in step                                                                      │
│                                                                              │
│   260 │   │   │   │   │   continue                                           │
│   261 │   │   │   │   state = self.state[p]                                  │
│   262 │   │   │   │   if len(state) == 0:                                    │
│ ❱ 263 │   │   │   │   │   self.init_state(group, p, gindex, pindex)          │
│   264 │   │   │   │                                                          │
│   265 │   │   │   │   self.update_step(group, p, gindex, pindex)             │
│   266                                                                        │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py:115 in     │
│ decorate_context                                                             │
│                                                                              │
│   112 │   @functools.wraps(func)                                             │
│   113 │   def decorate_context(*args, **kwargs):                             │
│   114 │   │   with ctx_factory():                                            │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                               │
│   116 │                                                                      │
│   117 │   return decorate_context                                            │
│   118                                                                        │
│                                                                              │
│ /usr/local/lib/python3.9/dist-packages/bitsandbytes/optim/optimizer.py:401   │
│ in init_state                                                                │
│                                                                              │
│   398 │   │   │   )                                                          │
│   399 │   │   │   state["qmap1"] = self.name2qmap["dynamic"]                 │
│   400 │   │   │                                                              │
│ ❱ 401 │   │   │   state["state2"] = torch.zeros_like(                        │
│   402 │   │   │   │   p,                                                     │
│   403 │   │   │   │   memory_format=torch.preserve_format,                   │
│   404 │   │   │   │   dtype=torch.uint8,                                     │
╰──────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 14.75 
GiB total capacity; 8.12 GiB already allocated; 12.81 MiB free; 8.27 GiB 
reserved in total by PyTorch) If reserved memory is >> allocated memory try 
setting max_split_size_mb to avoid fragmentation.  See documentation for Memory 
Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0% 0/3400 [00:05<?, ?it/s]
Reshaping encoder.mid.attn_1.q.weight for SD format
Reshaping encoder.mid.attn_1.k.weight for SD format
Reshaping encoder.mid.attn_1.v.weight for SD format
Reshaping encoder.mid.attn_1.proj_out.weight for SD format
Reshaping decoder.mid.attn_1.q.weight for SD format
Reshaping decoder.mid.attn_1.k.weight for SD format
Reshaping decoder.mid.attn_1.v.weight for SD format
Reshaping decoder.mid.attn_1.proj_out.weight for SD format
[*] Converted ckpt saved at /content/stable_diffusion_weights/output/2180/model.ckpt
[*] WEIGHTS_DIR=/content/stable_diffusion_weights/output/2180
Dreambooth completed successfully. It took 3.2 minutes.
Model saved to /content/drive/MyDrive/Dreambooth_model/model3.ckpt
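The last lines of the OOM message suggest setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. I assume that would go in the Colab cell before training starts, something like the sketch below (the 128 MiB value is my own guess, not something the error gave me):

import os

# Cap how PyTorch's caching allocator splits blocks, as the OOM message suggests.
# This only reduces fragmentation; it doesn't add memory.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# ...then launch train_dreambooth.py as before, from the same runtime.

I'm not sure this helps when the error reports only ~13 MiB free, though.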

nvidia-smi shows

Thu Apr 20 21:28:49 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 106... WDDM  | 00000000:01:00.0  On |                  N/A |
| 49%   31C    P0    24W / 120W |    233MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1116    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      1608    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      8016    C+G   ...e\PhoneExperienceHost.exe    N/A      |
|    0   N/A  N/A     10008    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     10744    C+G   ...BYTE\AppCenter\ApCent.exe    N/A      |
|    0   N/A  N/A     10876    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     11568    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A     12356    C+G   ...oft\OneDrive\OneDrive.exe    N/A      |
|    0   N/A  N/A     13052    C+G   ...batNotificationClient.exe    N/A      |
|    0   N/A  N/A     13192    C+G   ...IR iCUE Software\iCUE.exe    N/A      |
+-----------------------------------------------------------------------------+

Windows Task Manager also shows low GPU usage.
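The nvidia-smi and Task Manager readings above are from my local Windows machine; the training itself runs on the Colab GPU. To see what the Colab GPU actually reports, I assume a cell like this would show it (torch.cuda.mem_get_info returns free and total bytes):

import torch

# Name and free/total memory of the GPU visible to this Colab runtime
print(torch.cuda.get_device_name(0))
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free: {free_bytes / 1024**3:.2f} GiB / total: {total_bytes / 1024**3:.2f} GiB")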
