I am running this code (github_code) on a computer with an RTX 3090 Ti. However, it raises an error at the first forward layer, although the same code runs successfully on the CPU. The stack trace:

Traceback (most recent call last):
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 322, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 136, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/tekre/.vscode/extensions/ms-python.python-2022.18.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/home/tekre/Desktop/video_captioning_studies/HMN/main.py", line 37, in <module>
    model = train_fn(cfgs, cfgs.model_name, model, hungary_matcher, train_loader, valid_loader, device)
  File "/home/tekre/Desktop/video_captioning_studies/HMN/train.py", line 66, in train_fn
    preds, objects_pending, action_pending, video_pending = model(objects, object_masks, feature2ds, feature3ds, numberic_caps)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 95, in forward
    objects_feats, action_feats, video_feats, objects_semantics, action_semantics, video_semantics = self.forward_encoder(objects_feats, objects_mask, feature2ds, feature3ds)
  File "/home/tekre/Desktop/video_captioning_studies/HMN/models/caption_models/hierarchical_model.py", line 57, in forward_encoder
    objects_feats, objects_semantics = self.entity_level(feature2ds, feature3ds, objects, objects_mask)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tekre/Desktop/video_captioning_studies/HMN/models/encoders/entity_level.py", line 53, in forward
    features_2d = self.feature2d_proj(features_2d.view(-1, features_2d.shape[-1]))
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/tekre/miniconda3/envs/hmn_env/lib/python3.7/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I installed my environment as the GitHub repo instructed. Do I need to install a cuDNN package separately, or does PyTorch already handle that within the environment? I am asking here because there has not been much response on the repo.

talonmies
tealy
  • I don’t believe that code can work with your GPU without a lot of work. The GitHub site says you need PyTorch 1.4. The maximum CUDA version for official releases of that PyTorch version is CUDA 10.2, which lacks any binary support for your GPU architecture. Your GPU is too new to be supported. – talonmies Nov 03 '22 at 13:27
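
To check the point in the comment above on your own machine, here is a minimal sketch that compares the GPU's compute capability with the architectures a PyTorch build was compiled for. The RTX 3090 Ti has compute capability 8.6 (sm_86). The helper names `parse_arch`, `build_can_run_on` and `report` are made up for this example, and the matching rule is the general CUDA one (a binary for the same major version with an equal or lower minor version, or PTX for an equal or older architecture), so treat the result as a hint rather than a guarantee. It also says nothing about the cuDNN library bundled with the build. `torch.cuda.get_arch_list()` may not exist in releases as old as 1.4, which is why the sketch checks for it first.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_86' or 'compute_75' into (kind, major, minor)."""
    match = re.match(r"(sm|compute)_(\d+)", tag)
    if match is None:
        return None
    number = int(match.group(2))
    return match.group(1), number // 10, number % 10


def build_can_run_on(capability, arch_list):
    """Rough check: can a build compiled for arch_list run on this capability?"""
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, arch_major, arch_minor = parsed
        # Compiled binaries run on GPUs with the same major version
        # and an equal or higher minor version.
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True
        # PTX can be JIT-compiled for the same or a newer architecture.
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True
    return False


def report():
    """Print the check for GPU 0, if torch and a CUDA device are present."""
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return
    if not torch.cuda.is_available() or not hasattr(torch.cuda, "get_arch_list"):
        print("no CUDA device, or this torch release has no get_arch_list()")
        return
    capability = torch.cuda.get_device_capability(0)
    archs = torch.cuda.get_arch_list()
    print("GPU capability:", capability)
    print("build architectures:", archs)
    print("looks runnable:", build_can_run_on(capability, archs))


if __name__ == "__main__":
    report()
```

For example, with an illustrative list that stops at sm_75 (not the real output of any particular build), `build_can_run_on((8, 6), ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75"])` returns `False`.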

2 Answers

In my case, I was running out of GPU memory when execution reached the loss.backward() call. I was running an LSTM architecture. As far as I know, PyTorch does not currently report this situation as a distinct out-of-memory error. I checked GPU usage with nvidia-smi and noticed a spike just before the RuntimeError.

To further debug the issue, please see the discussion here.
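
If you want to watch for the same memory spike from inside the training script, the following is a minimal sketch. It assumes a CUDA build of PyTorch, and the helper names `memory_report` and `log_gpu_memory` are made up for this example. Note that `torch.cuda.memory_allocated` only counts tensors allocated by PyTorch in the current process, so the numbers will usually be lower than what nvidia-smi shows.

```python
def memory_report(allocated_bytes, peak_bytes, total_bytes):
    """Format one line summarising GPU memory use in MiB."""
    mib = 1024 ** 2
    return (
        "allocated {:.0f} MiB, peak {:.0f} MiB, total {:.0f} MiB "
        "({:.0%} of total at peak)"
    ).format(
        allocated_bytes / mib,
        peak_bytes / mib,
        total_bytes / mib,
        peak_bytes / total_bytes,
    )


def log_gpu_memory(label, device=0):
    """Print current and peak memory held by PyTorch tensors on one GPU."""
    import torch  # imported here so the formatting helper works without torch

    if not torch.cuda.is_available():
        print(label, "- no CUDA device available")
        return
    allocated = torch.cuda.memory_allocated(device)
    peak = torch.cuda.max_memory_allocated(device)
    total = torch.cuda.get_device_properties(device).total_memory
    print(label, "-", memory_report(allocated, peak, total))
```

Calling `log_gpu_memory("before backward")` right before `loss.backward()` on each iteration should show whether usage is creeping toward the card's total before the error appears.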

user3503711

In my case, this error was caused by a mismatch between the version of CUDA I was using (11.7) and the version of CUDA that PyTorch was installed to work with.

After installing the correct version from the PyTorch installation page, I was able to run the code.
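
To see which CUDA version a PyTorch install was built against, a minimal sketch follows. The helper names `major_minor`, `cuda_versions_match` and `describe_torch_build` are made up for this example, and comparing only the major and minor numbers is an assumption about what counts as a match. The value to compare against (for instance the toolkit version reported on your system) has to be supplied by you.

```python
def major_minor(version):
    """Turn a version string such as '11.7.1' into the tuple (11, 7)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def cuda_versions_match(build_version, system_version):
    """Compare two CUDA versions on their major and minor numbers only."""
    return major_minor(build_version) == major_minor(system_version)


def describe_torch_build():
    """Collect version details from the installed torch, or None if absent."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for a CPU-only build
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    info = describe_torch_build()
    print(info if info is not None else "torch is not installed")
```

With the versions mentioned in this thread, `cuda_versions_match("10.2", "11.7")` returns `False`.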

ijuneja