PyTorch fails with CUDA error: device-side assert triggered on Colab

Question

I am trying to initialize a tensor on Google Colab with GPU enabled.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

t = torch.tensor([1,2], device=device)

But I am getting this strange error.

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

Even setting that environment variable to 1 does not seem to show any further details.
Has anyone had this issue before?

You should factory reset your notebook and then try again. – Sudhanshu Jun 29 '21 at 12:09

score 81 · Accepted Answer · edited Aug 31 '21 at 12:46

I ran your code and it did not raise an error for me. In general, though, the best way to debug a CUDA runtime error such as "device-side assert triggered" is to switch Colab to CPU and reproduce the error there. The CPU traceback is usually much more informative.

Most of the time, these CUDA runtime errors are caused by an index mismatch, for example training a network with 10 output nodes on a dataset with 15 labels. Once the error has been triggered, you will receive it for every subsequent operation on torch tensors, which forces you to restart your notebook.

I suggest you restart your notebook, get a more accurate traceback by moving to CPU, and check the rest of your code, especially any place where you train a model on a set of targets.
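
To illustrate the CPU approach, here is a minimal sketch, assuming a classifier with 10 output nodes and a batch that contains one out-of-range label. The shapes and values are made up for illustration. On the GPU this kind of mismatch typically surfaces as the opaque device-side assert; on the CPU the loss call raises an ordinary Python exception that names the problem.

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: 4 samples, model with 10 output nodes (classes 0..9).
logits = torch.randn(4, 10)
# The last target (12) is outside the valid range, as with a 15-label dataset.
targets = torch.tensor([0, 3, 9, 12])

error_message = None
try:
    # Everything stays on the CPU, so the bounds check fails right here.
    F.cross_entropy(logits, targets)
except (IndexError, RuntimeError) as exc:
    # The exception type has varied across PyTorch versions, so catch both.
    error_message = f"{type(exc).__name__}: {exc}"

print(error_message)
```

The exact wording of the message depends on the PyTorch version, but it should point at the offending target rather than at an unrelated CUDA call.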

edited Aug 31 '21 at 12:46

thepurpleowl

answered Jun 28 '21 at 20:24

SarthakJain

Shape mismatch. It is quite a shame that torch doesn't tell you the error though – 3nomis Jun 29 '21 at 14:54
I receive this error, what is the problem? File "/home/tf/.virtualenvs/torch/lib/python3.6/site-packages/torch/nn/functional.py", line 2824, in cross_entropy return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index) RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. – fisakhan Oct 04 '21 at 14:18
Great. I switched to CPU and the error is now clear! – Amir Pourmand Jul 30 '22 at 09:48
I am sorry to resurrect this, but I am facing the same issue, and when I run on CPU the error does not happen. Is there any other procedure I can try to find out what is happening? – Nov 30 '22 at 14:09

score 6 · Answer 2 · answered Jan 28 '22 at 21:26

As the other respondents indicated, running it on CPU reveals the error. My target labels were {1,2}; I changed them to {0,1}, and that solved it for me.
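
A small pre-flight check along these lines can catch the problem before any tensor reaches the GPU. This is only a sketch: find_out_of_range is a hypothetical helper, and shifting by the minimum label assumes the labels are contiguous integers.

```python
def find_out_of_range(targets, num_classes):
    """Return the sorted distinct targets outside [0, num_classes)."""
    return sorted({t for t in targets if not 0 <= t < num_classes})

targets = [1, 2, 2, 1]   # labels start at 1, as in the case described above
num_classes = 2

bad = find_out_of_range(targets, num_classes)
print(bad)               # [2]

# Shift the labels so that they start at 0 (assumes contiguous integers).
offset = min(targets)
shifted = [t - offset for t in targets]
print(shifted)           # [0, 1, 1, 0]
print(find_out_of_range(shifted, num_classes))  # []
```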

answered Jan 28 '22 at 21:26

tschomacker

score 3 · Answer 3 · answered Jun 03 '22 at 21:38

Double-check the GPU index you are selecting. Normally it should be 0 (gpu=0) unless you have more than one GPU.
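
One way to guard against a wrong index is to compare it with the number of GPUs PyTorch can actually see. A minimal sketch, where resolve_device is a hypothetical helper and falling back to the CPU is my own design choice:

```python
import torch

def resolve_device(index=0):
    """Return cuda:<index> if that GPU exists, otherwise fall back to CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device(f"cuda:{index}")
    return torch.device("cpu")

device = resolve_device(0)
print(device, "| visible GPUs:", torch.cuda.device_count())
```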

answered Jun 03 '22 at 21:38

Hoyeol Kim

score 3 · Answer 4 · answered Jun 27 '23 at 13:13

I bumped into the same issue when using the Transformers Trainer. In my case, it was caused by a mismatch between the model's token embedding size and the tokenizer length. Here's what solved it for me:

model.resize_token_embeddings(len(tokenizer))

The mismatch was introduced when adding a pad token:

tokenizer.add_special_tokens({'pad_token': '<pad>'})

answered Jun 27 '23 at 13:13

Treetagger is a nightmare

great, thanks! same error was on LLaMA2 – germanjke Jul 31 '23 at 17:23

score 2 · Answer 5 · edited Feb 23 '22 at 07:58

1st time:

I got the same error while using the simpletransformers library to fine-tune a transformer-based model for a multi-class classification problem. simpletransformers is a library built on top of the transformers library.

I changed my labels from string representations to numbers and it worked.

2nd time:

I faced the same error again while training another transformer-based model with the transformers library, for text classification. I had 4 labels in the dataset, named 0, 1, 2, and 3, but the last layer (a linear layer) of my model class had only two neurons: nn.Linear(*, 2). I had to replace it with nn.Linear(*, 4) because I had four labels in total.
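
Both fixes can be derived from the data itself. The sketch below uses made-up sentiment labels and hypothetical variable names; it maps string labels to integer ids and computes the number of classes, which is the value the last linear layer's output size has to match.

```python
# Hypothetical string labels as they might appear in a dataset.
labels = ["Positive", "Negative", "Positive", "Negative"]

# Assign a zero-based integer id to each distinct label (sorted for stability).
label_to_id = {name: i for i, name in enumerate(sorted(set(labels)))}
encoded = [label_to_id[name] for name in labels]

# Use this as the output size of the final layer, e.g. nn.Linear(hidden, num_classes).
num_classes = len(label_to_id)

print(label_to_id)   # {'Negative': 0, 'Positive': 1}
print(encoded)       # [1, 0, 1, 0]
print(num_classes)   # 2
```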

edited Feb 23 '22 at 07:58

answered Nov 23 '21 at 17:06

Shaida Muhammad

What is a "string representation"? Do you mean one-hot vector? – Blade Dec 10 '21 at 23:38
For example, I have a sentiment analysis problem with two labels, "Positive" and "Negative". I changed my labels from "Positive" to 1 and from "Negative" to 0, in my data. This is what I mean by "changing labels from string representation to numbers." – Shaida Muhammad Dec 12 '21 at 04:30

score 2 · Answer 6 · answered Aug 02 '22 at 10:05

I had the same problem on Colab as well. If your code runs normally on device("cpu"), try deleting the current Colab runtime and starting a new one. This worked for me.

answered Aug 02 '22 at 10:05

yuchen2727

score 1 · Answer 7 · answered Apr 27 '22 at 09:01

In some cases, this error occurs because you forgot to apply a sigmoid activation before passing the logits to BCE loss.

Hope this helps :P
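
A minimal CPU sketch of this case, with made-up values: nn.BCELoss expects probabilities in [0, 1], so raw logits are rejected, while applying torch.sigmoid first (or using nn.BCEWithLogitsLoss, which applies the sigmoid internally) works.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.5, -1.0, 0.3])   # raw model outputs, not in [0, 1]
targets = torch.tensor([1.0, 0.0, 1.0])

raised = False
try:
    nn.BCELoss()(logits, targets)          # missing sigmoid
except RuntimeError as exc:
    raised = True
    print("BCELoss rejected raw logits:", exc)

# Two equivalent fixes: apply sigmoid yourself, or let the loss do it.
loss_a = nn.BCELoss()(torch.sigmoid(logits), targets)
loss_b = nn.BCEWithLogitsLoss()(logits, targets)
print(loss_a.item(), loss_b.item())
```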

answered Apr 27 '22 at 09:01

Tiffany Zhao

score 1 · Answer 8 · answered Apr 30 '22 at 11:41

I also encountered this problem and found the cause: the vocabulary size was 8000, but the embedding size (number of entries) in my model was set to 5000.
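
The same mismatch can be reproduced on the CPU in a few lines. This is a sketch with made-up token ids; the embedding_dim of 16 is an arbitrary assumption.

```python
import torch
import torch.nn as nn

vocab_size = 8000   # number of distinct token ids the tokenizer can emit
too_small = nn.Embedding(num_embeddings=5000, embedding_dim=16)

# 7999 is a valid token id for the vocabulary but not for the 5000-row table.
token_ids = torch.tensor([10, 4999, 7999])

error_message = None
try:
    too_small(token_ids)
except (IndexError, RuntimeError) as exc:
    error_message = f"{type(exc).__name__}: {exc}"
print(error_message)

# Sizing the table from the vocabulary removes the out-of-range lookup.
fixed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=16)
output_shape = tuple(fixed(token_ids).shape)
print(output_shape)  # (3, 16)
```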

answered Apr 30 '22 at 11:41

yun li

score 1 · Answer 9 · answered Jun 21 '23 at 03:52

This is an open-ended question for most people who land on this page, because the underlying issue is different in each case. In my case the error appeared on Colab when I tried to run this notebook on Colab Pro: https://colab.research.google.com/drive/1SRclU2pcgzCkVXpmhKppVbGW4UcCs5xT?usp=sharing at the supervised_finetuning_trainer.train() step.

If there's someone like me who could not move the computation to the CPU instead of the GPU (mostly because the error stack trace led through a different package like transformers, ..., all the way back to pytorch), here's the approach to get a more accurate stack trace:

https://github.com/huggingface/transformers/blob/ad78d9597b224443e9fe65a94acc8c0bc48cd039/docs/source/en/troubleshooting.md?plain=1#L110

Credits: sgugger on GitHub.
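
For readers who cannot follow the link: a common way to get a more accurate GPU stack trace is to make CUDA kernel launches synchronous, which is the option the error message itself hints at. The sketch below shows that general technique and is not a quotation of the linked document. It assumes the lines run before anything initializes CUDA, for example in the first notebook cell after a runtime restart.

```python
import os

# Must be set before torch (or any library that initializes CUDA) is imported;
# setting it after CUDA has started has no effect on the current process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# "import torch" and the rest of the training code would follow from here.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

With synchronous launches the failing kernel is reported at the call that issued it, at the cost of slower execution, so it is best removed once the bug is found.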

score 0 · Answer 10 · answered Feb 28 '22 at 21:27

I am a filthy casual coming from the VQGAN+CLIP "ai-art" community. I get this error when I already have a session running in another tab. Killing all sessions from the session manager clears it up and lets you connect with the new tab, which is nice if you have fiddled with a lot of settings you don't want to lose.

score 0 · Answer 11 · answered Apr 01 '23 at 06:10

In my case, I first ran my computations on the CPU to detect the actual issue. It turned out that my image transforms were wrong: I was applying an unnecessary transformation to my mask image.

answered Apr 01 '23 at 06:10

Mahmood Hussain

score 0 · Answer 12 · answered Aug 03 '23 at 12:32

I also ran into a similar error, and the problem was simply a label mismatch: my train set and test set had different label counts, which triggered this error.
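
A quick way to spot such a mismatch is to compare the label sets of the two splits before training. The label lists below are made up for illustration.

```python
# Hypothetical integer labels from the two splits.
train_labels = [0, 1, 2, 1, 0]
test_labels = [0, 1, 3, 2]

# Labels that appear in one split but not in the other.
unseen_in_train = sorted(set(test_labels) - set(train_labels))
missing_in_test = sorted(set(train_labels) - set(test_labels))

# The model needs an output for every label that occurs in either split.
num_classes = len(set(train_labels) | set(test_labels))

print(unseen_in_train)   # [3]
print(missing_in_test)   # []
print(num_classes)       # 4
```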

answered Aug 03 '23 at 12:32

AMAN SWARAJ
