8

This is a really strange bug. Environment: TensorFlow 1.12 + CUDA 9.0 + cuDNN 7.5 + a single RTX 2080.

Today I tried to train a YOLO v3 network on my new device with a batch size of 4. Everything went fine at the beginning: training started as usual and I could see the loss decreasing during the training process.

But at around round 35, it reported this message:

```
2020-03-20 13:52:01.404576: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-03-20 13:52:01.404908: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
```

and the training process exited.

I have tried several times. It happens randomly, maybe 30 minutes or several hours after the training process starts.

But if I change the batch size to 2, it trains successfully.

So why does this happen? If my environment were wrong or unsuitable for the RTX 2080, this bug should show up at the very beginning of training, not in the middle. All layers in my YOLO network were trainable from the start, so nothing changed during the training process. Why does it train correctly in the first rounds but fail in the middle? Why does a smaller batch size train successfully?

And what should I do now? The solutions I can think of are: 1. Compile TF 1.12 against CUDA 10 + cuDNN 7.5 and try again. 2. Maybe update TensorFlow and CUDA? Both would cost a lot of time.
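
For reference, here is a minimal sketch (TF 1.x API) of how I could cap GPU memory allocation for the training session. This is illustrative only, not my actual training code, and I don't know whether it is relevant to this crash:

```python
import tensorflow as tf

# Illustrative only: let TensorFlow grow GPU memory on demand instead of
# grabbing (almost) all of it up front, and optionally hard-cap the fraction.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.8  # optional hard cap

with tf.Session(config=config) as sess:
    # ... build/restore the YOLO graph and run the training loop here ...
    pass
```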

  • Without any look at your code, it's hard to tell what the issue is... Please provide a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) Based on your description it can be anything including issues in your code, out-of-memory errors and much more... – Elijan9 Mar 20 '20 at 10:10
  • Hi, did you find a fix for this? I am having a similar issue. I have two Titan RTXs. It typically occurs with larger batch sizes, say 128 and above. But it's intermittent, it will train for an epoch or two and then error out. I'm running Ubuntu 18.04, TensorFlow 2.2.0 (also tried 2.1.0, same issues). It seems to be related to certain layers - if I remove the two GRU layers in my model the issue goes away. – ChrisM Jun 18 '20 at 10:35
  • @ChrisM Did you figure out what the problem was? I think it has to do with the card running out of memory. When I have a large batch size it crashes at some point in training, but when the batch size is small it will train, although it takes too long, so I have to make a sacrifice for the sake of not having my PC on for like 6 hours to train. – Rajivrocks Jun 27 '20 at 11:33
  • @Rajivrocks Hi, thanks for the query. Well, after trying many things (multiple CUDA re-installs, changing TF versions, etc.) I ran a little tool called [gpu-burn](https://github.com/wilicc/gpu-burn), which indicates that one of my GPUs is faulty. I've contacted my machine vendor and am awaiting a new unit. The machine and cards were brand new, so I'm still a bit suspicious... will add an update when I get my new cards! – ChrisM Jun 27 '20 at 17:35
  • @ChrisM Do you have an update to this? Did replacing the GPUs fix the problem? – mpotma Sep 30 '20 at 21:44
  • @mpotma Thanks - yes, it turned out to be a faulty card, which our supplier replaced. It failed less than 2 weeks after we purchased it, and I would never have imagined it to be a hardware failure at that point! – ChrisM Oct 01 '20 at 14:38
  • @ChrisM how did you determine that it was a card error? – Taylr Cawte Oct 05 '20 at 13:08
  • 1
    @TaylrCawte Thanks for the question. We used [gpu-burn](https://github.com/wilicc/gpu-burn), which told us that our first card was faulty (although not in what way). Find more info on it [here](http://wili.cc/blog/gpu-burn.html). It just runs a big MatMul op, for as long as you specify. As with all programs that may stress your system, use with care. You might also get some info by running the cuda samples, though you'll have to build those. Their location depends on where your cuda toolkit is installed, which might be under `/usr/local/cuda-10.1/lib64` (it is for us, under Ubuntu 18.04). – ChrisM Oct 05 '20 at 15:56

1 Answer

0

Check whether your CUDA/cuDNN/driver versions are OK for your card: https://docs.nvidia.com/deeplearning/cudnn/support-matrix/index.html#cudnn-versions-764-765.

If the above check turns out to be OK, then this issue might be caused by a broken GPU card, as @ChrisM commented.
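
A quick way to confirm what the installed TensorFlow build actually sees (a rough sketch using TF 1.x APIs; the exact output depends on your install):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Print the TF version and whether this build was compiled with CUDA support.
print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

# This call actually loads the CUDA/cuDNN libraries, so version mismatches
# usually surface here as warnings or errors in the log.
print("GPU available:", tf.test.is_gpu_available())

# List the GPUs TensorFlow sees, including their compute capability.
for d in device_lib.list_local_devices():
    if d.device_type == "GPU":
        print(d.physical_device_desc)
```

Also cross-check the driver version shown by `nvidia-smi` against that support matrix; as far as I know, Turing cards like the RTX 2080 (compute capability 7.5) are only natively supported from CUDA 10.0 onward, so a CUDA 9.0 build is itself a likely suspect.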