
I have ~50,000 images and annotation files for training a YOLOv5 object detection model. I've trained a model with no problems using just the CPU on another computer, but it takes too long, so I need GPU training. My problem is that when I try to train with a GPU I keep getting this error:

OSError: [WinError 1455] The paging file is too small for this operation to complete

This is the command I'm executing:

python train.py --img 640 --batch 4 --epochs 100 --data myyaml.yaml --weights yolov5l.pt

CUDA and PyTorch have successfully been installed and are available. The following command installed with no errors:

pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio===0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

I've found other people online with similar issues who fixed it by changing num_workers = 8 to num_workers = 1. When I tried this, training started and seemed to get past the point where the paging-file error usually appears, but then crashed a couple of hours later. I've also increased the Windows virtual memory (paging file size) on my system as per this video (https://www.youtube.com/watch?v=Oh6dga-Oy10), but that didn't work either. I think it's a memory issue, because some of the times it crashes I get a low-memory warning from my computer.
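For context on why lowering the worker count helps: on Windows, each PyTorch DataLoader worker is spawned as a separate process, and each process maps the CUDA DLLs, which reserves a large chunk of commit memory (RAM + page file) per worker. A rough back-of-envelope sketch of how the commit charge scales with workers (the per-worker and main-process figures below are illustrative assumptions, not measured values):

```python
def estimated_commit_gb(num_workers, per_worker_gb=3.0, main_process_gb=4.0):
    """Rough commit-memory estimate: the main training process plus one
    spawned DataLoader worker process per --workers, where each worker
    maps the CUDA DLLs and so reserves its own commit memory.
    The GB figures are illustrative assumptions, not measurements."""
    return main_process_gb + num_workers * per_worker_gb

# With YOLOv5's default of 8 workers, the total commit charge can easily
# exceed 8 GB of RAM plus a default-sized page file, which is the
# condition that triggers WinError 1455:
print(estimated_commit_gb(8))  # 28.0 (GB, under these assumptions)
print(estimated_commit_gb(1))  # 7.0
```

This is also why increasing the page file size (or adding RAM) is the other half of the fix: both raise the total commit limit.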

Any help would be much appreciated.

mark_1985
  • What is your train batch size and test batch size? – yakhyo Nov 11 '21 at 09:01
  • I've tried loads of different batch sizes, ranging from 2 to 32, and still the same issue. Dropping the batch size to 2 with num_workers = 1 was the only thing that started training, but then my computer crashed after less than an hour. – mark_1985 Nov 11 '21 at 09:14
  • I think it is because there is not enough memory when it comes to validation, so your training crashes. One option is to set a smaller input size and a smaller batch size; I hope it helps. – yakhyo Nov 11 '21 at 09:47
  • Thanks yakhyo, you mean reduce --img 640? – mark_1985 Nov 11 '21 at 09:51
  • yes, you're welcome – yakhyo Nov 11 '21 at 09:52

1 Answer


So I've managed to fix my specific problem and thought posting the answer here might help someone else. Basically, I don't think I had enough RAM: I was using 8 GB before, upgraded to 32 GB, and it's working fine now.

As I wrote in the question above, I suspected it was a memory issue, and I had gotten it to work on another computer using only the CPU. I also noticed a spike in RAM usage when training started. Tim Dettmers' hardware guide also explains the importance of RAM when training deep learning models on large datasets: https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
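If you want to confirm it's host RAM (and not GPU memory) before buying an upgrade, you can log system-wide RAM usage while training runs and watch for the spike at dataloader startup or validation. A minimal sketch using the third-party psutil package (an assumption on my part, not something from the original post; install it with `pip install psutil`):

```python
import time

try:
    import psutil  # third-party; assumed installed via `pip install psutil`
except ImportError:
    psutil = None

def log_ram(interval_s=5.0, samples=3):
    """Print system-wide RAM usage a few times. Run this in a separate
    terminal alongside train.py to see whether the spike coincides with
    dataloader startup or the validation phase."""
    if psutil is None:
        print("psutil not installed -- run `pip install psutil` first")
        return []
    readings = []
    for _ in range(samples):
        vm = psutil.virtual_memory()
        used_gb = (vm.total - vm.available) / 1024**3
        readings.append(used_gb)
        print(f"RAM in use: {used_gb:.1f} GB ({vm.percent:.0f}%)")
        time.sleep(interval_s)
    return readings
```

If the readings climb toward your total RAM right before the crash, that points to the same conclusion as above: not enough physical memory for the dataset pipeline.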

Hope this can help other people with the same issue.

mark_1985