
I have a dataset of around 12k training images and 500 validation images. I am using YOLOv5 (PyTorch) to train my model. When I start training and it reaches the Caching Images stage, it suddenly quits.

The command I'm using to run this is as follows:

!python train.py --img 800 --batch 32 --epochs 20 --data '/content/data.yaml' --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results  --cache

I am using Google Colab to train my model.

This is the last output before it shuts down:

train: Caching Images (12.3GB ram): 99% 11880/12000 [00:47<00:00, 94.08it/s]


2 Answers


So I solved the above problem. The problem occurs because all the images are cached beforehand to increase speed during the epochs. This may increase speed, but it also consumes memory. Google Colab provides 12.69GB of RAM; when caching such a huge dataset, all of the RAM was consumed and there was nothing left to cache the validation set, hence it shut down immediately. There are two basic methods to solve this issue:

Method 1:

I simply reduced the image size from 800 to 640, as my training images didn't contain any small objects, so I did not actually need large images. This reduced my RAM consumption by about 50%:

--img 640

train: Caching Images (6.6GB ram): 100% 12000/12000 [00:30<00:00, 254.08it/s]
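
To see why this helps, here is a rough back-of-the-envelope estimate (my own sketch, not YOLOv5's code): if the cache stores each image as a uint8 RGB array resized so its longer side equals --img, the cache size grows with the square of the image size. The aspect ratio below is an assumption, so the numbers will only roughly match the logs above.

import psutil  # preinstalled on Colab

def cache_ram_estimate_gb(num_images, img_size, aspect_ratio=16 / 9):
    # Assumed: each cached image is a uint8 RGB array whose longer side is img_size.
    shorter_side = img_size / aspect_ratio
    bytes_per_image = img_size * shorter_side * 3  # 3 channels, 1 byte each
    return num_images * bytes_per_image / 1e9

print(f"Available RAM: {psutil.virtual_memory().available / 1e9:.1f} GB")
print(f"--img 800: ~{cache_ram_estimate_gb(12000, 800):.1f} GB of cache")
print(f"--img 640: ~{cache_ram_estimate_gb(12000, 640):.1f} GB of cache")

With numbers in this ballpark, an 800-pixel cache for 12k images simply does not fit next to everything else in Colab's 12.69GB of RAM, while a 640-pixel cache does.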

Method 2:

I had added this argument at the end of the command I use to run this project:

--cache

This flag caches the entire dataset up front so it can be reused instantly instead of being processed again every epoch. If you are willing to compromise on training speed, this method will work for you: simply remove this flag and you will be good to go. Your new command will be:

!python train.py --img 800 --batch 32 --epochs 20 --data '/content/data.yaml' --cfg ./models/custom_yolov5s.yaml --weights yolov5s.pt --name yolov5s_results

Maybe you should add "VRAM consumption" to your title, because this was the main reason your training was crashing.

Your answer is still right, though; I would like to go into more detail about why such crashes can happen for people with this kind of problem.

YOLOv5 works with image sizes that are multiples of 32. If your image size is not a multiple of 32, YOLOv5 will stretch the image in every epoch and consume a lot of VRAM that shouldn't be consumed (at least not for this).
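
To make that concrete, here is a minimal sketch (my own illustration, not YOLOv5's actual implementation) of what rounding an image size to a multiple of 32 looks like:

import math

def make_divisible(size, stride=32):
    # Round the requested image size up to the nearest multiple of the network stride.
    return math.ceil(size / stride) * stride

for size in (640, 650, 800, 1000):
    fixed = make_divisible(size)
    if fixed == size:
        print(f"--img {size}: ok, already a multiple of 32")
    else:
        print(f"--img {size}: would be adjusted to {fixed}")

Picking a size that is already a multiple of 32 (640, 672, 800, ...) avoids any extra resizing work.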

Large image sizes also consume a lot of VRAM, so even if the size is a multiple of 32, your setup or config may not be enough for this training.

The --cache flag speeds up your training, but with the downside of consuming more memory.

Batch size plays a big role in VRAM consumption. If you really want to train with a large image size, you should reduce your batch size, stepping down in powers of two.
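
As a rough rule of thumb (my own heuristic, not an official YOLOv5 formula): VRAM grows roughly linearly with batch size and quadratically with image size, so when you increase the image size you can scale the batch size down to compensate, rounding down to a power of two:

def scale_batch_size(batch, old_img, new_img):
    # Keep batch * img^2 roughly constant, then round down to a power of two.
    scaled = int(batch * (old_img / new_img) ** 2)
    return 2 ** max(0, scaled.bit_length() - 1)

print(scale_batch_size(32, 640, 800))  # batch 32 at 640 px becomes batch 16 at 800 px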

I hope this helps somebody.