Training a deep learning model on Amazon EC2 is extremely slow

Question

I am having serious speed problems training YOLOv5 on my p2.xlarge AWS EC2 instance, which has an NVIDIA Tesla K80.

I realized the training process was even slower than on my desktop PC, which has an NVIDIA RTX 2060. So I decided to run inference on some images, and these were the results:

My RTX 2060:

AWS EC2 Tesla K80:

So I decided to try a p2.8xlarge instance to train my deep learning model, and the results were similar. I then ran inference on the same images and, to my surprise, got similar results again.

AWS EC2 with 8 Tesla K80:

It is important to remember that this p2.8xlarge instance has 488 GB of RAM, 32 vCPU cores and 8 Tesla K80 GPUs, so my question is: how can this p2.8xlarge be even slower at training YOLO than my desktop PC, which has just 64 GB of RAM and 16 cores?
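One way to make a comparison like this reproducible is to time the same inference call on each machine and report images per second. The sketch below is only an illustration, not the measurement used above: `images_per_second` and `fake_infer` are hypothetical names, and the stand-in workload would need to be replaced by a real model call.

```python
import time
from typing import Callable, Iterable


def images_per_second(infer: Callable[[object], object],
                      images: Iterable[object],
                      warmup: int = 2) -> float:
    """Average throughput of `infer` over `images`, excluding warm-up calls."""
    items = list(images)
    if len(items) <= warmup:
        raise ValueError("need more images than warm-up iterations")
    # Warm-up calls absorb one-off costs such as CUDA context creation.
    for img in items[:warmup]:
        infer(img)
    timed = items[warmup:]
    start = time.perf_counter()
    for img in timed:
        infer(img)
    # Caveat: GPU work is usually asynchronous, so a real `infer` should
    # wait for the device to finish before returning.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(timed) / elapsed


if __name__ == "__main__":
    # Hypothetical stand-in workload: roughly 10 ms per "image".
    def fake_infer(_img: object) -> None:
        time.sleep(0.01)

    print(f"{images_per_second(fake_infer, range(12)):.1f} images/s")
```

Running the same script on both machines with the same images gives numbers that can be compared directly.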

Has anyone had the same problem? Is there any solution or tip you can give me, please?

In the end I trained the model on my PC, but it took too much time. Cloud environments, on the other hand, are supposed to solve these problems.

It seems I am not the only one this happens to:

score 0 · Answer 1 · answered Oct 22 '21 at 02:53


The Tesla K80 is old and doesn't have tensor cores. Training mostly occurs on the GPU, so the CPU and RAM don't affect it much. The K80 really is slower; it was useful mainly because of the FLOPS it could deliver in double-precision workloads.

The 2060 is also several architecture generations ahead (Kepler vs. Turing), so in terms of speed it's definitely going to be better.
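To check this on a given machine, the GPUs visible to PyTorch and their compute capability can be listed. This is a minimal sketch, assuming PyTorch is the framework in use (YOLOv5 is built on it); `has_tensor_cores` and `describe_visible_gpus` are hypothetical helper names. Tensor cores first appeared with compute capability 7.0 (Volta).

```python
from typing import List


def has_tensor_cores(major: int, minor: int) -> bool:
    """Tensor cores were introduced with compute capability 7.0 (Volta)."""
    return (major, minor) >= (7, 0)


def describe_visible_gpus() -> List[str]:
    """Return one line per GPU that PyTorch can see in this process."""
    try:
        import torch  # assumption: PyTorch is installed
    except ImportError:
        return ["PyTorch is not installed"]
    if not torch.cuda.is_available():
        return ["no CUDA device visible to PyTorch"]
    lines = []
    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        major, minor = torch.cuda.get_device_capability(i)
        tc = "yes" if has_tensor_cores(major, minor) else "no"
        lines.append(f"GPU {i}: {name}, capability {major}.{minor}, tensor cores: {tc}")
    return lines


if __name__ == "__main__":
    for line in describe_visible_gpus():
        print(line)
```

A Tesla K80 reports compute capability 3.7 and an RTX 2060 reports 7.5, so only the latter passes the tensor core check. The listing also shows how many devices the process can see, which is worth confirming before expecting a multi-GPU instance to be faster.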


ZWang



OK, but I am saying that my RTX 2060 is faster than 8 Tesla K80s; this doesn't make any sense – Henry Navarro Oct 22 '21 at 14:39
