
I have a Titan X Pascal, an Intel i5-6600, and 16 GB RAM, and I am running Torch7 on Ubuntu 14.04. The NVIDIA driver version is 375.20, with CUDA Toolkit 8.0 and cuDNN v5.1.

I ran the same test with the same VGG16 network from Caffe (imported via loadcaffe) as in this benchmark. However, my setup needs 80 ms for a forward pass, which is twice the time it apparently takes in the benchmark.

For the test I generated a batch of 16 images with 3 channels of size 224x224. The relevant code is:

 require 'cutorch'
 require 'cudnn'
 require 'loadcaffe'

 local model = loadcaffe.load("/home/.../Models/VGG16/VGG_ILSVRC_16_layers_deploy.prototxt",
                              "/home/.../Models/VGG16/VGG_ILSVRC_16_layers.caffemodel",
                              "cudnn")

 for i = 1, 50 do
   local input = torch.randn(16, 3, 224, 224):type("torch.CudaTensor")

   cutorch.synchronize()     -- wait until the input has finished copying to the GPU
   local timer = torch.Timer()

   model:forward(input)
   cutorch.synchronize()     -- wait until the forward pass has finished

   local deltaT = timer:time().real
   print("Forward time: " .. deltaT)
 end

The output is:

 Forward time: 0.96536016464233
 Forward time: 0.10063600540161
 Forward time: 0.096444129943848
 Forward time: 0.089151859283447
 Forward time: 0.082037925720215
 Forward time: 0.082045078277588
 Forward time: 0.079913139343262
 Forward time: 0.080273866653442
 Forward time: 0.080694913864136
 Forward time: 0.082727193832397
 Forward time: 0.082070827484131
 Forward time: 0.079407930374146
 Forward time: 0.080456018447876
 Forward time: 0.083559989929199
 Forward time: 0.082060098648071
 Forward time: 0.081624984741211
 Forward time: 0.080413103103638
 Forward time: 0.083755016326904
 Forward time: 0.083209037780762
 ...

Do I have to do anything additional to get that speed, or am I doing something wrong here? Or could it be because I am using Ubuntu 14.04 instead of Ubuntu 16.04 (although in the benchmark a GTX 1080 running on Ubuntu 14.04 also needs only 60 ms)?

  • Why two (if any) calls to cutorch.synchronize? To my knowledge this is relevant only when parallelizing GPUs. – Elad663 Dec 29 '16 at 03:50
  • Also, each time you generate the data, you make a copy to the GPU (via the CUDA tensor). A more efficient solution is to generate it all, copy everything to the GPU once, and then do SGD or mini-batches; see the sketch after these comments. You've got 12 GB on that card, enjoy it. – Elad663 Dec 29 '16 at 03:52
  • CUDA queues jobs on the GPU. So to get a valid timing result, I call synchronize() the first time to wait for the tensor to finish copying to the GPU. Then I start the timer, queue the forward job, and with the second synchronize() I wait for the forward() step to finish. – Marcel_marcel1991 Dec 29 '16 at 12:12
  • I know that it is not efficient to reallocate new memory on the GPU in each step. This is just a little code snippet to test the forward time, not my actual code. I also use batch size 16 only to have a fair comparison with the benchmark. – Marcel_marcel1991 Dec 29 '16 at 12:15
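To illustrate the preallocation suggestion from the comments, here is a minimal sketch that allocates the input on the GPU once and refills it in place, so no new CUDA tensor (and no host-to-device copy) is created inside the timed loop. The in-place normal() fill is an assumption about how the random batch would be regenerated on the GPU; model is the loadcaffe network from the question.

 require 'cutorch'

 -- Allocate the batch on the GPU once, outside the loop.
 local input = torch.CudaTensor(16, 3, 224, 224)

 for i = 1, 50 do
   input:normal()            -- refill in place on the GPU, no host copy
   cutorch.synchronize()     -- make sure the fill has finished
   local timer = torch.Timer()

   model:forward(input)
   cutorch.synchronize()     -- wait until the forward pass has finished

   print("Forward time: " .. timer:time().real)
 end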

1 Answer


I finally found the solution.

I had to enable the cudnn.benchmark flag:

cudnn.benchmark = true

By default it is set to false, so cuDNN does not benchmark the available convolution algorithms and pick the fastest one. With the flag enabled, my forward time is now about 39 ms.
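For completeness, a minimal sketch of where the flag goes, assuming the same loadcaffe setup as in the question: set it after requiring cudnn, before running the model.

 require 'cutorch'
 require 'cudnn'
 require 'loadcaffe'

 -- Enable cuDNN's auto-tuner: on the first forward pass it benchmarks the
 -- available convolution algorithms for each layer configuration and caches
 -- the fastest one for subsequent passes.
 cudnn.benchmark = true

 local model = loadcaffe.load("/home/.../Models/VGG16/VGG_ILSVRC_16_layers_deploy.prototxt",
                              "/home/.../Models/VGG16/VGG_ILSVRC_16_layers.caffemodel",
                              "cudnn")

Note that the first timed iteration becomes slower, since that is when the benchmarking runs; the steady-state timings of the later iterations are the ones to compare against the benchmark.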