
Consider the following network:

%%time
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim

class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 400) #a,b,c,d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x

nx = net_x()

#input
val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)) #reshape for batch

#method 
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)

This outputs:

CPU times: user 11.1 s, sys: 3.52 ms, total: 11.1 s
Wall time: 11.1 s

However, when I switch to the GPU by moving the model and input with .to(device):

%%time
import torch
from torch.autograd import grad
import torch.nn as nn
import torch.optim as optim
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class net_x(nn.Module):
    def __init__(self):
        super(net_x, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 20)
        self.out = nn.Linear(20, 400) #a,b,c,d

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        x = self.out(x)
        return x

nx = net_x()
nx.to(device)
#input
val = 100
t = torch.rand(val, requires_grad = True) #input vector
t = torch.reshape(t, (val,1)).to(device) #reshape for batch

#method 
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)

This outputs:

CPU times: user 18.6 s, sys: 1.5 s, total: 20.1 s
Wall time: 19.5 s
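
One caveat on these numbers: CUDA kernels are launched asynchronously, and the first CUDA call also pays one-time initialization costs, so a cell-level %%time can be misleading. A minimal sketch of a more controlled measurement, assuming the nx, t, and device objects defined above and that a CUDA device is in use:

import time

_ = nx(t)                    # warm-up: the first CUDA call pays one-time initialization costs
torch.cuda.synchronize()     # make sure the warm-up work has finished

start = time.perf_counter()
dx = torch.autograd.functional.jacobian(lambda t_: nx(t_), t)
torch.cuda.synchronize()     # wait for all queued kernels before reading the clock
print(f"jacobian on {device}: {time.perf_counter() - start:.2f} s")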

Update 1: Timing just the step of moving the model and input to the device:

%%time
nx.to(device)
t.to(device)

This outputs:

CPU times: user 2.05 ms, sys: 0 ns, total: 2.05 ms
Wall time: 2.13 ms
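
Note that t.to(device) returns a new tensor rather than moving t in place, so the cell above times the copy but then discards the result. A minimal sketch of timing the host-to-device copy while keeping the result and synchronizing afterwards (the t_cpu / t_gpu names are just for illustration):

import time

t_cpu = torch.rand(val, 1, requires_grad=True)   # fresh input on the CPU

torch.cuda.synchronize()
start = time.perf_counter()
t_gpu = t_cpu.to(device)        # .to() returns a new tensor; t_cpu itself stays on the CPU
torch.cuda.synchronize()        # ensure the copy has actually completed
print(f"host-to-device copy: {time.perf_counter() - start:.4f} s")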

Update 2: Looks like @Gulzar was right. I increased the batch size to 1000 (val = 1000); now the CPU run gives Wall time: 8min 44s, while the GPU run gives Wall time: 3min 12s.

  • Not sure if it applies here, but could be due to the added cost of copying data from CPU to GPU. Perhaps you can check the time taken for the copy step separately as well. The benefits of GPUs outweigh this copy cost at large scales, generally. – GoodDeeds May 11 '21 at 14:58
  • Good point. I checked (see my update) – Penguin May 11 '21 at 15:00
  • 1
    What's your batch size? The data is very small, so probably no much benefit in GPU parallelization for small batches. – Gulzar May 11 '21 at 15:07
  • @Gulzar Yea you were right. See my update 2 – Penguin May 11 '21 at 15:31
  • This has been asked hundreds of times (just see https://stackoverflow.com/search?q=gpu+slow+cpu ), there is no need for new questions about this topic. My favorite answer is here: https://stackoverflow.com/questions/55749899/training-a-simple-model-in-tensorflow-gpu-slower-than-cpu – Dr. Snoopy May 11 '21 at 17:23

1 Answer


A hand-wavy answer

GPUs are "weaker" computers, but with many more compute cores than CPUs.
Data has to be copied from RAM to GPU memory (VRAM) in a "costly" manner, every once in a while, so they can process it.

If the data is "large" and the processing can be parallelized over that data, the computation is likely to be faster on the GPU.

If the data is not "big enough", the cost of transferring it, or the cost of using weaker cores and synchronizing them, can outweigh the benefit of parallelization.


When will a GPU be useful?

  1. For larger networks, or for heavier computations such as convolutions or larger fully connected layers (larger matrix multiplications).
  2. For larger batches - batches are a very easy way to parallelize computation, as they are (almost*) independent. *Almost, as they do need to be synchronized programmatically at some point. A rough sketch of how to check this is shown after this list.
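
A rough sketch of how one might check point 2 with the net_x model from the question, sweeping the batch size on both devices (assumes a CUDA device is available; the benchmark helper is just for illustration, not part of any library):

import time
import torch

def benchmark(batch_size, device):
    # hypothetical helper: time the full Jacobian for one batch on the given device
    model = net_x().to(device)
    x = torch.rand(batch_size, 1, requires_grad=True, device=device)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    torch.autograd.functional.jacobian(lambda x_: model(x_), x)
    if device.type == 'cuda':
        torch.cuda.synchronize()
    return time.perf_counter() - start

for bs in (10, 100, 1000):
    cpu_s = benchmark(bs, torch.device('cpu'))
    gpu_s = benchmark(bs, torch.device('cuda'))
    print(f"batch {bs}: CPU {cpu_s:.1f} s vs GPU {gpu_s:.1f} s")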
– Gulzar