
I am training a very simple inception block followed by a maxpool and a fully-connected layer on an NVIDIA GeForce RTX 2070 GPU, and it is taking a very long time per iteration: it has just finished 10 iterations in more than 24 hours.

Here is the code for the inception model definition:

import torch
import torch.nn as nn

# BasicConv2d is a small custom conv block (defined elsewhere in my code, not shown here).


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        input_channels = 3
        conv_block = BasicConv2d(input_channels, 64, kernel_size=1)
        self.branch3x3stack = nn.Sequential(
            BasicConv2d(input_channels, 64, kernel_size=1),
            BasicConv2d(64, 96, kernel_size=3, padding=1),
            BasicConv2d(96, 96, kernel_size=3, padding=1),
        )

        self.branch3x3 = nn.Sequential(
            BasicConv2d(input_channels, 64, kernel_size=1),
            BasicConv2d(64, 96, kernel_size=3, padding=1),
        )

        self.branch1x1 = BasicConv2d(input_channels, 96, kernel_size=1)

        self.branchpool = nn.Sequential(
            nn.AvgPool2d(kernel_size=3, stride=1, padding=1),
            BasicConv2d(input_channels, 96, kernel_size=1),
        )
        self.maxpool = nn.MaxPool2d(kernel_size=8, stride=8)
        self.fc_seqn = nn.Sequential(nn.Linear(301056, 1))

    def forward(self, x):
        x = [
            self.branch3x3stack(x),
            self.branch3x3(x),
            self.branch1x1(x),
            self.branchpool(x),
        ]
        x = torch.cat(x, 1)
        x = self.maxpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc_seqn(x)
        return x  # torch.cat(x, 1)

and here is the training code I used:

# log_step, save_model_step and path_saved_model are settings defined elsewhere in my code.
def training_code(self, model):
    model = copy.deepcopy(model)
    model = model.to(self.device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=self.learning_rate)
    for epoch in range(self.epochs):
        print("\n epoch :", epoch)
        running_loss = 0.0
        start_epoch = time.time()
        for i, (inputs, labels) in enumerate(self.train_data_loader):
            inputs = inputs.to(self.device)
            labels = labels.to(self.device)
            optimizer.zero_grad()
            outputs = (model(inputs)).squeeze()
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            if i % log_step == log_step - 1:
                avg_loss = running_loss / float(log_step)
                print("[epoch:%d, batch:%5d] loss: %.3f" % (epoch + 1, i + 1, avg_loss))
                running_loss = 0.0
        if epoch % save_model_step == 0:
            self.path_saved_model = path_saved_model + "_" + str(epoch)
            self.save_model(model)
    print("\nFinished Training\n")
    self.save_model(model)

Here is the summary of the convolutional neural network. Can anyone help me figure out how to speed up the training?

[image: summary of the convolutional neural network]

Here is the summary of the convolutional neural network after reducing the number of inputs to the fully-connected (FC) layer. The training time is still very high, similar to before. Any suggestion on how to speed up the training would be helpful.

[image: summary of the modified CNN]

MSD Paul
  • What's your batch size? Also, `self.fc_seqn = nn.Sequential(nn.Linear(301056, 1))` is a ridiculously large input size. You should use `torch.nn.AdaptiveMaxPool2d` instead (pooling each channel down to a single value), which would give you 96 features per branch (384 after concatenation) as input to `nn.Linear`. – Szymon Maszke Aug 04 '20 at 12:20
  • @SzymonMaszke thanks a lot for the reply. In my case, the batch size is 16. Other than the pooling change, is there anything else I can do to speed up the training process? – MSD Paul Aug 04 '20 at 18:08
  • There is a plethora of things, like distributing across multiple GPUs, using the cloud for training, etc. It also depends on how many data samples you have and what they are (I suppose images, but what is their resolution?). I will post an answer; it will be easier to go on from there. – Szymon Maszke Aug 04 '20 at 18:19

1 Answer


Architecture

Your nn.Linear has a huge input which goes to only one neuron for regression.

You should use torch.nn.AdaptiveAvgPool2d like this:

class Net(nn.Module):
    def __init__(self):
        ... # your stuff before
        self.pool = nn.AdaptiveAvgPool2d(output_size=1)  # global average pool: one value per channel
        self.fc_seqn = nn.Sequential(nn.Linear(96 * 4, 1))  # 4 branches x 96 channels = 384 features

    def forward(self, x):
        x = [
            self.branch3x3stack(x),
            self.branch3x3(x),
            self.branch1x1(x),
            self.branchpool(x),
        ]
        x = torch.cat(x, 1)
        x = torch.squeeze(self.pool(x))  # (N, 384, 1, 1) -> (N, 384); also drops N when the batch has one sample
        if len(x.shape) == 1:            # re-add the batch dimension for a single-sample batch
            x = torch.unsqueeze(x, dim=0)
        x = self.fc_seqn(x)
        return x

It should also help with your task, as the network won't be as over-parametrized.
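
As a quick sanity check, here is a minimal sketch (assuming BasicConv2d from your question is available and, purely as an example, 3x224x224 inputs) showing that the FC layer now sees 384 features instead of 301056:

net = Net()
dummy = torch.randn(2, 3, 224, 224)  # hypothetical batch of 2 RGB 224x224 images
out = net(dummy)
print(out.shape)  # expected: torch.Size([2, 1])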

Data

Depending on the size of your images, you may want to crop them (or random-crop them) to a lower resolution using torchvision.transforms.
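
For example, a minimal sketch (the 224x224 crop size and the flip are assumptions, not values from your setup):

import torchvision.transforms as T

# Hypothetical preprocessing: random-crop training images to a smaller resolution
# so each forward/backward pass has fewer pixels to process.
train_transform = T.Compose([
    T.RandomCrop(224),          # assumed target size; pick one that fits your images
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
# pass train_transform to whatever Dataset feeds your train_data_loader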

If you can, you could also use torch.nn.parallel.DistributedDataParallel in order to parallelize across multiple GPUs, but I suppose that's not what you are after.

You could also cache some data after loading the images (see the torchdata project; disclaimer: I'm the author).
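
If you would rather not add a dependency, a hand-rolled cache around your existing Dataset is a minimal sketch of the same idea (it keeps loaded samples in RAM, so it only pays off if the dataset fits in memory, and with num_workers > 0 each worker holds its own copy):

from torch.utils.data import Dataset

class CachedDataset(Dataset):
    """Wraps any map-style Dataset and caches loaded samples in memory."""

    def __init__(self, dataset):
        self.dataset = dataset
        self.cache = {}

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        if index not in self.cache:
            self.cache[index] = self.dataset[index]  # load/transform once, reuse afterwards
        return self.cache[index]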

Also, torch.utils.data.DataLoader with the num_workers argument set to a value larger than 1 should speed up image loading (I usually go with the number of cores my machine has, though it needs some experimentation).
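
For instance (train_dataset is a placeholder for your own Dataset; the batch size of 16 comes from your comment above):

from torch.utils.data import DataLoader

train_data_loader = DataLoader(
    train_dataset,      # placeholder: your own Dataset instance
    batch_size=16,
    shuffle=True,
    num_workers=4,      # tune this; the number of CPU cores is a common starting point
    pin_memory=True,    # speeds up host-to-GPU copies when training on CUDA
)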

Mixed precision

You should benefit greatly from this step, as your graphics card supports Tensor Cores.

In PyTorch 1.6.0 it's pretty easy with the torch.cuda.amp package. Basically, you just have to use one context manager (torch.cuda.amp.autocast) and a gradient scaler (torch.cuda.amp.GradScaler).

This also allows you to use a larger batch size, as many operations run in float16 instead of float32 and therefore use less memory.
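
A minimal sketch of how the inner loop of your training code could be adapted (model, criterion, optimizer and the loader are the same objects as in your code):

scaler = torch.cuda.amp.GradScaler()

for inputs, labels in self.train_data_loader:
    inputs = inputs.to(self.device)
    labels = labels.to(self.device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
        outputs = model(inputs).squeeze()
        loss = criterion(outputs, labels)
    scaler.scale(loss).backward()      # scale the loss to avoid float16 gradient underflow
    scaler.step(optimizer)             # unscales gradients, then calls optimizer.step()
    scaler.update()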

Szymon Maszke
  • Thanks a lot for your helpful suggestions. I have added nn.AdaptiveMaxPool2d to reduce the input of the FC layer to 384, as you suggested, but the training time is still very high, similar to before. I have added the modified CNN architecture to my original post; please take a look and let me know what modifications I could make to remedy this. FYI, I already use num_workers=4 to load the images during training. If I changed the framework from PyTorch to something else, say TensorFlow, would training be faster for the inception block? Any suggestions would be helpful. – MSD Paul Aug 05 '20 at 15:10
  • Try mixed precision before changing framework. This should help a lot. – Szymon Maszke Aug 05 '20 at 15:12
  • Can you please share some links on training with mixed precision? – MSD Paul Aug 05 '20 at 15:15
  • You have them in my answer (last paragraph), and there are plenty of examples online as well; check the docs. – Szymon Maszke Aug 05 '20 at 15:16