
I'm attempting to train a resnet18 image classifier on a large number of crops from labeled Google Street View data. I'm following along with this tutorial. I have two datasets, one of roughly 20k images and one of roughly 100k images. Both datasets are stored in the same format, and each has been uploaded to its own Google Cloud Storage bucket. I then mounted both buckets in my VM's home directory using gcsfuse with the --implicit-dirs flag.

I then run my train.py file on my Google Compute Engine VM, which was created from the Deep Learning VM image on Google's Cloud Marketplace. The VM has a single vCPU, a single NVIDIA Tesla K80 GPU, 3.75 GB of memory, and a 100 GB persistent disk.

When I run the training script, I make no changes except pointing the data_dir variable to the correct gcsfuse-mounted directory on the VM.

When I run train.py on the 100k-crop directory, it runs relatively quickly, with a single epoch taking ~30 minutes. Watching top while it runs, I see CPU utilization is quite high, staying around 90%.

However, using the same VM, when I run train.py on the 20k-crop directory, it runs much more slowly, with a single epoch taking 6-7 hours, despite the smaller dataset. In this case, CPU utilization never rises above about 5%.

I cannot figure out what is causing the slowdown, as nothing (as far as I can tell) differs between the two runs except the datasets, which are formatted identically. I use the same PyTorch DataLoader with the same number of workers. Both GCS buckets are in us-west1, the same region as my VM instance.
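
One way I could isolate data loading from the model itself is to time how long the DataLoader takes to yield batches from each mounted directory, with no training involved. Below is just a rough sketch; the mount-point paths are placeholders standing in for my two gcsfuse mounts.

import time
import torch
from torchvision import datasets, transforms

# Minimal transform so the timing mostly reflects file reads + JPEG decode.
probe_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

def time_batches(root, n_batches=50, num_workers=4):
    # Build the same kind of ImageFolder/DataLoader used in train.py.
    ds = datasets.ImageFolder(root, probe_tf)
    dl = torch.utils.data.DataLoader(ds, batch_size=4, shuffle=True,
                                     num_workers=num_workers)
    start = time.time()
    batches = 0
    for inputs, labels in dl:
        batches += 1
        if batches >= n_batches:
            break
    return (time.time() - start) / max(batches, 1)

# Placeholder paths -- substitute the actual gcsfuse mount points.
for root in ['/home/gweld/mount_20k/Test', '/home/gweld/mount_100k/Test']:
    print('{}: {:.2f} s/batch'.format(root, time_batches(root)))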

It seems likely that one bucket is somehow IO-limited relative to the other, but I cannot figure out why.
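
To take PyTorch out of the picture entirely, I could also compare raw read throughput from the two mounts directly, something like the sketch below (again, the directory names are placeholders for my actual mount points).

import os
import time

def read_sample(root, n_files=200):
    # Collect up to n_files file paths under root.
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
            if len(paths) >= n_files:
                break
        if len(paths) >= n_files:
            break

    # Read each file fully and report aggregate throughput.
    total_bytes = 0
    start = time.time()
    for path in paths:
        with open(path, 'rb') as f:
            total_bytes += len(f.read())
    elapsed = time.time() - start
    print('{}: {} files, {:.1f} MB in {:.1f}s ({:.2f} MB/s)'.format(
        root, len(paths), total_bytes / 1e6, elapsed,
        total_bytes / 1e6 / elapsed))

# Placeholder mount points for the 20k and 100k buckets.
read_sample('/home/gweld/mount_20k')
read_sample('/home/gweld/mount_100k')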

Any thoughts are appreciated!

My train.py file is below.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
from collections import defaultdict



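# Note: in this setup the 'Test' folder is used as the training split (hence the
# random crop/flip augmentation below) and 'Val' as the evaluation split.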
data_transforms = {
    'Test': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'Val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}


data_dir = 'home/gweld/sliding_window_dataset/'
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['Test', 'Val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                             shuffle=True, num_workers=4)
              for x in ['Test', 'Val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['Test', 'Val']}
class_names = image_datasets['Test'].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")




def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['Test', 'Val']:
            if phase == 'Test':
                scheduler.step()
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            class_corrects = defaultdict(int)
            class_totals   = defaultdict(int)

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'Test'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == 'Test':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

                for index, pred in enumerate(preds):
                    actual = labels.data[index]
                    class_name = class_names[actual]

                    if actual == pred: class_corrects[class_name] += 1
                    class_totals[class_name] += 1

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print('{} Loss: {:.4f} Acc: {:.4f}'.format(
                phase, epoch_loss, epoch_acc))

            if phase == 'Val':
                print("Validation Class Accuracies")

                for class_name in class_totals:
                    class_acc = float(class_corrects[class_name])
                    class_acc = class_acc/class_totals[class_name]

                    print("{:20}{}%".format(class_name, 100*class_acc))
                print("\n")

            # deep copy the model
            if phase == 'Val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:.4f}'.format(best_acc))

    # load best model weights
    model.load_state_dict(best_model_wts)
    return model




model_ft = models.resnet18(pretrained=True)
num_ftrs = model_ft.fc.in_features
model_ft.fc = nn.Linear(num_ftrs, 5)  # second argument is the number of output classes (5 here)

model_ft = model_ft.to(device)

criterion = nn.CrossEntropyLoss()

# Observe that all parameters are being optimized
optimizer_ft = optim.SGD(model_ft.parameters(), lr=0.001, momentum=0.9)

# Decay LR by a factor of 0.1 every 7 epochs
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)


# Train and evaluate
# ^^^^^^^^^^^^^^^^^^

print('Beginning Training on {} train and {} val images.'.format(dataset_sizes['Test'], dataset_sizes['Val']))


model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                       num_epochs=25)




torch.save(model_ft.state_dict(), 'models/test_run_resnet18.pt')
galenweld
  • Are both datasets in the same bucket? If not, check gsutil ls -L -b gs://<bucket> | grep Storage. – howie Mar 02 '19 at 10:57
  • If you’re running the same CPU/GPU ML process on both buckets, and one is much slower, it stands to reason that something is causing the IO performance to be different between the buckets. Is the slow one using Nearline / Coldline [storage classes](https://cloud.google.com/storage/docs/storage-classes#nearline)? Does the fast one have [caching](https://cloud.google.com/storage/docs/metadata#cache-control) enabled? Is there anything different about the way the FUSE filesystem is being mounted / configured? – Dan Mar 03 '19 at 03:35

0 Answers