
I'm working off of a YouTube tutorial about image segmentation in Python: link

That tutorial is based on others, which I have been referencing to refine my code, specifically this one: OpenCV Pytorch Segmentation

I'm using an NVIDIA RTX 2070 graphics card with 8 GB of GPU memory.

My issue is, the original tutorial teaches a basic CPU implementation of a semantic segmentation program using FCN with a ResNet backbone. I wanted to build off of it to utilize the GPU, so I found the latter tutorial. I don't have much experience in this area, but I figured out how to run it on the GPU and instantly hit a GPU OOM issue:


RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 8.00 GiB total capacity; 5.85 GiB already allocated; 26.97 MiB free; 5.88 GiB reserved in total by PyTorch)


When I run this program on a small image, or reduce an HD image to 50% of its resolution, I do not get an OOM error.

My poking and prodding have led me to believe the OOM is the result of how memory is allocated across this task. So I tried implementing the alternative DeepLab solution, hoping it would allocate memory more efficiently, but that was not the case.
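
For what it's worth, here's the little helper I've been using to watch GPU memory while poking around (just a minimal sketch using PyTorch's built-in counters; it assumes CUDA is available):

import torch

def printGPUMemory(tag=''):
    # Tensor memory actually allocated vs. memory reserved by PyTorch's
    # caching allocator (memory_reserved needs a reasonably recent PyTorch;
    # on older versions it was called memory_cached)
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f'{tag} allocated: {allocated:.1f} MiB, reserved: {reserved:.1f} MiB')

I call it before and after the forward pass to see where usage jumps.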

Here is my code:

from PIL import Image
import torch
import torchvision.transforms as T
from torchvision import models
import numpy as np
import imghdr

fcn = None
dlab = None

def getRotoModel():
    # Lazily load both pretrained segmentation models in eval mode;
    # they stay on the CPU until explicitly moved to the GPU
    global fcn
    global dlab
    fcn = models.segmentation.fcn_resnet101(pretrained=True).eval()
    dlab = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Helper function: map each class index in a segmentation mask to its Pascal VOC color
def decode_segmap(image, nc=21):

    label_colors = np.array([(0, 0, 0),  # 0=background
                             # 1=aeroplane, 2=bicycle, 3=bird, 4=boat, 5=bottle
                             (128, 0, 0), (0, 128, 0), (128, 128, 0), (0, 0, 128), (128, 0, 128),
                             # 6=bus, 7=car, 8=cat, 9=chair, 10=cow
                             (0, 128, 128), (128, 128, 128), (64, 0, 0), (192, 0, 0), (64, 128, 0),
                             # 11=dining table, 12=dog, 13=horse, 14=motorbike, 15=person
                             (192, 128, 0), (64, 0, 128), (192, 0, 128), (64, 128, 128), (192, 128, 128),
                             # 16=potted plant, 17=sheep, 18=sofa, 19=train, 20=tv/monitor
                             (0, 64, 0), (128, 64, 0), (0, 192, 0), (128, 192, 0), (0, 64, 128)])

    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)

    for l in range(0, nc):
        idx = image == l
        r[idx] = label_colors[l, 0]
        g[idx] = label_colors[l, 1]
        b[idx] = label_colors[l, 2]

    rgb = np.stack([r, g, b], axis=2)
    return rgb

valid_images = ['jpg', 'png', 'rgb', 'pbm', 'ppm', 'tiff', 'rast', 'xbm', 'bmp', 'exr', 'jpeg']  # formats imghdr can report
dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # fall back to CPU when no GPU is present
def createMatte(filename, matteName, factor):
    if imghdr.what(filename) in valid_images:
        img = Image.open(filename).convert('RGB')

        size = img.size
        w, h = size
        modifiedSize = h * factor  # T.Resize scales the image's smaller edge to this value
        print('Image original size is ', size)
        print('Modified size is ', modifiedSize)
        trf = T.Compose([T.Resize(int(modifiedSize)),
                         T.ToTensor(),
                         T.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])])
        inp = trf(img).unsqueeze(0)

        if fcn is None: getRotoModel()

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            inp = inp.to(dev)
            fcn.to(dev)

        # Inference only: no_grad() keeps PyTorch from storing intermediate
        # activations for backprop, which would balloon memory use
        with torch.no_grad():
            out = fcn(inp)['out'][0]

        om = torch.argmax(out.squeeze(), dim=0).detach().cpu().numpy()
        rgb = decode_segmap(om)
        im = Image.fromarray(rgb)
        im.save(matteName)
    else:
        print('File type is not supported for file ' + filename)
        print(imghdr.what(filename))
        
def createDLMatte(filename, matteName, factor):
    if imghdr.what(filename) in valid_images:
        img = Image.open(filename).convert('RGB')

        size = img.size
        w, h = size
        modifiedSize = h * factor
        print('Image original size is ', size)
        print('Modified size is ', modifiedSize)
        trf = T.Compose([T.Resize(int(modifiedSize)),
                         T.ToTensor(),
                         T.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])])
        inp = trf(img).unsqueeze(0)

        if dlab is None: getRotoModel()

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            inp = inp.to(dev)
            dlab.to(dev)

        with torch.no_grad():
            out = dlab(inp)['out'][0]

        om = torch.argmax(out.squeeze(), dim=0).detach().cpu().numpy()
        rgb = decode_segmap(om)
        im = Image.fromarray(rgb)
        im.save(matteName)
    else:
        print('File type is not supported for file ' + filename)
        print(imghdr.what(filename))
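
For reference, this is how I call these functions (the file paths here are just placeholders; 0.5 is the half-resolution case I mentioned above):

# Example usage -- paths are placeholders
createMatte('frame.jpg', 'frame_matte.png', 0.5)       # FCN at half resolution
createDLMatte('frame.jpg', 'frame_dl_matte.png', 0.5)  # DeepLabV3 variant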

What I want to know is: is there a solution to this GPU issue? I don't want to limit myself to CPU rendering, which takes about a minute per image, when I have a generally powerful GPU. As I said, I'm quite new to most of this, but I'm hoping there's a way to allocate memory more efficiently during this process on my end.

I have a few potential solutions, but I am at a loss for resources to implement them.

  1. (poor solution) Capping calculations on the GPU when it nears the end of its memory and switching the remainder of the task to the CPU. Not only does this feel like a poor solution, I also don't really see how GPU-to-CPU switching could be implemented mid-task.

  2. (better) Fixing memory allocation by segmenting the image into manageable tiles, saving those tiles off into temporary files, and combining them at the end (a rough sketch of what I'm picturing follows this list).

  3. Some combination of both.
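
To make option 2 concrete, this is roughly the naive tiling I have in mind (just a sketch; runModel is a hypothetical stand-in for the FCN/DeepLab call that returns a class-index array for a PIL image):

import numpy as np
from PIL import Image

def segmentInTiles(img, runModel, tile=512):
    # Naively split the image into tile x tile pieces, segment each piece
    # independently, and paste the class maps back into one array.
    # runModel is hypothetical: PIL image -> (H, W) uint8 class-index array.
    w, h = img.size
    full = np.zeros((h, w), dtype=np.uint8)
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            piece = img.crop(box)
            full[box[1]:box[3], box[0]:box[2]] = runModel(piece)
    return full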

Now my worry is that segmenting the image will reduce the quality of the outcome, because each piece won't have the rest of the image for context; I'd need some sort of intelligent stitching, and that's way above my pay grade.

So I'm generally asking whether there are resources out there for tackling those possible solutions, or whether there is a better one.

Finally, is there something wrong with my implementation that is causing the GPU OOM error? I can't tell whether my code isn't optimized or whether DeepLab and FCN are both just extremely memory intensive and can't be optimized from my end. Any help would be much appreciated! Thanks!
