
I am trying to convert a model that uses torch.nn.functional.grid_sample from PyTorch (1.9) to TensorRT (7) with INT8 quantization through ONNX (opset 11). Opset 11 does not support exporting grid_sample to ONNX, so I used ONNX GraphSurgeon together with the external GridSamplePlugin, as proposed here (a minimal sketch of such a patch is shown below). With it, the conversion to TensorRT succeeds both with and without INT8 quantization. The PyTorch model and the TensorRT model without INT8 quantization produce nearly identical results (MSE on the order of 1e-10), but with INT8 quantization the MSE is much higher (about 185).
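
For reference, a GraphSurgeon patch of this kind looks roughly as follows. This is my own sketch, not the exact script from the linked sample; the placeholder op type and the attribute names are assumptions that must match what your exporter emits and what the plugin registers with TensorRT:

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("grid_sample_test.onnx"))
for node in graph.nodes:
    # Rename the opaque node the exporter emitted so that the TensorRT
    # ONNX parser resolves it by the plugin's registered name.
    if node.op == "grid_sampler":             # assumed placeholder op type
        node.op = "GridSample"                # assumed plugin name
        node.attrs = {"plugin_version": "1"}  # assumed attribute
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "grid_sample_test_trt.onnx")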

The grid_sample operator takes two inputs: the input signal and the sampling grid, and both must have the same type. The GridSamplePlugin implements only kFLOAT and kHALF processing. In my case the X coordinate of the absolute sampling grid (before it is converted to the relative coordinates required by grid_sample) varies in the range [-d, W+d], and the Y coordinate in [-d, H+d]; the maximal value of W is 640 and of H is 360, and the coordinates may take non-integer values in these ranges. For test purposes I created a test model that contains only the grid_sample layer, and in this case the TensorRT results with and without INT8 quantization are identical.
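
For concreteness, the absolute-to-relative conversion used in modify_grid below follows the align_corners=True convention, x_rel = 2 * x_abs / (W - 1) - 1, which maps [0, W-1] onto [-1, 1]. A quick check (my own illustration):

# Quick check of the absolute -> relative mapping (align_corners=True
# convention): [0, W-1] maps onto [-1, 1].
W = 640
for x_abs in (0.0, 319.5, 639.0):
    x_rel = 2.0 * x_abs / (W - 1) - 1.0
    print(x_abs, "->", round(x_rel, 4))   # -1.0, 0.0, 1.0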

Here is the code of the test model:

import torch
import numpy as np
import cv2

BATCH_SIZE = 1
WIDTH = 640
HEIGHT = 360

def calculate_grid(B, H, W, dtype, device='cuda'):
    # Build an absolute sampling grid: xx holds X coordinates sheared by
    # 0.25 * y per row, yy holds plain Y coordinates.
    xx = torch.arange(0, W, device=device).view(1, -1).repeat(H, 1).type(dtype)
    yy = torch.arange(0, H, device=device).view(-1, 1).repeat(1, W).type(dtype)
    xx = xx + yy * 0.25
    if B > 1:
        xx = xx.view(1, 1, H, W).repeat(B, 1, 1, 1)
        yy = yy.view(1, 1, H, W).repeat(B, 1, 1, 1)
    else:
        xx = xx.view(1, 1, H, W)
        yy = yy.view(1, 1, H, W)
    vgrid = torch.cat((xx, yy), 1).type(dtype)
    return vgrid.type(dtype)

def modify_grid(vgrid, H, W):
    # Convert absolute pixel coordinates to the relative [-1, 1] coordinates
    # expected by grid_sample (align_corners=True convention), then move the
    # channel dimension last: (B, 2, H, W) -> (B, H, W, 2).
    vgrid = torch.cat([
        torch.sub(2.0 * vgrid[:, :1, :, :].clone() / max(W - 1, 1), 1.0),
        torch.sub(2.0 * vgrid[:, 1:2, :, :].clone() / max(H - 1, 1), 1.0),
        vgrid[:, 2:, :, :]], dim=1)
    vgrid = vgrid.permute(0, 2, 3, 1)
    return vgrid

class GridSamplingBlock(torch.nn.Module):

    def __init__(self):
        super(GridSamplingBlock, self).__init__()

    def forward(self, input, vgrid):
        # align_corners=True matches the normalization in modify_grid;
        # the PyTorch default has been align_corners=False since 1.3, which
        # would silently disagree with the grid computed above.
        output = torch.nn.functional.grid_sample(input, vgrid, align_corners=True)
        return output

if __name__ == '__main__':
    model = torch.nn.DataParallel(GridSamplingBlock())
    model.cuda()
    print("Reading inputs")
    img = cv2.imread("result/left_frame_rect_0373.png")
    img = cv2.resize(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), (WIDTH, HEIGHT))
    # Use float32 (not Python float, i.e. float64): the GridSamplePlugin only
    # handles kFLOAT and kHALF, and the saved grid inherits this dtype.
    img_in = torch.from_numpy(img.astype(np.float32)).view(1, 1, HEIGHT, WIDTH).cuda()
    vgrid = calculate_grid(BATCH_SIZE, HEIGHT, WIDTH, img_in.dtype)
    vgrid = modify_grid(vgrid, HEIGHT, WIDTH)
    np.save("result/grid", vgrid.cpu().detach().numpy())
    print("Getting output")
    with torch.no_grad():
        model.module.eval()
        img_out = model.module(img_in, vgrid)
        img = img_out.cpu().detach().numpy().squeeze()
        cv2.imwrite("result/grid_sample_test_output.png", img.astype(np.uint8))
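
To create the ONNX model from this test code, one option (my own hedged sketch, not the script from the linked example) is to export with ONNX_FALLTHROUGH, so that the unsupported grid_sample survives as an opaque node for the GraphSurgeon patch shown above:

# Opset 11 cannot express grid_sample; ONNX_FALLTHROUGH keeps the
# unsupported aten op as-is instead of failing, and the GraphSurgeon
# patch then renames that node for the plugin.
torch.onnx.export(
    model.module, (img_in, vgrid), "grid_sample_test.onnx",
    opset_version=11,
    input_names=["input", "grid"], output_names=["output"],
    operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH)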

The saved grid is used for both calibration and inference of the TensorRT model (a calibrator sketch follows below).
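
For illustration, a minimal Python calibrator that could feed the saved arrays during INT8 calibration. This is my own sketch with assumed names; a real calibrator would cycle over many calibration batches rather than a single one:

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class GridCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, image, grid):
        super().__init__()
        # Tensor names must match the ONNX input names ("input", "grid").
        self.data = {"input": np.ascontiguousarray(image, dtype=np.float32),
                     "grid": np.ascontiguousarray(grid, dtype=np.float32)}
        self.buffers = {k: cuda.mem_alloc(v.nbytes) for k, v in self.data.items()}
        self.done = False

    def get_batch_size(self):
        return 1

    def get_batch(self, names):
        if self.done:              # only one calibration batch in this sketch
            return None
        for name in names:
            cuda.memcpy_htod(self.buffers[name], self.data[name])
        self.done = True
        return [int(self.buffers[name]) for name in names]

    def read_calibration_cache(self):
        return None                # always recalibrate in this sketch

    def write_calibration_cache(self, cache):
        pass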

So the questions are:

  • Is it valid to apply INT8 quantization to operations with at least one indexing input (like grid_sample)? Doesn't such quantization lead to a significant change in the result (for example, if INT8 quantization is applied to an input with the range [0, 640))?
  • How does INT8 quantization work with a custom plugin if only FP32 and FP16 are implemented in the plugin code?
  • Do the test network's TensorRT results with and without INT8 quantization match only because the grid_sample grid is actually a network input?

My environment:

  • TensorRT Version: 7
  • GPU Type: NVIDIA GeForce GTX 1050 Ti
  • Nvidia Driver Version: 470.63.01
  • CUDA Version: 10.2.89
  • CUDNN Version: 8.1.1
  • Operating System + Version: Ubuntu 18.04
  • Python Version (if applicable): 3.7
  • PyTorch Version (if applicable): 1.9

Steps to reproduce:

  • Run the test code to save the grid and get the PyTorch result; use any input image for the test.
  • Build TensorRT OSS with the custom plugin according to this sample. The latest version of TRT OSS requires some adaptation of GridSamplePlugin, so it is better to use the recommended TensorRT OSS version.
  • Create the ONNX model according to the code example (an export sketch is shown above, after the test code).
  • Create the TensorRT engine with or without INT8 quantization and run the inference (a minimal Python build sketch follows this list). In my C++ code I used https://github.com/llohse/libnpy for reading the grid.npy file.
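
For completeness, a hedged sketch of the engine build with the TensorRT 7 Python API. This is my own sketch: it assumes the patched ONNX file and the GridCalibrator sketched above, with image and grid standing for the arrays saved by the test script, and the rebuilt plugin library already loaded:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt.init_libnvinfer_plugins(TRT_LOGGER, "")   # register available plugins

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("grid_sample_test_trt.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30
config.set_flag(trt.BuilderFlag.INT8)                 # omit these two lines
config.int8_calibrator = GridCalibrator(image, grid)  # for an FP32 engine

engine = builder.build_engine(network, config)
with open("grid_sample_test.engine", "wb") as f:
    f.write(engine.serialize())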

1 Answer


You can split your model into two parts, one before grid_sample and one after it, and apply INT8 quantization to each part separately. Running grid_sample itself in INT8 would degrade your model's accuracy considerably. Note that splitting changes your network structure, so it may affect how the graph is optimized.
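
To put that in numbers (my own back-of-the-envelope check, not part of the original answer): if the normalized grid in [-1, 1] is quantized to INT8, one quantization step already spans several pixels at W = 640:

W = 640
step_norm = 2.0 / 255.0              # one INT8 step over the [-1, 1] grid range
step_px = step_norm * (W - 1) / 2.0  # converted back to absolute pixels
print(f"{step_norm:.5f} normalized = {step_px:.2f} px per INT8 step")
# ~2.51 px: sampling positions snap to a roughly 2.5-pixel lattice, which is
# more than enough to explain an MSE of ~185 on 8-bit image data.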
