I'm trying to run half-precision (FP16) inference with a model written natively with the TensorRT C++ API (not parsed from another framework such as Caffe or TensorFlow). To the best of my knowledge there is no public working example of this; the closest thing I've found is the sampleMLP code released with TensorRT 4.0.0.3, but its release notes state that FP16 is not supported.
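For reference, by "half-precision inference" I mean the standard builder-level switch, roughly as in the sketch below (simplified, not my exact code; `setFp16Mode` is the name I see in the TensorRT 4 `IBuilder` interface, with `setHalf2Mode` on older releases):

```cpp
#include "NvInfer.h"
#include <iostream>

using namespace nvinfer1;

// Minimal logger required by createInferBuilder().
class Logger : public ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity != Severity::kINFO)
            std::cout << msg << std::endl;
    }
} gLogger;

ICudaEngine* buildFp16Engine(IBuilder& builder, INetworkDefinition& network)
{
    builder.setMaxBatchSize(1);
    builder.setMaxWorkspaceSize(1 << 20);

    // Request FP16 kernels only if the GPU actually supports them.
    if (builder.platformHasFastFp16())
        builder.setFp16Mode(true); // setHalf2Mode(true) on older releases

    return builder.buildCudaEngine(network);
}
```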
My toy example code can be found in this repo. It contains the architecture and inference routine implemented with the API, plus the Python script I use to convert my dictionary of trained weights to the wtd TensorRT format.
My toy architecture consists of a single convolution. The goal is to obtain similar results between FP32 and FP16, apart from a reasonable loss of precision. The code works as expected with FP32, whereas with FP16 inference I get values of a completely different order of magnitude (~1e40), so it looks like I'm doing something wrong during the conversion.
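Concretely, the network definition boils down to something like the following sketch (simplified placeholders, not my actual code; `kernelData`/`biasData` stand in for the buffers read back from the converted weight file, and the `DataType` passed in the `Weights` structs is exactly the part I suspect I'm getting wrong):

```cpp
#include "NvInfer.h"

using namespace nvinfer1;

// Single-convolution toy network. Whether the weight buffers should contain
// 32-bit floats tagged DataType::kFLOAT, or 16-bit halves tagged
// DataType::kHALF, once FP16 mode is enabled at build time, is what I'm
// unsure about.
INetworkDefinition* buildToyNetwork(IBuilder& builder,
                                    const void* kernelData, int64_t kernelCount,
                                    const void* biasData, int64_t biasCount,
                                    DataType weightType)
{
    INetworkDefinition* network = builder.createNetwork();

    // Single input feature map, e.g. 1x5x5 (placeholder dimensions).
    ITensor* data = network->addInput("data", DataType::kFLOAT, DimsCHW{1, 5, 5});

    Weights convW{weightType, kernelData, kernelCount};
    Weights convB{weightType, biasData, biasCount};

    // One 3x3 convolution producing a single output map.
    IConvolutionLayer* conv =
        network->addConvolution(*data, 1, DimsHW{3, 3}, convW, convB);
    conv->getOutput(0)->setName("out");
    network->markOutput(*conv->getOutput(0));

    return network;
}
```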
I'd appreciate any help in understanding the problem.
Thanks,
f