Note: I am not an expert on the subject, and this answer likely contains factual errors. I am posting it in the hope that someone will correct me, as I am also interested in the matter.
If you exported a torch model to ONNX with their quantization tools, the resulting model is most likely in QDQ format. As far as I have understood it, this means that in the exported graph a QuantizeLinear/DequantizeLinear node pair is inserted around every quantized Operator. When onnxruntime parses and optimizes the model, these patterns are detected and replaced with the corresponding integer Operators.
An illustration: the leftmost graph is the original model, the center one is the exported ONNX model, and on the right is what onnxruntime actually executes after optimizing the graph. So during the creation of an InferenceSession, intermediate tensors are promoted to INT32 or FP32, hence the high memory use.
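If you want to check which format your exported model is actually in, a quick sketch like this counts the relevant node types (assuming the `onnx` package is installed; `"model.onnx"` is a placeholder for your exported file):

```python
import onnx
from collections import Counter

# Load the exported model ("model.onnx" is a placeholder path) and count
# node types: a QDQ-format model contains many QuantizeLinear /
# DequantizeLinear pairs, while an operator-oriented model contains
# integer ops such as QLinearConv / QLinearMatMul instead.
model = onnx.load("model.onnx")
counts = Counter(node.op_type for node in model.graph.node)
for op in ("QuantizeLinear", "DequantizeLinear", "QLinearConv", "QLinearMatMul"):
    print(op, counts.get(op, 0))
```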
My suggestion would be to try Operator-oriented quantization, where instead of the fake QDQ layers the ONNX model contains the actual integer Operators in the graph definition, before any optimizations. ORT provides tools for both quantization formats, although I have no idea how any of this interacts with your 2-bit packing scheme.
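As a rough sketch of what I mean, using onnxruntime's static quantization tooling with `QuantFormat.QOperator` (the model paths, the input name `"input"`, and the dummy calibration reader are placeholders you would have to adapt):

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)


class DummyReader(CalibrationDataReader):
    """Feeds a handful of random samples for calibration.

    In practice you would yield real preprocessed inputs here; the
    input name "input" and the shape are placeholders.
    """

    def __init__(self, num_samples=8):
        self._samples = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        return next(self._samples, None)


# QuantFormat.QOperator writes the integer operators (QLinearConv, etc.)
# directly into the graph instead of QDQ node pairs.
quantize_static(
    "model_fp32.onnx",            # placeholder: your float model
    "model_int8_qoperator.onnx",  # placeholder: output path
    calibration_data_reader=DummyReader(),
    quant_format=QuantFormat.QOperator,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```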
Another thing worth trying is to play around with SessionOptions.graph_optimization_level. Graph optimization during initialization can consume a lot of memory, so disabling it or toning it down could potentially help a bit.
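Something along these lines (just a sketch; `"model.onnx"` is a placeholder path):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Try ORT_ENABLE_BASIC or ORT_DISABLE_ALL instead of the default
# ORT_ENABLE_ALL to reduce the memory spent on graph rewriting.
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC

session = ort.InferenceSession("model.onnx", sess_options=so)
```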