
For example, according to https://cocktailpeanut.github.io/dalai/#/, the relevant figures for LLaMA-65B are:

  • Full: The model takes up 432.64GB
  • Quantized: 5.11GB * 8 = 40.88GB

The full model won't fit in memory on even a high-end desktop computer.

The quantized one would. (But it would not fit in video memory on even a $2000 Nvidia graphics card.)

However, CPUs don't generally support anything smaller than fp32. And sure enough, when I've tried running Bloom 3B and 7B on a machine without a GPU, the memory consumption appeared to be about 12GB and 28GB respectively.
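
For concreteness, here is a minimal sketch of one way to load such a model on a CPU-only machine with Hugging Face transformers (the checkpoint name and default-dtype behaviour are assumptions, not necessarily the exact setup I used). The library materializes the weights in fp32 by default, i.e. 4 bytes per parameter, which matches the figures above:

```python
# Minimal sketch (assumed usage): loading Bloom on a CPU-only machine with transformers.
# By default from_pretrained() materializes weights in fp32 (4 bytes per parameter),
# which matches ~12GB for 3B parameters and ~28GB for 7B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-3b"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # default dtype: torch.float32

# Parameter memory check: ~3e9 params * 4 bytes ≈ 12GB
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameter memory: {n_bytes / 1e9:.1f} GB")

# A short generation to confirm it actually runs on CPU
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```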

Is there a way to gain the memory savings of quantization when running the model on a CPU?

rwallace
  • A side opinion: I think people don't bother implementing it on CPU simply because CPU is just too slow. You can easily get a "freemium" T4 or P100 on Colab or Kaggle, and the speed is significantly higher than probably any CPU. – Minh-Long Luu Mar 16 '23 at 06:51
  • @Minh-LongLuu For my initial purposes of running experiments to find out what these models can do at all, slow is tolerable. – rwallace Mar 16 '23 at 06:53
  • @Minh-LongLuu That having been said, thanks for the suggestions! I tried Colab on Bloom-7B just now, using a notebook someone already published, and it said 'Your session crashed after using all available RAM. If you are interested in access to high-RAM runtimes, you may want to check out Colab Pro.' – rwallace Mar 16 '23 at 06:59
  • x86 support for half-precision FP16 and BF16: see support for the AVX-512 FP16 and BF16 extensions on https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 - Cooper Lake and Zen 4 have BF16, and Alder Lake if you have a model where you can use old microcode to not disable AVX-512. Or Sapphire Rapids (Xeon) has BF16 and FP16. Also see [Half-precision floating-point arithmetic on Intel chips](https://stackoverflow.com/q/49995594) - to save memory footprint, conversion to FP16 during load/store works fine since Ivy Bridge. But no speedup for FPU throughput, only memory. – Peter Cordes Mar 16 '23 at 07:39
  • No idea what machine-learning software packages support which of those formats and ways of using them on modern x86 CPUs, though. Zen4 is widely available in non-server machines, and BF16 is supposed to be ideal for neural network stuff (same exponent range as FP32, fewer mantissa bits than FP16). – Peter Cordes Mar 16 '23 at 07:41
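
To make the point in the comments above concrete, here is a minimal PyTorch sketch (assuming a PyTorch build with CPU kernels for bfloat16, which work functionally even without AVX-512 BF16 hardware): keeping weights in a 16-bit format halves their memory footprint versus fp32, whether or not the arithmetic itself gets any faster.

```python
# Sketch: weights kept in bfloat16 take half the memory of fp32.
# The CPU may upconvert to fp32 internally for the arithmetic, so this saves
# memory footprint, not necessarily time (assumption: CPU bfloat16 kernels exist
# in the installed PyTorch build).
import torch

w_fp32 = torch.randn(4096, 4096)            # 4 bytes/element -> 64 MiB
w_bf16 = w_fp32.to(torch.bfloat16)          # 2 bytes/element -> 32 MiB
x = torch.randn(8, 4096, dtype=torch.bfloat16)

y = x @ w_bf16                              # runs on CPU; output stays bfloat16
print(w_fp32.element_size(), w_bf16.element_size(), y.dtype)  # 4 2 torch.bfloat16
```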

1 Answer


Okay, I finally got LLaMA-7B running on CPU and measured: the fp16 version takes 14GB, the fp32 version 28GB. This is on an old CPU without AVX-512, so presumably the format is being expanded to fp32 as the weights are read into cache or registers; either way, yes, the memory saving is realized.
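
For what it's worth, here is a hedged sketch of one way to reach a similar footprint with Hugging Face transformers (not necessarily the runtime used for the measurement above; the checkpoint path is a placeholder, since the LLaMA weights have to be obtained and converted separately). A 16-bit dtype means 2 bytes per parameter, i.e. roughly 14GB for 7B parameters:

```python
# Sketch (assumed usage): load a converted LLaMA-7B checkpoint in a 16-bit dtype on CPU.
# bfloat16 is used here because PyTorch's CPU kernel coverage for it is broader than fp16.
import torch
from transformers import AutoModelForCausalLM

model_path = "path/to/llama-7b-hf"  # placeholder: a locally converted checkpoint
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

# Parameter memory check: ~7e9 params * 2 bytes ≈ 14GB
n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"parameter memory: {n_bytes / 1e9:.1f} GB")
```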

rwallace