For example, the figures given at https://cocktailpeanut.github.io/dalai/#/ for LLaMA-65B are:
- Full: the model takes up 432.64 GB
- Quantized: 5.11 GB * 8 = 40.88 GB
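(Sanity-checking those numbers with my own back-of-envelope arithmetic, not anything from the dalai page: 4-bit weights cost half a byte per parameter, a bit more once per-block scale factors are stored, so 40-odd GB for 65B parameters looks about right.)

```python
params = 65e9                # LLaMA-65B parameter count
print(params * 4 / 1e9)      # fp32:  4 bytes/param   -> ~260 GB
print(params * 0.5 / 1e9)    # 4-bit: 0.5 bytes/param -> ~32.5 GB
# If a scale factor is stored per small block of weights (as I believe
# llama.cpp's 4-bit format does), the effective cost is closer to
# 5 bits/param, which lands near the 40.88 GB figure above:
print(params * 5 / 8 / 1e9)  # -> ~40.6 GB
```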
The full model won't fit in memory on even a high-end desktop computer.
The quantized one would (though it still wouldn't fit in the video memory of even a $2000 Nvidia graphics card).
However, CPUs generally don't support arithmetic on anything smaller than fp32. And sure enough, when I've tried running BLOOM 3B and 7B on a machine without a GPU, the memory consumption appeared to be 12 GB and 28 GB respectively: exactly 4 bytes per parameter, i.e. plain fp32.
Is there a way to get the memory savings of quantization when running the model on a CPU?
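To make the question concrete, here's the kind of thing I imagine a CPU runtime doing: keep the weights in memory as int8 and dequantize each row on the fly inside the matmul, so a full-precision copy of the weights never exists all at once. A minimal NumPy sketch (my own illustration of symmetric per-row int8 quantization; the names and the scheme are mine, not any particular library's API):

```python
import numpy as np

def quantize_rows(w):
    """Symmetric per-row int8 quantization: w ~= scale[:, None] * q."""
    scale = np.abs(w).max(axis=1) / 127.0            # one fp32 scale per row
    q = np.round(w / scale[:, None]).astype(np.int8)  # 1 byte/param storage
    return q, scale.astype(np.float32)

def matvec_int8(q, scale, x):
    """y = W @ x, dequantizing only one row to fp32 at a time."""
    y = np.empty(q.shape[0], dtype=np.float32)
    for i in range(q.shape[0]):
        y[i] = scale[i] * np.dot(q[i].astype(np.float32), x)
    return y

# The weight matrix now occupies 1 byte/param instead of 4.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
x = rng.standard_normal(512).astype(np.float32)
q, scale = quantize_rows(w)
err = np.abs(matvec_int8(q, scale, x) - w @ x).max()
print(err)  # small relative to the entries of w @ x
```

So the dot products themselves can stay in fp32; the savings would come purely from how the weights are stored in RAM. Does any existing runtime actually work this way on a CPU?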