I'm trying some experiments running downloaded language models on a desktop machine. Specifically, so far, Bloom-3B and 7B on a machine with 32GB RAM, a 2-core CPU and no GPU.

(Throughout this question, I will be talking only about inferring – actually running – pretrained models. Training workload for large language models is conveniently measured in petaflop-days; doing that on a desktop is obviously out of the question.)

As might be expected, running these models on a weak CPU is somewhat slow, minutes to tens of minutes per run, and it would be nice to have hardware that would do it faster. But the limiting factor actually seems to be memory.

Bloom-3B downloads as 6GB, takes about 12GB RAM when running. Bloom-7B downloads as 14GB, takes about 28GB RAM when running. (I'm guessing this is because they download as fp16, but the CPU doesn't understand that format, so they need to be expanded to fp32 for running?) That means, with Windows and other stuff in the background, Bloom-7B struggles to run on this machine, and anything bigger would not work.
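The observed sizes are consistent with the fp16-on-disk, fp32-in-memory guess. A quick back-of-envelope sketch (using the nominal parameter counts of 3 and 7 billion; the actual checkpoints differ slightly):

```python
# Sanity-check the observed sizes against "fp16 on disk, fp32 in RAM".
# Assumes 1 GB = 1e9 bytes and counts only the weights, not runtime overhead.

def model_size_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate storage needed for the weights alone, in GB."""
    return n_params * bytes_per_param / 1e9

for name, n in [("Bloom-3B", 3e9), ("Bloom-7B", 7e9)]:
    disk = model_size_gb(n, 2)  # fp16: 2 bytes per parameter
    ram = model_size_gb(n, 4)   # fp32: 4 bytes per parameter
    print(f"{name}: ~{disk:.0f}GB on disk (fp16), ~{ram:.0f}GB in RAM (fp32)")
    # -> Bloom-3B: ~6GB / ~12GB, Bloom-7B: ~14GB / ~28GB
```

These match the downloaded and resident sizes above almost exactly, which supports the fp16-expanded-to-fp32 explanation.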

It is usually said that neural networks prefer to run on a GPU. I looked at the list of graphics cards on offer from one supplier, and the top-end Nvidia card, costing about $2000 (several times the value of this entire computer!), only has 24GB of video memory. On the face of it, that means if I had that card, it would be amazingly fast at running Bloom-3B, but Bloom-7B (which is obviously much more interesting), either wouldn't run at all, or would have to spend most of its time swapping data from main memory, so the speed of the GPU would be wasted. (Which of those is the case? Would it run at all? Would there be a memory-saving benefit from being able to keep the parameters as fp16, or am I misunderstanding the issue with that?)
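The arithmetic behind that fit-in-VRAM question can be made explicit. This is a weights-only sketch; it deliberately ignores activation memory and other runtime overhead, which add to the real requirement:

```python
# Weights-only check of whether a model fits in a given amount of VRAM.
# Ignores activations and framework overhead, so "True" is optimistic.

def fits_in_vram(n_params: float, bytes_per_param: int, vram_gb: float) -> bool:
    """Do the raw weights fit in vram_gb gigabytes (1 GB = 1e9 bytes)?"""
    return n_params * bytes_per_param / 1e9 <= vram_gb

vram = 24  # the top-end card from the question
print(fits_in_vram(3e9, 4, vram))  # Bloom-3B at fp32: 12GB -> True
print(fits_in_vram(7e9, 4, vram))  # Bloom-7B at fp32: 28GB -> False
print(fits_in_vram(7e9, 2, vram))  # Bloom-7B kept at fp16: 14GB -> True
```

So whether fp16 can be kept on the card makes exactly the difference the question is circling around: at fp32 the 7B weights alone exceed 24GB, at fp16 they do not.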

If this analysis is correct, then when you are inferring (rather than training) a large language model, the usual wisdom about GPUs is inapplicable, and what you actually need is a fast CPU and lots of RAM.

Is that correct, or am I missing something?

rwallace
    The NVIDIA RTX 6000 has 48GB; for larger models, you need a lot of conventional RAM. 2TB RAM servers are quite easy to find. Yes, they are expensive (the RAM alone will cost around $20k). – Iłya Bursov Mar 15 '23 at 03:28
