Is it a requirement for a server to use ECC RAM on the GPU when the normal CPU RAM is ECC? I'm thinking that instead of using a Quadro K6000 or AMD FirePro, we could use a GTX 980 or AMD R9 290... if possible... Also, if ECC RAM is not necessarily required... then is there a "server" GPU? (An i7 processor is "like" a server E7... the E7 is generally for a server/workstation, as the i7 is generally for a desktop.) Please help!!!
4 Answers
GPU ECC RAM is not a strict requirement for any server; it is in no way tied to the use of ECC system RAM.
Still, in some circumstances, GPUs with ECC memory are strongly preferred. Basically, you need ECC VRAM if you use the GPU for high-accuracy GPU-compute tasks (think double-precision Folding@Home or similar). It is no coincidence that ECC VRAM is mostly found on compute-grade video cards (e.g. Tesla K10, Titan), while their gaming-oriented siblings use normal, non-ECC RAM.
When used for CAD/CAM and/or post-processing rendering, ECC VRAM is a welcome addition but not an absolute necessity. For gaming, ECC VRAM is nearly useless.
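To make the stakes concrete, here is a minimal Python sketch (an illustration, not from the answer) of why a single flipped bit matters for high-accuracy compute: one corrupted bit in an IEEE-754 double can silently change the value.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip a single bit in the IEEE-754 binary64 representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

value = 1.0
corrupted = flip_bit(value, 51)  # flip the most significant mantissa bit
print(value, "->", corrupted)    # 1.0 -> 1.5
```

A flip in the mantissa turns 1.0 into 1.5; a flip in the exponent bits would be far worse. ECC detects and corrects these single-bit errors before they propagate through a long-running computation.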
What really sets Quadros apart from consumer-class video cards is not ECC memory, but the drivers you can install on the former. CAD/CAM/3D modelling requires the manipulation of complex wireframe models, where the geometry engine and line-antialiasing capabilities are pushed to the limit (in contrast to games, which stress texturing, fill rate and full-screen antialiasing). The right card (e.g. a low- or mid-end Quadro) with the right driver can deliver 2x or 3x better performance than a higher-end consumer card. Here you can find some examples.

The Xbitlab link is dead. – Soleil Jul 24 '21 at 11:45
You can use the Studio driver with the "gaming" cards like the RTX 3060 now, so that "pro drivers" differentiation is gone these days. – Chris Smith Jan 22 '23 at 05:10
Still, the issue is really what you do. Using the GPU for AI work means a flipped bit may invalidate a model, which is different from using the card only for some UI work, where a single bit flip is not going to have consequences. – TomTom Jan 22 '23 at 09:46
The biggest problem with using desktop cards in a server isn't memory (which won't matter); it's space and power.
The server cards are usually smaller, without the massive 2-3 slot heatsinks and fans desktop cards can have.
They also usually don't require an extra power cable. Most servers don't have a 6- or 8-pin video card power connector (some may, or you may be able to hack one in).
Heat is also an issue: in small rackmount systems, there is only so much heat that can be removed with 1-inch fans.
And lastly, drivers: some desktop cards won't have proper drivers for server operating systems. Sometimes you can use the equivalent client OS drivers, sometimes not.
The other difference is how the cards perform at various tasks. Desktop cards are designed for gaming. Server and workstation cards usually excel at 2D performance for things like GPU acceleration in terminal servers, and things like AutoCAD rendering. They also tend to be more stable, and cost a lot more.
If a desktop card will fit your server, has the appropriate power connections, won't overheat, and offers the type of performance you need, go for it.
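The power-connector point above can be sketched with a quick back-of-the-envelope check. A PCIe x16 slot supplies up to 75 W, and each 6-pin or 8-pin auxiliary connector adds 75 W or 150 W respectively (numbers from the PCIe CEM specification); the card TDPs below are illustrative.

```python
# PCIe power budget: slot supplies up to 75 W;
# each 6-pin aux connector adds 75 W, each 8-pin adds 150 W.
SLOT_W, SIX_PIN_W, EIGHT_PIN_W = 75, 75, 150

def card_fits(card_tdp_w: int, six_pins: int = 0, eight_pins: int = 0) -> bool:
    """True if the slot plus available aux connectors cover the card's TDP."""
    return card_tdp_w <= SLOT_W + six_pins * SIX_PIN_W + eight_pins * EIGHT_PIN_W

print(card_fits(165))              # 165 W card on slot power alone -> False
print(card_fits(165, six_pins=2))  # same card with two 6-pin cables -> True
```

This is why a server with no spare PCIe power cables simply cannot run most desktop gaming cards, whatever the drivers say.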

He literally answered the entire question: "isn't about memory (which won't matter)" answered the first part of the question, and his longer explanation answered the second part. ECC RAM isn't required, and a "normal" GPU might not work fine in a server anyway. – mfinni Jan 22 '23 at 11:27
2023-02-21: Passmark's MemTest86 tool documentation has good info on ECC.
ECC memory is meant to protect you from random bit flips caused by events like cosmic rays.
Google did a study and concluded:
About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year... the number of correctable errors per DIMM is highly variable, with some DIMMs experiencing a huge number of errors, compared to others.
Granted, this study was for system RAM and not VRAM, but corruption can happen. IMO it's a low probability, and that's what backups are for, hopefully :-).
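Taking the quoted "over 8% of DIMMs" figure at face value, a rough sketch of what that means for a whole machine (assuming errors hit DIMMs independently, which the study explicitly says is not true — error counts were highly variable per DIMM — so treat this as illustration only):

```python
# Back-of-the-envelope using the quoted "over 8% of DIMMs per year" figure.
# Independence across DIMMs is an assumption for illustration only.
P_DIMM = 0.08  # annual probability one DIMM sees >= 1 correctable error

def p_any_error(n_dimms: int, p: float = P_DIMM) -> float:
    """Probability at least one of n DIMMs logs a correctable error in a year."""
    return 1 - (1 - p) ** n_dimms

for n in (4, 8, 16):
    print(n, round(p_any_error(n), 3))  # 4 -> 0.284, 8 -> 0.487, 16 -> 0.737
```

With 8+ DIMMs the machine-level probability approaches a coin flip, which is consistent with the study's "about a third of machines" observation.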
I think there are other more important factors to consider within the same GPU series (3000 series for example) when buying a GPU: amount of VRAM, physical size, cooling/noise, power and nvLink support.
For example, I have an RTX A2000 w/ 12GB ECC VRAM and an Asus ROG Strix RTX 3060 w/ 12GB non-ECC VRAM. The 3060 can use both the Gaming and Studio NVIDIA drivers, it's faster, it runs cooler, has the same amount of VRAM, and is quieter. Yes, it's big, taking up 3 slots, it uses more power, and I can't use NVLink, but I have space in my case, only need one GPU, the power is no biggie, and I'll roll the dice on bit flips.
It's just trade-offs at the end of the day, like most things in computing.
My main problem with the workstation GPUs is the noise of those whiny blower fans, so I'm willing to give up some things for lower noise as long as I get enough VRAM.

Regarding "that's what backups are for", I suppose the main problem with that reasoning is that it's likely not immediately obvious that your GPGPU computations silently produced an incorrect result because of that bit flip (that's what ECC is for, I suppose), and who knows how many secondary errors may have been introduced from using that result before someone starts asking what is going on. But it of course depends on what you use that GPU for, whether that is a concern or not. – Håkan Lindqvist Jan 22 '23 at 15:33
I will add it does look like they make a [heatsink mod](https://www.aliexpress.us/item/3256803558040033.html?pdp_npi=2%40dis%21USD%21US%20%2438.00%21%2438.00%21%21%21%21%21%402103222716744936687048766eb335%2112000027013732918%21btf&_t=pvid%3Ad8b9c91f-00d3-404a-bd13-332e20bc9694&afTraceInfo=1005003744354785__pc__pcBridgePPC__xxxxxx__1674493669&spm=a2g0o.ppclist.product.mainProduct&gatewayAdapt=glo2usa&_randl_shipto=US) for the A4000 which maybe gets rid of the annoying noise of the workstation GPUs but I haven't tried it. – Chris Smith Jan 23 '23 at 17:10
RAM is irrelevant. The biggest problem is that consumer GPUs are nowadays intentionally built so that the power cable won't fit in a server chassis (it was moved from the rear of the card to the side).
You can’t use consumer GPUs in datacenters
Sometimes they intentionally leave known bugs in the GTX/RTX drivers while fixing them in the workstation/server cards, which cost five times the money, of course.
NVIDIA also puts various legal restrictions in its EULAs which explicitly prohibit the use of such cards in data centers. So, yes, you can kinda use consumer GPUs in the data center, but you'll face a lot of problems.
