Let us assume that we have a pool of some 50 computers, each with 6 cores and 12 threads.

If someone plans to run intensive astrophysics simulations on all of its logical CPUs (50*12 = 600) 24x7, how long can it sustain this without any physical damage? Assume simple cooling with ACs, and the CPUs come with their own fans. Can there be any performance degradation over time? If yes, what is the solution?

Please note the two main requirements:

  1. 100% CPU usage on all CPUs, and
  2. continuous running over, say, years.
Rohit Gupta
Peedaruos
    Are the servers enterprise grade? Do you have appropriate cooling and powering systems? Can the application survive one or more servers down? – Romeo Ninov May 24 '23 at 12:56
  • No, they are consumer-class CPUs, say Intel Core (6C, 12T). As I mentioned, there is nothing special about the cooling: these CPUs come with their own fans, and the entire setup will be placed in an air-conditioned room at, say, 20 degC. All CPUs are needed at all times. – Peedaruos May 25 '23 at 14:16
    Consumer grade hardware will not last very long being used in this way, certainly not years. If the application is important, and it's worth running it for that long, then it's probably worth investing in proper enterprise grade workstations/servers. – user1751825 May 25 '23 at 14:26
    If you're hoping to save money by using consumer grade, you're perhaps not considering the cost of the electricity to run the computers and air conditioning. The purchase price may be quite insignificant compared to the running costs. – user1751825 May 25 '23 at 14:30
    Cheaper hardware will potentially fail more frequently/sooner making a proper cluster/workload design that allows for failure and accommodates re-running uncompleted work packages from a failed node essential. – HBruijn May 25 '23 at 14:33
  • The larger the number of compute nodes, the more likely that over time you will see failures and misbehaving nodes, almost regardless of the quality of your hardware. If you for example use memory that has a mean time between failure rating of 1.5 million hours (±170 years) and you have 50 nodes with 32 memory slots each (1600 pieces of RAM), then you'll have ± one stick of memory failing every month. In a larger cluster, with 300 nodes for example, that will mean a memory failure every week. That is almost unavoidable, and your workload needs to be designed to be able to cope with failures. – HBruijn May 25 '23 at 14:44
  • Does this answer your question? [Can you help me with my capacity planning?](https://serverfault.com/questions/384686/can-you-help-me-with-my-capacity-planning) – djdomi May 25 '23 at 17:21
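The MTBF arithmetic in HBruijn's comment can be checked with a short sketch. It assumes failures are independent, so the fleet-wide MTBF is simply the per-part MTBF divided by the part count; the numbers (1.5 million hours per DIMM, 32 slots per node) are taken from the comment, not from any real hardware datasheet.

```python
# Sketch: expected memory-failure interval for a fleet, assuming
# independent failures, so fleet MTBF = per-part MTBF / part count.

def fleet_mtbf_hours(part_mtbf_hours: float, part_count: int) -> float:
    """Mean time between failures across the whole fleet, in hours."""
    return part_mtbf_hours / part_count

DIMM_MTBF = 1.5e6  # hours (about 170 years), per the comment

# 50 nodes x 32 DIMMs = 1600 DIMMs -> roughly one failure a month
print(fleet_mtbf_hours(DIMM_MTBF, 50 * 32) / 24)   # ~39 days

# 300 nodes x 32 DIMMs = 9600 DIMMs -> roughly one failure a week
print(fleet_mtbf_hours(DIMM_MTBF, 300 * 32) / 24)  # ~6.5 days
```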

1 Answer


how long will it be able to sustain without any physical damage?

If you buy decent production-quality servers then no, you shouldn't see any damage. In fact there's an argument that you'd see less damage than with servers cycling between running hot and cold, as thermal shock can harm components more than being powered on all the time.

Can there be any performance degradation over time?

Not really, not on any solid-state components anyway. I suppose your PSUs might get slightly less efficient, and your fans may degrade a bit as they get covered in dust.

Obviously no amount of planning will stop components failing mid-life, but if you design your clusters to handle that sort of thing it doesn't have to be business-impacting.
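The "design for failure" idea mentioned above and in the comments usually boils down to a coordinator that re-queues work packages from dead nodes. Here is a minimal, hypothetical sketch of that pattern; the class and method names are illustrative, not any specific scheduler's API.

```python
# Hypothetical sketch: a coordinator that re-queues the work package of a
# failed node, so a mid-run node failure only costs a re-run, not the job.
from collections import deque

class Coordinator:
    def __init__(self, packages):
        self.pending = deque(packages)   # packages not yet dispatched
        self.in_flight = {}              # node -> package currently assigned

    def dispatch(self, node):
        """Hand the next pending package to a healthy node (or None if empty)."""
        if self.pending:
            self.in_flight[node] = self.pending.popleft()
            return self.in_flight[node]
        return None

    def complete(self, node):
        """The node finished its package successfully."""
        self.in_flight.pop(node, None)

    def node_failed(self, node):
        """The node died: put its unfinished package back at the front."""
        pkg = self.in_flight.pop(node, None)
        if pkg is not None:
            self.pending.appendleft(pkg)

c = Coordinator(["pkg-1", "pkg-2", "pkg-3"])
c.dispatch("node-a")
c.dispatch("node-b")
c.node_failed("node-a")    # pkg-1 goes back to the front of the queue
print(list(c.pending))     # ['pkg-1', 'pkg-3']
```

In a real cluster the same idea is handled by a batch scheduler (e.g. a job queue with retry-on-failure), but the invariant is the one above: no package is lost when a node disappears.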

Chopper3
  • Thanks, but I do not understand what you mean by 'production-quality servers'. We are talking about consumer-grade CPUs. For example, I have the Intel i5 10400F or AMD Ryzen 5 5600G (6C, 12T) in mind. I am aware that several top-notch Xeon or Epyc CPUs can certainly meet the requirements. – Peedaruos May 27 '23 at 13:23