It's just something I noticed here at work. We all use our workstations heavily throughout the day, year after year, and we almost never run into problems. But our various dedicated servers go from bad to worse and fail more and more often.

I understand servers get more constant use, especially the hard drives. But I'm still wondering whether the numbers add up.

Sander Versluys
  • Can you elaborate on what happens with these servers so consistently? Do you have admins/IT staff maintaining them? – John Virgolino Apr 30 '09 at 11:42
  • We have 15 dedicated servers managed abroad by the hosting company. Of late we've experienced outages almost weekly. The servers probably need upgrades... – Sander Versluys Apr 30 '09 at 11:48
  • For example, last time a hard disk crashed (luckily it was in RAID 1), but a day later the controller gave up too... bad luck, I guess. – Sander Versluys Apr 30 '09 at 11:49
  • I would argue that in a well-run server room, a crashed disk would be replaced in minutes and the mirrored pair restored in hours. A day later? Who are these outsourced datacenter folks? Inquiring minds want to know! – Stu Thompson Apr 30 '09 at 20:38

4 Answers

A couple of thoughts come to mind. Maybe...

  1. Your servers are not discrete, but a system of servers. Complexity kills.
  2. The number-one priority of whoever is running your servers is not stability, but rather something else
  3. The folks maintaining the servers are operating in a reactive manner, not a proactive one
  4. Your server environment is undergoing continuous change by necessity. New software comes with risk.
    1. higher load means new software or new kit
    2. constant updates with new and sexy features

And could you qualify 'crash'? Do you mean an OS-level big boom? Or the web server going down because of a misbehaving application?

Stu Thompson
  • I agree with point 3, but how can you foresee something like a failing hard drive or RAID controller? – Sander Versluys Apr 30 '09 at 11:51
  • A failed HDD should not take down a system, and RAID controllers don't fail very often; it is also technically possible to have a backup file system in place. But more concrete examples would be things like monitoring temperatures, CPU load, memory usage, disk space, etc., all of which can lead to failure if not dealt with when the first warning signs appear. Many things are preventable, and when they aren't, downtime can be minimized to be almost unnoticeable. – Stu Thompson Apr 30 '09 at 11:59
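The proactive monitoring Stu describes can be as simple as a periodic script that checks a few health metrics against thresholds. This is a minimal sketch, not anything from the thread; the threshold values and function names are illustrative assumptions, and it uses only the Python standard library (so it covers disk space and load average, not temperatures, which need platform-specific tooling).

```python
import os
import shutil

# Illustrative thresholds, not recommendations: tune these for your boxes.
DISK_USAGE_LIMIT = 0.90   # warn when a filesystem is more than 90% full
LOAD_LIMIT = 4.0          # warn when the 1-minute load average exceeds 4

def check_disk(path="/"):
    """Return (fraction used, over-threshold?) for the given filesystem."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return used_fraction, used_fraction > DISK_USAGE_LIMIT

def check_load():
    """Return (1-minute load average, over-threshold?). Unix-only."""
    load1, _, _ = os.getloadavg()  # 1-, 5-, 15-minute averages
    return load1, load1 > LOAD_LIMIT

if __name__ == "__main__":
    used, disk_warn = check_disk()
    load, load_warn = check_load()
    print(f"disk used: {used:.0%} {'WARN' if disk_warn else 'ok'}")
    print(f"load avg:  {load:.2f} {'WARN' if load_warn else 'ok'}")
```

In practice you would run something like this from cron (or use a real monitoring system such as Nagios or Zabbix) and alert on the warnings, so a filling disk or runaway load is handled before it becomes an outage.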

Not my experience: unless a server is faulty, it almost never crashes; workstations crash a lot.

Edit: Oooh, with one exception: Citrix servers crash a lot. We get lots of out-of-memory errors and a full collection of every software bug on the planet, because they're running user-space applications, which are much more likely to have memory leaks.

Richard Gadsden

That makes me wonder about the environmental conditions of your server room. Do you have proper air handling, temperature control, humidity control, etc.? If so, are you tracking each and every hardware problem on the workstation side? We actually see more workstation-related repairs than server ones, and our servers outnumber our workstations by quite a bit. But the perception is that our servers fail more often, because when they do, it impacts more people and represents a greater hardship as a result.

K. Brian Kelley
  • Well, your last point is probably true, but I think servers have a higher failure percentage than workstations. I mean, in a lot of companies there are fewer servers than workstations, so yes, there are more failing workstations in absolute numbers. – Sander Versluys Apr 30 '09 at 11:53
  • Seconded on the environmental conditions. If a server room goes over temperature even once for an hour or so, it can toast all the hardware in it, permanently reducing its reliability. – pjc50 Jul 01 '09 at 13:52

Not in my experience.

We run over 4,000 desktop PCs (staff and public terminals) and have a fairly high churn rate for forced replacement (as well as the usual "replace after 4 years").

For servers we very rarely have major problems. Yes, the odd PSU failure (you do have dual supplies?) and dead drives, but generally, good servers (HP ProLiants) just keep on going.

Guy
  • Well, I can see that; I work in a small company, and our servers outnumber our workstations. – Sander Versluys Apr 30 '09 at 11:45
  • We have plenty of servers, running into the hundreds, but we have very few fail (unlucky if we have one a year). (Tell a lie: we have a big Sun 6800 server that, after 5 years of trouble-free uptime, blew a motherboard and some RAM a few weeks ago.) – Guy Apr 30 '09 at 12:22