-8

Before I start, let me mention that I have worked as an IT manager for large international companies my whole career, and I hold a master's degree in information technology. I therefore believe I am fully competent to do what I am doing now. However, I now face a difficult challenge that I have never seen before. My troubleshooting is very limited because every failed attempt costs me a significant amount of money.

I am working on my startup project, part of which is a storage server with 72 TB of storage space. I built the storage server myself, as I have built hundreds of PCs and servers before. My problem now is that the server keeps destroying the hard drives.

After turning on the server, the hard drives either burn out with a cloud of smoke and a burn mark on the HDD board, or are no longer recognized by any other PC I connect them to afterwards.

As my resources are limited, I have built my server from value parts where possible:

As you can probably understand, I cannot troubleshoot and test my progress on additional hard drives. Every failure would mean another HDD destroyed. I have already destroyed 12 brand new WD Red 4TB HDDs.

I came here for advice on how to troubleshoot and identify the broken component. Would purchasing a multimeter and measuring the power output on key connectors help with my problem? How should I proceed? Do you have any other ideas?

What do you believe could be causing the problem? Of course, all the connectors are correctly connected. This was the first thing I checked. Moreover, they would not fit any other connectors, so they are surely connected correctly. My motherboard behaves correctly; it does not randomly reboot.

In this situation, any advice is worth considering. But please remember, I do not have a spare PSU or a spare 20-bay chassis to swap in and test again.

The 20-bay storage chassis has backplanes which connect the HDDs together. Do you think that there might be something wrong with the backplanes that would result in such problems?

Thank you in advance.

Bunkai.Satori
  • Start by checking the voltage on each rail of the SATA power connectors with a multimeter, no need to burn more hard drives. –  Mar 07 '15 at 17:11
  • 5
    Oh and sidenote : that machine can barely be called a "server" due to the consumer grade components like the motherboard or PSU. –  Mar 07 '15 at 17:14
  • 3
    @AndréDaniel, I believe purpose defines a server, not its components. A server serves a service. Something that is a workstation now could have been a server 5 years ago. Thanks for your advice about the multimeter. – Bunkai.Satori Mar 07 '15 at 18:39
  • 8
    I disagree. A server is about production service, and that entails two things: retaining the data you have and being as available as possible. Value, performance and manageability are all very important too, but not compared to these two, and you seem to have problems with both of these when it's actually very easy indeed to build a secure and reliable server of that size using server-grade parts. – Chopper3 Mar 07 '15 at 18:48
  • 1
    Just throw everything in the bin and buy an off-the-shelf server. – user9517 Mar 07 '15 at 20:43
  • @Iain, it's like saying, buy yourself a Ferrari if you wish to have one some day, and do not get anything else :-) Of course, I will purchase a regular server, probably 6x as expensive, but later, when the payment will not be so painful. An HP server with 72 TB of RAID 6 space would cost me ca. 15.000,- EUR. I was able to purchase this for something over 4.000,- EUR. – Bunkai.Satori Mar 07 '15 at 22:15
  • 2
    Your 'server' is burning through disks - it's dead, move on. End. – user9517 Mar 07 '15 at 22:18
  • 1
    The reason why you buy off the shelf servers is because, in theory, your production time is too valuable to take risks on unproven combinations of hardware and drivers. Either you pay this price up front for minimal downtime and professional support, or you inevitably pay it later *with* downtime and your own support hours. – Andrew B Mar 07 '15 at 23:15
  • Hmm, instead of analyzing why I invested just 4.000,- EUR rather than 15.000,- EUR, I would really prefer it if you guys came up with some useful advice on how to solve the existing situation within the given constraints. – Bunkai.Satori Mar 07 '15 at 23:19
  • 4
    @Bunkai.Satori There _is_ no professional solution with the constraints you've imposed. It's like asking for a Ferrari, but demanding it's made out of straw and dead cats. – Wesley Mar 07 '15 at 23:26
  • 3
    You've invested a lot more than 4.000 EUR if you count your time. Unless you're working for free, the cost of your hourly salary times the number of hours you've spent is probably edging up towards the cost of an off-the-shelf server by now. – Katherine Villyard Mar 07 '15 at 23:28
  • @KatherineVillyard, hi Katherine. This is my own project, which I work on in my available time. I haven't dedicated too much time to building that server. Basically, I knew what I was doing. However, now I feel like I am apologizing for having gone down this path. I simply did; it is a fact. According to my analysis, this was the best decision to make. Just accept that. I would much prefer practical ideas for the problem I have. That is the reason I came here, not to apologize for not having purchased a complete solution. – Bunkai.Satori Mar 07 '15 at 23:47
  • If I had to guess, you may have a shorted voltage regulator (VR) on the ATA bus. I once had a failed motherboard with this problem. The symptoms were frequent ATA errors reported on the console. A simple visual inspection found the VR physically cracked near the IDE connector on the motherboard. – davide Mar 27 '15 at 23:58

2 Answers

5

I'm a bit iffy on the competence bit. I'm an IT management grad and they don't teach you squat about hardware. There are a few simple truths here:

  • At some point in time, dead hardware is dead hardware.

    Time/effort costs money. You may not be able to fix this.

  • Hard drives aren't free

    well, unless you have a service contract that covers everything. We do. Our supplier will send us new drives via DHL in 4 hours for our drive enclosures. There's a reason real server stuff costs money.

  • STUFF IS BURNING OUT is never a good sign.

    The magic smoke must not escape.

  • Damn it Jim, you're an IT manager, not a hardware engineer

    You actually don't really have a good enough understanding of hardware to fix it. Hell, our supplier just swapped out our entire enclosure when we had some small part break.

If it's new, it's under warranty. Use it.

I'd also reconsider a few incorrect notions you seem to have. Old servers aren't workstations in most places (we run our servers into the ground, and our workstations get rotated down; we don't use our servers as workstations). A server would have shiny things like redundant power (which a workstation would not), and a workstation would be an E-ATX box rather than a rackmount.

School doesn't count for much sometimes; common sense does. And common sense says your hardware is broken and, if it's new, you need to get it replaced under warranty, because the damn thing is eating hard drives.

FWIW, it's the enclosure.

Journeyman Geek
  • Hi, and thanks for your response. If you had read my question carefully, you would have seen which hardware is new and which is reused. None of the reused hardware is related to my problems. As an IT manager, I have built hundreds of configurations, which gives me the level of expertise needed for what I am doing now. If you had read carefully, you would understand that I need to identify the faulty unit. I do not want to send properly working units in for repair. What you suggest is sending everything under warranty in for repair, which will cost time and looks very unprofessional. – Bunkai.Satori Mar 07 '15 at 23:26
  • 3
    @Bunkai Please, just stop. If you're having to repeatedly bring up your credentials, you're using them as a crutch. *We don't care.* – Andrew B Mar 07 '15 at 23:31
  • 4
    Professional? Professional is calling up the contractor and yelling at them for selling you shit that's broken. Your disk array is probably where the problem is. Call up your vendor, and yell at them. Point out you have had 12 disks burnt out. *Our* vendor would probably send out a service engineer and take care of it. I'm sorry but no matter how fun hardware is, a true professional would know when to let *another* true professional do his thing – Journeyman Geek Mar 07 '15 at 23:33
  • 2
    I'd also add that a disk array that is burning out disks is a VERY large paperweight, and the argument about costing time is plainly illogical. It's costing you time now: every moment until you pick up that phone, or fire up the email client, and let your enclosure vendor know he sold you a bum unit. – Journeyman Geek Mar 07 '15 at 23:37
  • @AndrewB, no, you please stop! Instead of speaking to the point, you pick on my credentials and are overly focused on them. If you do not get why I put those lines there, OK, I will help you: just to tell you that I know what I am doing, so that you do not need to ask me elementary questions like whether the cables are properly connected. I do not have time to justify the way I asked. Either stick to the point of my question, please, or do not waste my time! – Bunkai.Satori Mar 07 '15 at 23:41
  • @JourneymanGeek, you are shifting your focus to dimensions I do not want to analyse and discuss. My question is clearly stated. If you have anything to the point, anything helpful, any good advice, I will gladly read it. Please do not waste my time with this useless information. – Bunkai.Satori Mar 07 '15 at 23:44
  • 1
    @Bunkai.Satori If you knew what you were doing, you wouldn't be doing this in the first place. – Wesley Mar 07 '15 at 23:53
  • 1
    @Bunkai Without getting further sidetracked over our existing back and forth...RE: "dimensions I do not want to analyse and discuss", Serverfault Q&As aren't a simple matter of "give me the answer I want within the criteria I define". If we think you're misguided, we're going to tell you that, because the alternative is to let everyone who cruises in from Google take this as an example that they should be doing the same things in their workplace. – Andrew B Mar 07 '15 at 23:58
  • 2
    You've burnt out 12 disks. Your way isn't working. Maybe, just maybe, listening to a guy who actually babysits much bigger, critical disk arrays might be a good idea, especially when it's an easier, cheaper solution? – Journeyman Geek Mar 08 '15 at 00:00
  • @Wesley, I am not sure I know what you mean. Do you mean, why did I reuse a PC I had and decide to make a server from it? The answer is simple: to save some money. You believe that saving money is wrong? OK, but I don't. – Bunkai.Satori Mar 08 '15 at 00:21
3

Such catastrophic failures are almost always caused by a much higher than nominal voltage on a power rail. It should be relatively simple to use a multimeter to measure the voltage on each pin of the SATA power connector.

As you mention a (custom built) backplane: have you tried to connect a single hard disk directly to the power connector, bypassing the backplane-provided power?
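To make the multimeter check concrete, here is a minimal sketch of how the readings could be judged. It assumes the three nominal SATA power rails (3.3 V, 5 V, 12 V) and the ±5% tolerance that the ATX design guide gives for those rails; the function name `check_rails` and the sample readings are hypothetical, not measurements from this server.

```python
# Nominal SATA power rail voltages. The 5% tolerance is an assumption
# taken from the ATX design guide for the +3.3 V, +5 V and +12 V rails.
NOMINAL_RAILS = {"3.3V": 3.3, "5V": 5.0, "12V": 12.0}
TOLERANCE = 0.05


def check_rails(measured):
    """Return {rail: (low, high, ok)} for each measured rail voltage."""
    report = {}
    for rail, volts in measured.items():
        nominal = NOMINAL_RAILS[rail]
        low = nominal * (1 - TOLERANCE)
        high = nominal * (1 + TOLERANCE)
        report[rail] = (round(low, 3), round(high, 3), low <= volts <= high)
    return report


if __name__ == "__main__":
    # Hypothetical readings taken at one backplane drive connector,
    # with no drive inserted.
    readings = {"3.3V": 3.31, "5V": 11.9, "12V": 12.05}
    for rail, (low, high, ok) in check_rails(readings).items():
        status = "OK" if ok else "OUT OF RANGE"
        print(f"{rail}: {readings[rail]} V (allowed {low} to {high} V) {status}")
```

Any rail flagged as out of range at a drive connector, while the same rail reads fine directly at the PSU cable, would point at the backplane rather than the PSU.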

shodanshok
  • Hi, and thanks for your response. Yes, besides the 12 HDDs in the main HDD bay area, I have two 2.5 inch HDDs which will hold the database and the OS. These were connected directly to the PSU the whole time, bypassing the backplanes, even during the server crash. These two remained intact. These two WD Blue 512GB drives are the only HDDs that still work. – Bunkai.Satori Mar 07 '15 at 18:43
  • What could these backplanes do to have such an effect on my HDDs? If they get a balanced and correct amount of power, they cannot simply multiply the available power enough to burn the HDDs, can they? – Bunkai.Satori Mar 07 '15 at 18:47
  • 2
    So you found your problem: the backplane. Maybe a short circuit or something similar. With the help of a multimeter you should be able to track down the specific issue. – shodanshok Mar 08 '15 at 07:39
  • Thank you for your comments. Yes, I will focus on the backplanes. The chassis has 5 backplanes, with 4 HDDs connected to each. It seems highly improbable to me that more than one of them would be broken, yet the destroyed drives were connected to 3 different backplanes. – Bunkai.Satori Mar 08 '15 at 12:57