
I’m sorry if this should be on SuperUser instead of ServerFault. Please ask me to migrate the question instead of flaming.

I’ve had two Windows desktops go down on the network within one month, one running Windows 7 and the other Windows 8. The network has 6 machines with a PDC, plus another DC in Azure and a few other machines on an Azure virtual network.

The machines are 2-year-old Asus i7 boxes (4 cores, 8 logical processors) with 32 GB of memory and an SSD as the main disk. They are used in a development shop, so everybody has everything installed. The 2 machines that went down are running local SQL Server instances (and one also runs MySQL and PostgreSQL).

When the first one went down, we blamed the SSD for the crash. Some aspects of the crash set off a few warning lights in my head, but being swamped (I'm a developer who is also trying to bang some sense into the network), I did nothing.

OK. Then, with the main system disk (SSD) on my machine quite full, I decided to run the Disk Cleanup utility to clean up system files. I noticed that I had 192 GB in system files, thought nothing of it, and ran it. A few hours later I started getting strange vibes from the machine and tried to start Task Manager: file not found error! I went straight into system32 and, lo and behold, no files were left except those locked by the file system.

I tried to download virus scanners, but they could not install because the UAW exe was gone. I managed to get a malware scanner down (one that did not need an install), but it did not give me any good reason for the situation. I went to another Windows 7 machine and managed to copy all the system32 files to my file system. My intention was to do a safe reboot and copy the files manually into system32 to hopefully get it running (I've got a deadline staring at me), but of course that did not work: the boot sector was gone.

The shadow copy folders were gone and the restore points were gone too, so I had to do a clean install. The disk is not reporting any errors.

I scanned the network and found a hidden service on the PDC (rootkit). But I know of no virus that does this kind of damage.

So, finally, the question is:

Can a failing SSD behave like this? And if not, what kind of virus can do this kind of damage?

Edit

I know the network is compromised and needs to be reinstalled. But the question is: are the clients going down because of a virus, or can this be an SSD crash or a Windows Update failure? (The latter is the company owner's answer to it all, and he only wants to remove the rootkit and then continue.)

Archlight
  • SSDs running SQL servers while nearly full could behave exactly like you describe. You should check the SMART values; often there is something like "wear leveling count" or similar. – Dennis Nolte Jul 18 '14 at 07:27
  • Thanks, this pointed me in the right direction. Wear leveling count = 97, which is not getting close to my 10,000 write cycles. But uncorrectable error count 100, ECC error rate 200, and so on. Yes, it looks like the disk failed but has recovered. Please write this as an answer. – Archlight Jul 18 '14 at 09:36

2 Answers


I think you might be in a bit over your head...

For starters, there aren't PDCs anymore, and that concept is a long time gone. Read about FSMO roles.

If you found a rootkit on one of your DCs, you need to level it and make a new one! You also need to do root cause analysis and figure out how it got there, because if you don't, it will keep happening. You should be restoring from a DS backup but if you don't have one you could always add a new DC to the domain before removing and paving the compromised one. Don't try to remove viruses in cases like this; the cost of missing something is very high and the persistence mechanisms can cause random problems later.

Maybe your boxes being down is the result of malware running with domain admin rights; maybe not. You don't seem to have any information about that. However, rebuilding PCs that have been joined to a compromised domain is never a bad plan, and you could start there with at least one of them. But maybe it is a coincidence. Either way, since you're overwhelmed trying to micromanage system performance and remove viruses like a computer enthusiast on a network of fewer than 10 machines, you can probably see why it's bad sysadmin practice.

It's worth noting as well that slow access times are not a usual symptom of a faulty SSD. Also, the outages could be caused by literally anything you've mentioned, but what sticks out for me is that you seem to have a total security compromise. Start there, with the paving.

Falcon Momot
  • I know the DC needs to go. For 6 users I am inclined to just tell them to start all over. As for the root cause, it's really hard to know. I used to be a sysadmin alongside development back in the days of win 3.5 to win 2000, and I really don't want to be doing this. – Archlight Jul 17 '14 at 08:57

As you have already written, it seems you actually have a failed or soon-to-fail SSD.

Running a SQL database on a nearly full SSD can degrade the SSD quickly.

To get at least some advance warning of this, check the SMART values of the SSD.

Some of the important values are "Wear-Leveling Count" and "Uncorrectable Error Count".
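As a rough illustration of checking those values, here is a minimal Python sketch that parses the attribute table printed by `smartctl -A <device>` (from smartmontools) and flags suspicious entries. The sample text is made up for illustration and is not data from the question, and the attribute names in `ERROR_COUNTERS` are an assumption, since vendors label these attributes differently.

```python
# Sketch: parse the attribute table printed by `smartctl -A <device>`.
# SAMPLE is made-up illustrative output, not data from the question.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
177 Wear_Leveling_Count     0x0013   097   097   000    Pre-fail  Always       -       97
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       12
"""

# Assumption: these names vary by vendor; adjust to what your drive reports.
ERROR_COUNTERS = ("Reallocated_Sector_Ct", "Reported_Uncorrect")


def parse_smart_attributes(text):
    """Return {attribute_name: {"value", "worst", "thresh", "raw"}}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),   # normalized value
                "worst": int(fields[4]),   # worst normalized value seen
                "thresh": int(fields[5]),  # failure threshold
                "raw": fields[9],          # first token of the raw value
            }
    return attrs


def flag_problems(attrs):
    """List attributes at or below threshold, or error counters with a non-zero raw value."""
    problems = []
    for name, a in attrs.items():
        below_threshold = a["thresh"] > 0 and a["value"] <= a["thresh"]
        raw_errors = (
            name in ERROR_COUNTERS and a["raw"].isdigit() and int(a["raw"]) > 0
        )
        if below_threshold or raw_errors:
            problems.append(name)
    return problems


attrs = parse_smart_attributes(SAMPLE)
print(flag_problems(attrs))
```

To use it on a real drive, the output of `smartctl -A` would be passed in place of `SAMPLE`. How to interpret raw values is vendor-specific, so treat any flag as a prompt to look closer rather than as a verdict.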

Depending on your SSD, you can theoretically get a lot of repeated writes (10,000 or even more) on one cell, but you might reach that limit faster than you think when all the existing data is still in use and garbage collection can only recycle some of the cells.

Sure, the SSD's controller usually takes care of that, but controllers have only gotten significantly better during the last 1-2 years.
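To make the endurance argument concrete, here is a back-of-the-envelope Python sketch. It uses the common approximation that total host writes are roughly capacity × rated program/erase cycles ÷ write amplification. The capacity, daily write volume, and write amplification figures below are hypothetical, chosen only to show how a nearly full drive (higher write amplification) shortens the estimate.

```python
def estimate_lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amplification):
    """Rough SSD endurance estimate in years.

    Assumes perfectly even wear leveling; real drives will differ.
    write_amplification is how many bytes hit the flash per byte the host writes.
    """
    if daily_writes_gb <= 0 or write_amplification <= 0:
        raise ValueError("daily_writes_gb and write_amplification must be positive")
    total_host_writes_gb = capacity_gb * pe_cycles / write_amplification
    return total_host_writes_gb / (daily_writes_gb * 365)


# Hypothetical numbers: a 256 GB drive rated for 10,000 cycles, 50 GB written per day.
mostly_empty = estimate_lifetime_years(256, 10_000, 50, write_amplification=1.5)
nearly_full = estimate_lifetime_years(256, 10_000, 50, write_amplification=10.0)
print(f"mostly empty: {mostly_empty:.1f} years, nearly full: {nearly_full:.1f} years")
```

The absolute numbers matter less than the ratio: with the same workload, the estimate scales inversely with write amplification, which is one reason a database on a nearly full SSD can wear it out sooner than expected.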

Basically, to sum it up:

The SSD broke.

Advice: split the OS and applications across at least 2 separate disks/SSDs, use some RAID to prevent downtime, and never forget backups.

Dennis Nolte