
It's not recommended to use ZFS for a computer without ECC RAM. So, what's a good alternative then? Or is the risk the same, so it doesn't matter what manager I use, it'll be the same problem if a bit in RAM flips anyway?

I'm trying to determine if I should or should not use ZFS. If I shouldn't, is there anything that comes close that's safer with non-ECC RAM?

  • You don't have an option to just use a system with ECC RAM? It's a pretty low barrier to meet. If you want ZFS, use quality hardware. – ewwhite Nov 20 '14 at 09:31
  • ECC RAM does not eliminate all possible errors. You still need a plan for the case of data corruption. – usr Nov 20 '14 at 09:49
  • Please do not post on more than one site in the future. – cutrightjm Nov 20 '14 at 17:43
  • @ewwhite I currently have a desktop motherboard and desktop RAM. I don't feel like upgrading just now, so I'm going to have to suck it up with what I've got. Solid hardware I bought with the expectation to run it for many years. Clearly should have done more research for a home server. – leetNightshade Nov 20 '14 at 17:52
  • @ekaj I removed the other post and edited this one. – leetNightshade Nov 20 '14 at 17:53

2 Answers


The problem is that ZFS's error-correction features (checksums and scrubs) can potentially turn a memory corruption error into a total loss of data, as opposed to, say, XFS, which will happily write your error to disk and damage only the affected block(s).

Andrew Domaszek

I'm using ZFS with non-ECC RAM, and widely enough. I'm not writing this to say it's safe; however, over several years I haven't seen ZFS corruption yet. Furthermore, when using ZFS on ancient hardware, I saw all sorts of memory problems, up to and including an inability to boot. In my experience, with memory like that you will hit all sorts of fatal kernel traps long before you hit ZFS data corruption. Corrupted memory can also lead to data corruption when using other filesystems. And even if I'm wrong, the claim that 'ZFS checksumming will amplify the impact of data corruption instead of minimizing it' sounds illogical to me, because ZFS doesn't self-heal silently. There are enough counters in `zpool status` to make you suspect that something is starting to happen.

After all, take your backups and store them elsewhere; ZFS isn't a silver bullet.
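The `zpool status` counters this answer refers to can be watched from a script. Below is a minimal sketch, assuming the usual NAME/STATE/READ/WRITE/CKSUM column layout of the device table; the sample lines fed to it are fabricated for demonstration, not output from a real pool:

```shell
#!/bin/sh
# Sketch: flag nonzero READ/WRITE/CKSUM error counters in `zpool status`
# output. Real usage would be something like:
#   zpool status tank | check_counters        # "tank" is a placeholder name
check_counters() {
    # Device lines look like:  ada1  ONLINE  0  0  3
    awk '$2 ~ /^(ONLINE|DEGRADED|FAULTED)$/ && $3 + $4 + $5 > 0 {
        print "errors on " $1 ": READ=" $3 " WRITE=" $4 " CKSUM=" $5
    }'
}

# Demonstration on fabricated sample lines:
printf '  mirror-0  ONLINE  0 0 0\n  ada1  ONLINE  0 0 3\n' | check_counters
# prints: errors on ada1: READ=0 WRITE=0 CKSUM=3
```

Hook the nonzero-counter case into whatever monitoring you already run; the parsing is deliberately loose and only a starting point.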

drookie
  • "take you backups and store them elsewhere." Okay, in that other system you do recommend backing up to, what would the setup basically look like? ECC and ZFS or something else? – leetNightshade Nov 20 '14 at 17:58
  • I store my backups on another ZFS system, regardless of ECC. I'm also using snapshots on my production systems (the ones I take backups of), and I'm sure everyone should do this too, because it's instant and cheap (some space is consumed, but, unlike LVM, the space for snapshots is taken from the free space of the pool). But that's only me; the choice is up to you. If you want decent backups, I can suggest tape backups. They're cheap enough, and they can be stored long enough: last summer I was recovering a deleted DB from a 5-year-old backup. – drookie Nov 20 '14 at 18:02
  • Fun fact: the only time I got data corruption on ZFS was on a server with ECC memory, but on a non-redundant ZFS pool: it was an HP server with a SmartArray controller, so I had to use its array, since SmartArray isn't capable of passthrough. The data corruption was caused by a power loss. So I just rolled back to the last snapshot. – drookie Nov 20 '14 at 18:05
  • And another ZFS feature I need to mention (even if you're aware of it, because it's still truly amazing): you can transfer the snapshots themselves over the network. Even incremental snapshots. – drookie Nov 20 '14 at 18:07
  • Thank you, the features blow my mind! So ZFS can be safe on non-ECC, you just have to leverage its safety features. `zpool status` is reliable; do you think it's safe to use a script to run that at startup, followed by an automated snapshot? – leetNightshade Nov 20 '14 at 18:09
  • I'm avoiding saying that something is safe; I'm saying it's safe enough for me. Whether it's safe enough for you is up to you to decide. Take backups. Take backups of backups, if your data deserves it. Hardware dies, server rooms get flooded, fires happen, and, what's more important, these sad things happen when things are bad enough already without the need of additional disasters. – drookie Nov 20 '14 at 18:14
  • About ZFS: there are error counters on the pool. They increment each time an error is found. After some threshold the pool becomes `DEGRADED`, even if all of the pool members are online, so check its status with your monitoring software. Solaris has the `fmadm` and `sma` subsystems to alert about this. I take snapshots with a script in the crontab. FreeBSD has a `sysutils/snap` port; as I heard (if I'm not mistaken), it now implements ZFS snapshot creation. – drookie Nov 20 '14 at 18:20
  • There's a decent guide for ZFS here (if you're not already familiar): https://docs.oracle.com/cd/E26505_01/html/E37384/ – most of it applies to other operating systems too, like FreeBSD or Linux with OpenZFS. – drookie Nov 20 '14 at 18:22
  • I don't know if I want to seriously consider it just yet, since I don't have that much sensitive data where I can't use offsite backups, but do you have a recommended tape drive? I saw an /unrated/ one with little info on NewEgg for $170 that hooks into a computer; I'm having trouble finding much else that's not an entire computer for less than $500. [edit] What is the name of the type of tape backup you're talking about? I see mention of RDX or cartridge disk drives. – leetNightshade Nov 21 '14 at 01:25
  • I was using HP tape drives, from the HP DAT24 (which is kind of ancient now) to Ultrium. External ones. There's plenty to choose from. – drookie Nov 21 '14 at 05:34
  • I just thought of this: do you know if you have to worry about bit rot in snapshots? Thanks for all of your help thus far! – leetNightshade Nov 23 '14 at 06:22
  • If you use redundant pools (and you really should), I don't think ZFS will corrupt valid data on non-corrupted pool members. The authors of the articles referenced here don't consider this part at all. For that to happen, the checksum would have to match for already-invalid data and not match for the valid data. The situation where the checksum doesn't match for both blocks is far more likely. I've reread all the articles about ZFS vs. ECC and found them purely theoretical, mostly considering a Murphy's-law reality where all the bad things happen and there's no hope at all. – drookie Nov 23 '14 at 11:05
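The snapshot-and-incremental-send workflow discussed in these comments can be sketched as a short script. This is only a sketch: the dataset `tank/data`, the host `backuphost`, and the target `backuppool/data` are placeholders I made up, not details from the answer, and a real deployment would need error handling around the `ssh` pipe.

```shell
#!/bin/sh
# Sketch of the "snapshot, then send it over the network" workflow.
# All names (tank/data, backuphost, backuppool/data) are placeholders.

backup_dataset() {
    dataset=$1
    now="${dataset}@$(date +%Y%m%d-%H%M%S)"

    # Snapshots are instant and cheap; they consume space only as the
    # live data diverges from them.
    zfs snapshot "$now"

    # The second-newest snapshot, if any, is the base for an incremental send.
    prev="$(zfs list -H -t snapshot -o name -s creation -d 1 "$dataset" \
            | tail -n 2 | head -n 1)"

    if [ -n "$prev" ] && [ "$prev" != "$now" ]; then
        # Incremental send: only the blocks changed since $prev travel.
        zfs send -i "$prev" "$now" | ssh backuphost zfs receive backuppool/data
    else
        # No earlier snapshot exists, so do a full send.
        zfs send "$now" | ssh backuphost zfs receive backuppool/data
    fi
}

# Example invocation on a live system (requires ZFS on both ends):
#   backup_dataset tank/data
```

Running this from cron gives roughly the setup drookie describes: regular cheap snapshots locally, with copies pushed to a second ZFS system.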