
I recently added a 7th 2TB drive to a Linux md software RAID 6 setup. After md finished reshaping the array from 6 to 7 drives (growing from 8TB to 10TB), I was still able to mount the file system without problems. In preparation for resize2fs, I then unmounted the partition, ran fsck -Cfyv, and was greeted with an endless stream of millions of seemingly random errors. Here is a short excerpt:

Pass 1: Checking inodes, blocks, and sizes
Inode 4193823 is too big.  Truncate? yes
Block #1 (748971705) causes symlink to be too big.  CLEARED.
Block #2 (1076864997) causes symlink to be too big.  CLEARED.
Block #3 (172764063) causes symlink to be too big.  CLEARED.
...
Inode 4271831 has a extra size (39949) which is invalid Fix? yes
Inode 4271831 is in use, but has dtime set.  Fix? yes
Inode 4271831 has imagic flag set.  Clear? yes
Inode 4271831 has a extra size (8723) which is invalid Fix? yes
Inode 4271831 has EXTENTS_FL flag set on filesystem without extents support. Clear? yes
...
Inode 4427371 has compression flag set on filesystem without compression support. Clear? yes
Inode 4427371 has a bad extended attribute block 1242363527.  Clear? yes
Inode 4427371 has INDEX_FL flag set but is not a directory. Clear HTree index? yes
Inode 4427371, i_size is 7582975773853056983, should be 0.  Fix? yes
...
Inode 4556567, i_blocks is 5120, should be 5184.  Fix? yes
Inode 4566900, i_blocks is 5160, should be 5200.  Fix? yes
...
Inode 5628285 has illegal block(s).  Clear? yes
Illegal block #0 (4216391480) in inode 5628285.  CLEARED.
Illegal block #1 (2738385218) in inode 5628285.  CLEARED.
Illegal block #2 (2576491528) in inode 5628285.  CLEARED.
...
Illegal indirect block (2281966716) in inode 5628285.  CLEARED.
Illegal double indirect block (2578476333) in inode 5628285.  CLEARED.
Illegal block #477119515 (3531691799) in inode 5628285.  CLEARED.

Compression? Extents? I've never had ext4 anywhere near this machine!

Now, the problem is that fsck keeps dying with the following error message:

Error storing directory block information (inode=5628285, block=0, num=316775570): Memory allocation failed

At first I was able to simply re-run fsck and it would die at a different inode, but now it's settled on 5628285 and I can't get it to go beyond that.

I've spent the last few days searching for fixes to this and found the following 3 "solutions":

  • Use 64-bit Linux. /proc/cpuinfo contains lm as one of the processor flags, getconf LONG_BIT returns 64 and uname -a has this to say: Linux <servername> 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux. Should be all good, no?
  • Add a [scratch_files] section with directory = /var/cache/e2fsck to /etc/e2fsck.conf (see the snippet after this list). Did that, and every time I re-run fsck it adds another 500K *-dirinfo-* and an 8M *-icount-* file to the /var/cache/e2fsck directory. So that seems to have its desired effect as well.
  • Add more memory or swap space to the machine. 12GB of RAM and a 32GB swap partition should be sufficient, no?
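For reference, the /etc/e2fsck.conf from the second suggestion ends up looking like this (just the standard stanza documented in the e2fsck.conf man page):

    # /etc/e2fsck.conf
    [scratch_files]
        directory = /var/cache/e2fsck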

Needless to say: Nothing helped, otherwise I wouldn't be writing here.

Naturally, now the drive is marked bad and I can't mount it any more. So, as of right now, I lost 8TB of data due to a disk-check?!?!?

This leaves me with 3 questions:

  • Is there anything I can do to fix this drive (remember, everything was fine before I ran fsck!) other than spending a month to learn the ext3 disk format and then trying to fix it manually with a hex editor???
  • How is it possible that something as mission-critical as fsck, for a file system as popular as ext3, still has issues like this??? Especially since ext3 is over a decade old.
  • Is there an alternative to ext3 that doesn't have these sorts of fundamental reliability issues? Maybe jfs?

(I'm using e2fsck 1.42.5 on 64-bit Debian Wheezy 7.1 now, but had the same issues with an earlier version on 32-bit Debian Squeeze)

Markus A.
  • I don't want to rub it in, but **no** filesystem is reliable enough to make it possible to not have a backup, especially before such dangerous operations. So, just start your restore and if you don't have a backup, don't blame anyone but yourself. – Sven Jun 21 '13 at 18:37
  • `fsck -y` is a bad idea. You always want to know that there are problems before you decide to do something about it. In your case, you could have noticed that `fsck` was reporting way too many errors and investigated before actually making any changes. – longneck Jun 21 '13 at 18:40
  • @SvW I guess I wasn't aware that fsck is a dangerous operation... I naively thought it was supposed to FIX, not BREAK things... – Markus A. Jun 21 '13 at 18:40
  • @longneck Agreed that fsck -y is probably bad, but I'm not sure what else to do. One guy in another forum complained that he had cancelled fsck when it came up with a lot of errors and that broke his fs. So, basically, my understanding right now is: DO NOT EVER USE FSCK EVER! (Because -y shouldn't be used, but aborting it once it runs isn't safe either) – Markus A. Jun 21 '13 at 18:44
  • @MarkusA.: My approach is to use `fsck` according to its name: File System CHeck; it says nothing of a repair. Once the damage gets significant, I pull out my backup. – Sven Jun 21 '13 at 18:46
  • @MarkusA. No, now you're just over-reacting. Using `fsck` to LOOK for errors is safe. If a file system is broken, then you've probably already lost data. `fsck`'s primary goal is to make the filesystem consistent. Data recovery is just a (sometimes happy) by-product of making the filesystem consistent. – longneck Jun 21 '13 at 18:49
  • Sorry... just a bit frustrated with the whole situation... Because, unfortunately, in this case I don't have a backup for all of it. It's not a huge deal, because none of the lost data is absolutely mission critical, but the RAID did contain things I would have rather kept... Is it really so hard to make a disk CHECK tool that doesn't make things worse??? I'd be totally fine if it said: "There's problems that we don't think we can/should fix. Mount the drive one last time and try to copy as much of your data as you can!" But "I found an error. Let me kill the ENTIRE drive." is just weird, no? – Markus A. Jun 21 '13 at 18:54
  • That's like Ford saying: "Let's replace the airbag with an explosive device that blows up the entire car instead. At least then, no driver will ever have to suffer for too long..." – Markus A. Jun 21 '13 at 18:56
  • I make this a comment as it would be downvoted so much otherwise: use a decent filesystem instead of extN filesystems. Use XFS if you want a simpler solution. Or use ZFS. For this kind of scenario ZFS just works better. – cstamas Jun 21 '13 at 19:23
  • @MarkusA. Again, no. Your analogy is completely off-base. It's more like air bags reduce crash fatalities by 80%, but they also cause an additional 1% of injuries that would not have occurred. Statistically you're better off with air bags even though you're trading one type of trouble (the 80% reduction) for another (1% of cases where the airbag causes an injury). Similarly, just because `fsck` seemingly made your problem worse doesn't mean the tool is flawed. Remember: by specifying `-y` you told `fsck` to blindly fix without warning. You chose to drive with your nose against the airbag. – longneck Jun 21 '13 at 19:30
  • @cstamas I've actually been eyeing ZFS for a little while. It looks amazing in terms of features. So far I just haven't convinced myself to trust it as it's not officially part of Debian yet and I'm always a little skeptical of external "build them yourself" tools. I prefer them to be tested in the actual environment where I would want to use them. But maybe xfs is a good way to go... – Markus A. Jun 21 '13 at 19:32
  • @longneck Agreed... 1% extra injuries do happen. That also happens when you run chkdsk on DOS and you end up with a directory being renamed to FOUND.000 and the files inside it to FILE0000.CHK, etc. Then you go in with a hex-editor, identify the files, rename them back to the right name and you're good. I can accept that. But do you REALLY think it is ok for fsck to make the ENTIRE disk unusable? That should NEVER happen under ANY circumstances! Also, wouldn't you expect it to be bug-free by now? Why are there memory allocation errors? And this problem has been around the forums for YEARS!!! – Markus A. Jun 21 '13 at 19:41
  • `fsck` is a beast that should not be taken lightly as you have seen. It may find errors that are not errors if they're caused by a configuration issue because it "expects" the filesystem to be in some order and it's not. However, it does *not* make changes unless you tell it to. `fsck` is a **very** dangerous option to use without backups. – Nathan C Jun 21 '13 at 20:00
  • @NathanC I would consider this a major robustness flaw in the design of the file system. And how beastly can e2fsck possibly be? All it needs to do is verify a couple of reference pointers... Chkdsk and Scandisk on FAT file systems don't have these kinds of issues. I have NEVER before in my life lost data to a disk-check-utility. Especially on a RAID system that didn't even have any drive failures... But it's good to know that I need to be careful here. – Markus A. Jun 21 '13 at 20:07
  • I think you're being a little harsh on ext3, here. You performed two majorly dangerous things, the first being to try to expand the number of columns in a live RAID, the second being resizing a file system. The second of those is not a regular happening, and the code paths are not all that well trodden. But the first is crazy dangerous; I've never done it in 20+ years of UNIX sysadmin. If you don't punch ext3 in the face, it probably won't beat up on you; but if you do - and you did - then you can't be too surprised when this sort of thing happens. – MadHatter Jun 21 '13 at 20:38
  • @MarkusA.: It is absolutely well understood that RAID is not backup and that if you care about data, you must have a backup. The exact mechanism by which you lost your data is basically irrelevant because so many such mechanisms are possible. That's why you have to have a backup. If you cross the street without looking both ways, you can't blame the speeding truck that hit you because if not for the speeding truck, it would have been the car that wasn't speeding, the bus, or the motorcycle. – David Schwartz Jun 22 '13 at 02:07
  • All I'm saying is that I was hoping for bug-free tools and obviously, with fsck (and maybe md?), that's not the case. I don't see why growing a RAID or running fsck (I hadn't even gotten to the resize2fs point) should be highly dangerous operations, especially in the face of TWO redundant drives, and no hardware issues anywhere. Shouldn't it be possible for things like RAID, journalling, error correction, check-summing,... to prevent these kinds of problems? It's just a collection of bytes! How hard can it be? NASA can land a MINIVAN on Mars by lowering it from a rocket platform for chrissake... – Markus A. Jun 22 '13 at 06:54
  • @MarkusA. No code that complex is bug-free. However, in my experience, things like this happen only if there is some kind of hardware issue involved, like bad disks, controllers, connections or power outages, or alternatively some user error of whatever form. Whenever a FS blew up in my face, there was *always* something non-normal involved, and when it comes to that, every type of FS can fail you miserably, be it NTFS, extX, XFS, reiser or even ZFS. So, if you search for a FS where nothing will ever happen, you can stop looking now, none exist. Think about proper backup instead. – Sven Jun 22 '13 at 07:54
  • Regarding Curiosity: No one ever invested 2.5 billion USD of taxpayers' money into a file system, and unlike a Mars mission, where there are no second chances, file systems usually have safety nets in the form of backups, which makes efforts to further increase the reliability of some code economically unwise after some point. – Sven Jun 22 '13 at 08:04

3 Answers


Just rebuild the array and restore the data from a backup. The whole point of RAID is to minimize downtime. By messing around and trying to fix a problem like this, you just increase your downtime, defeating the whole purpose of RAID. RAID doesn't protect against data loss; it protects against downtime.
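A rough sketch of that route, assuming the original seven-drive layout (device names, mount points and the backup path are placeholders):

    # Recreate the array; mdadm will notice the old RAID superblocks on the
    # members and ask for confirmation before overwriting them
    mdadm --create /dev/md0 --level=6 --raid-devices=7 /dev/sd[b-h]1

    # Fresh file system, then restore from backup
    mkfs.ext3 /dev/md0
    mount /dev/md0 /srv/data
    rsync -aH /backup/data/ /srv/data/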

David Schwartz

After playing around with fsck some more, I found some remedies:

Preventing the 'Memory allocation failed' error

fsck seems to have a major issue with memory leakage. If it is run on a file-system with some problems (real or imaginary), it will "fix" them one-by-one (see screen dump in original question). As it does so, it consumes more and more memory (maybe keeping a change-log?), pretty much without bounds. But fsck can be cancelled at any time (Ctrl-C) and restarted. In this case, it will continue where it left off, but its memory use is reset to next-to-nothing (for a while).

With this in mind, the three things that need to be done are:

  • Use 64-bit Linux (it seems to make a difference in how fsck can use the available memory)
  • Add a ridiculously huge swap partition (I used 256GB; fsck runs for about 12 hours with it); see the sketch after this list if you don't have a spare partition that large
  • Frequently abort and restart fsck (how frequently depends on the size of the swap partition)
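If there is no spare partition of that size, a swap file on a separate, healthy disk does the job as well; a rough sketch, with path and size as placeholders:

    # Create and enable a 256GB swap file on a disk that is NOT part of the broken array
    dd if=/dev/zero of=/mnt/scratch/fsck.swap bs=1M count=262144
    chmod 600 /mnt/scratch/fsck.swap
    mkswap /mnt/scratch/fsck.swap
    swapon /mnt/scratch/fsck.swap

    # When the check is done:
    swapoff /mnt/scratch/fsck.swap
    rm /mnt/scratch/fsck.swap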

NOTE: I have no idea if canceling and restarting fsck brings with it any other dangers (probably does), but it seems to work for me.

Dealing with the resulting damage, if the 'Memory allocation failed' error occurs (IMPORTANT!)

fsck handles the Memory allocation failed error in the worst possible way: It destroys perfectly good data. I'm not sure why, but my guess is that it does some final data-write to disk of things that it had kept in memory, which (due to the error) have meanwhile gotten corrupted.

In my case, the most visible problem was that when I restarted fsck after the error, it sometimes reported a corrupted super-block. The problem is: I have no idea how corrupted the super-block was, especially in the cases where it didn't report it as corrupted. Maybe, if restarted after the error, it then uses incorrect drive meta-data found in the corrupted super-block to do all further checks and ends up fixing "issues" that aren't really there, destroying good data in the process.

Therefore, if fsck ever dies with the Memory allocation failed error, it needs to be restarted using the -b parameter to use a backup super-block that (hopefully) wasn't corrupted by the error. The location of the backup super-blocks can be found using mke2fs -n /dev/....
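Roughly like this (the device name is a placeholder, and mke2fs needs the same block size the file system was actually created with, or the reported locations will be off):

    # List the backup super-block locations; -n only simulates, nothing is written
    mke2fs -n -b 4096 /dev/md0

    # Re-run the check against one of the reported backup super-blocks
    # (32768 is just the typical first backup location for a 4KiB block size)
    fsck -b 32768 /dev/md0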

Since I don't know what happens if fsck dies with the backup super-block selected, I usually just abort fsck immediately when it gets to Pass 1: Checking inodes, blocks, and sizes and restart it again without -b, at which point it starts without complaining about a bad super-block. I.e. it seems like the first thing fsck -b does is to restore the main super-block.

Now the one we've all been waiting for:

How to mount a file-system without letting fsck run to completion

This, I found by accident: It turns out that after running fsck -b and aborting it as soon as it prints Pass 1: Checking inodes, blocks, and sizes (before any errors are found), the file-system is left in a mountable state (Yay! I got pretty much all of my data back!).
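Concretely, the sequence that got the data back looked roughly like this (device, super-block number and mount point are placeholders; I'd mount read-only first and copy everything off before attempting anything else):

    fsck -b 32768 /dev/md0    # Ctrl-C as soon as "Pass 1: Checking inodes, blocks, and sizes" appears
    mount -o ro /dev/md0 /mnt/recovery
    # copy the data somewhere safe before any further repair attempts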

(Note: There may be another way using mount -o force, but it wasn't needed in my case.)

How to avoid all these issues in the first place

There seem to be two ways:

  • Use ext3, but keep a perfectly up-to-date backup. Then, frequently run a read-only check with fsck -n (example after this list). If it shows any problems, delete the entire fs and restore everything from the backup. Since, in this scenario, one would be relying very heavily on the backup, I suggest keeping a backup of the backup. Also, use a copy-tool that somehow ensures that the restore does not create random errors in the process (an MTBF of a trillion r/w-ops is small when dealing with TBs of data). Make sure to plan for the resulting down-time, too, as a multi-TB restore probably takes a while...
  • My recommendation: Do NOT use ext3! The fs-design and associated tools (here fsck) aren't robust enough for real production use (yet?). The way fsck handles the memory error and the fact that the error occurs in the first place are not acceptable in my mind. I will be trying xfs from now on, but don't yet have enough experience with it to tell whether it's any better.
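For completeness, the read-only check mentioned in the first option looks something like this (device name is a placeholder):

    # Report problems only; answers "no" to every question and changes nothing on disk
    fsck -n /dev/md0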
Markus A.

Unfortunately, I'm not able to "add a comment", but I had to chime in here and thank the OP. I had a RAID 6 failure and manually assembled 6 of the 8 drives with closely matching event counts. However, I wasn't able to mount the assembled array.

It appeared that I needed to use a backup super-block. Running fsck -b <location> ... eventually died with an out-of-memory error, which led me to this thread/question.

In short, using fsck -b <location> ... and then pressing Ctrl+C allowed me to mount my array and recover my files.

Thanks!

Futile32
  • I can't believe this is still happening in 2018... Glad you got your files back... I no longer use ext3 and have been much happier ever since. – Markus A. Sep 24 '18 at 21:27