
I have an org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper process that fills the local '/' mount.

It runs for a few minutes, the disk fills, Nagios alerts fire, and then I kill the process. Once the process is killed, disk utilization drops back down to its baseline of 40%.

[disk usage chart]

This happens fairly quickly and, since it's a production system, there isn't a lot of time to peruse the filesystem to see which files are new. There are also a couple of NFS mounts that cause du -sh * to hang. We're running RHEL 6.7.
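One workaround I can think of is to keep du on the root filesystem so the NFS mounts never get touched, though I haven't timed this on the box and it may still be too slow while the disk is filling:

    # -x stays on one filesystem, so the NFS mounts are skipped
    du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20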

Is there a smart way to figure out what, exactly, is filling the disk? Perhaps a fast way to capture, diff, and aggregate the file sizes from lsof? I imagine this is a fairly common scenario, so there may already be a nice awk one-liner in every sysadmin's toolkit.
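Something along these lines is what I have in mind, but it's an untested sketch ($PID stands in for the MorphlineMapper PID, and it assumes file names without spaces):

    # snapshot the open regular files and their sizes for the suspect process
    lsof -nP -p "$PID" | awk '$5 == "REG" {print $7, $NF}' | sort -k2 > /tmp/lsof.1
    sleep 30
    lsof -nP -p "$PID" | awk '$5 == "REG" {print $7, $NF}' | sort -k2 > /tmp/lsof.2
    # join the two snapshots on file name and report the fastest-growing files
    join -1 2 -2 2 /tmp/lsof.1 /tmp/lsof.2 | awk '$3 > $2 {print $3 - $2, $1}' | sort -rn | head

I'm hoping someone has a more battle-tested version.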

1 Answer


From the chart, this event takes about 8 minutes from onset to a full disk. Catching that manually would require amazing response time, especially if no administrator was on the system when it began.

You need more reaction time. Give the job much more space to chew on, or throttle or limit it in some way.

iotop is a nice Python script for seeing which processes are doing the most I/O, which will likely point at your runaway. It produces decent batch output with the right options, say iotop -bkto.
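For example, something like this run as root will capture a timestamped, per-process I/O log you can review after the fact (the log path and duration here are just an example):

    # -b batch mode, -k report kB/s, -t add timestamps, -o only show processes doing I/O
    # -d 1 samples every second, -n 600 stops after 10 minutes
    iotop -bkto -d 1 -n 600 > /tmp/iotop.log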
