
I have an org.apache.solr.hadoop.MapReduceIndexerTool/MorphlineMapper process that fills the local '/' mount.

It runs for a few minutes, the disk fills, Nagios alerts fire, and then I kill the process. Once the process is killed, disk utilization drops back down to its baseline of 40%.

[disk usage chart]

This happens fairly quickly and, since it's a production system, there isn't a lot of time to peruse the filesystem to see which files are new. There are also a couple of NFS mounts that cause du -sh * to hang. We're running RHEL 6.7.
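One workaround I can think of is to keep du on the root filesystem so the NFS mounts never get touched, though I haven't timed this on the box and it may still be too slow while the disk is filling:

    # -x stays on one filesystem, so the NFS mounts are skipped
    du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -20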

Is there a smart way to figure out what, exactly, is filling the disk? Perhaps a fast way to capture, diff, and aggregate the file sizes from lsof? I imagine this is a fairly common scenario, so there may already be a nice awk one-liner in every sysadmin's toolkit.
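Something along these lines is what I have in mind, but it's an untested sketch ($PID stands in for the MorphlineMapper PID, and it assumes file names without spaces):

    # snapshot the open regular files and their sizes for the suspect process
    lsof -nP -p "$PID" | awk '$5 == "REG" {print $7, $NF}' | sort -k2 > /tmp/lsof.1
    sleep 30
    lsof -nP -p "$PID" | awk '$5 == "REG" {print $7, $NF}' | sort -k2 > /tmp/lsof.2
    # join the two snapshots on file name and report the fastest-growing files
    join -1 2 -2 2 /tmp/lsof.1 /tmp/lsof.2 | awk '$3 > $2 {print $3 - $2, $1}' | sort -rn | head

I'm hoping someone has a more battle-tested version.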

1 Answer


From the chart, this event takes about 8 minutes from onset to a full disk. Catching that manually would require amazing response time, especially if no administrator was on the system when it began.

You need more reaction time. Give the job much more space to chew on, or throttle or limit it in some way.

iotop is a nice Python script for seeing which processes are doing the most I/O, which will likely point at your runaway. It produces decent batch output with the right options, say iotop -bkto.
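For example, something like this run as root will capture a timestamped, per-process I/O log you can review after the fact (the log path and duration here are just an example):

    # -b batch mode, -k report kB/s, -t add timestamps, -o only show processes doing I/O
    # -d 1 samples every second, -n 600 stops after 10 minutes
    iotop -bkto -d 1 -n 600 > /tmp/iotop.log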
