13

I am running du -sh in a variety of directories to find disk hogs. I have two identical servers (Dell PE2850s), both running RHEL5, and du takes significantly longer to run on one server than on the other.

For example, du -sh /opt/foobar takes 5 minutes on server A (which has about 25 GB in that directory), while on server B the same command with the same amount of data reports back almost instantaneously. I don't see anything glaringly obvious when running top, etc.

Any advice is greatly appreciated.

Jon Weinraub
  • 4
    The speed of `du -s` is not dependent on the size of the data but rather on the number of files. Do both directory trees have a similar number of files? – Ladadadada Aug 07 '13 at 14:44
  • 3
    Also, `du` will work much faster if all the directory meta data (like file sizes) is currently cached. If this is the case for whatever reason on one server and not the other, it will result in large differences. – Sven Aug 07 '13 at 14:48
  • 1
    @Ladadada I would say yes, there are about the same number of files. Even adding the asterisk to list the file sizes individually takes a long time to scroll. But I am not sure how to verify whether the metadata is cached. – Jon Weinraub Aug 07 '13 at 15:06
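
Both points raised in the comments can be checked from the shell. The following is a minimal sketch, not something posted in the thread: `count_entries` and `time_du_twice` are hypothetical helper names, and `/opt/foobar` is the example path from the question. If the second `du` run is far faster than the first, the first run was most likely reading metadata from disk and the second from cache.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the thread).

# Count every file and directory below $1 without crossing into other
# filesystems (-xdev). The starting directory itself is included.
# Names containing newlines would inflate the count.
count_entries() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Run du twice on $1 and print the wall-clock seconds for each run.
# A large gap between the two runs suggests the first one was uncached.
time_du_twice() {
    for run in first second; do
        start=$(date +%s)
        du -sx "$1" >/dev/null 2>&1
        end=$(date +%s)
        printf '%s run: %ss\n' "$run" "$((end - start))"
    done
}

# Example (run on both servers and compare):
#   count_entries /opt/foobar
#   time_du_twice /opt/foobar
```

Comparing the entry counts on both servers answers the "number of files" question directly; similar counts with very different first-run timings would point at the disk or controller rather than the data.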

4 Answers

8

If you have a huge number of files in that directory and its contents change constantly, the directory itself becomes fragmented over time. When the OS then reads the directory contents, it has to perform many unnecessary disk seeks. This happens especially with ext* filesystems (ext4 might be better, though) and the old ReiserFS v3.x filesystems (if the filesystem got past 85% full or so).

The solution is quite easy:

cp -pr origdir newdir
mv origdir origdir.bak
mv newdir origdir
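
Before copying, it may be worth checking whether any directory is actually bloated. On ext2/ext3 a directory's own size (the figure `ls -ld` prints for it) grows as entries are added and is not reduced when they are deleted, so an unusually large value hints at a directory that holds, or once held, very many entries. A rough sketch, with `largest_dirs` as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list the directories below $1 whose own
# directory file is largest (not the size of their contents).
# $2 is the number of lines to show (default 10).
largest_dirs() {
    # ls -ld prints the directory's own size in column 5;
    # sort numerically on that column, largest first.
    find "$1" -xdev -type d -exec ls -ld {} + 2>/dev/null |
        sort -k5,5nr |
        head -n "${2:-10}"
}

# Example:
#   largest_dirs /opt/foobar 20
```

Directories that stand out here are the ones most likely to benefit from being rebuilt with the copy-and-rename steps.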

Of course, if everything is cached in RAM, this does not matter much; Linux usually caches frequently accessed files and directories quite aggressively. If you truly want to keep the contents of those directories in RAM, you can put something like ls -lah /your/dir >/dev/null 2>&1 in your crontab.

EDIT: Oh, one more thing came to mind. If your server has a battery-backed RAID controller with some cache on it, please check that the battery is OK. I've seen situations where the battery was dead and the controller disabled the cache completely, which hurt performance badly. For example, on HP servers the iLO log might mention the controller battery while everything in the server health dashboard looks fine and green; only the log entry tells you about the problem.

Janne Pikkarainen
  • 1
    This will probably take me some time to do. It is on a production server, so I will need to do it overnight, and the entire directory contains several hundred gigabytes of data, so I don't want to bog it down. I will report back first thing tomorrow morning. Thanks for the idea. – Jon Weinraub Aug 07 '13 at 18:24
  • I am still running this command and there is no telling how long it will take. I even reniced it and cp is still running; it has been about 1 hr 15 min since I started it. Even running du on that folder in another shell took a long time. Do you think I should just `umount` the drive and `fsck` it? – Jon Weinraub Aug 07 '13 at 22:17
  • Just let it run unless it bothers your production somehow. With RHEL5 and its default CFQ I/O scheduler you can put the cp command in the idle class so it won't bully the other processes: `ionice -c3 -p $(pidof cp)` or so. – Janne Pikkarainen Aug 08 '13 at 05:49
  • Please also read my latest edit. – Janne Pikkarainen Aug 08 '13 at 08:13
  • Oh, so renice -15 wouldn't have helped? Though this is for one directory, there are several others that I would be monitoring with du. But the battery part intrigues me because, IIRC, there is one. One other thing I discovered, or rather forgot about: it is on LVM. The data is on `/`, mounted from `/dev/mapper/VolGroup00-LogVol00`, so if I wanted to fsck it to check for errors or fragmentation, I would need to `init 1` since I can't `umount` the active partition. – Jon Weinraub Aug 08 '13 at 13:14
  • 1
    I know it has been a while, but I finally got around to running the cp command you mentioned. It took two hours to copy 25 GB. After doing the move, running another du -sh was just as slow. In fact, even erasing the backup directory is slow too! – Jon Weinraub Oct 10 '13 at 21:59
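
The idle-class suggestion from the comments can also be applied when the copy is started, rather than to an already running process. This is only a sketch: `idle_copy` is a hypothetical helper name, and the `ionice -c3` part assumes the CFQ I/O scheduler mentioned in the comments. `nice` lowers CPU priority only, while `ionice -c3` asks for the idle I/O class.

```shell
#!/bin/sh
# Hypothetical helper: copy a tree at the lowest CPU and I/O priority.
# $1 is the source directory, $2 the destination (must not exist yet).
idle_copy() {
    # Use the idle I/O class when ionice is installed and permitted;
    # otherwise fall back to a plain low-CPU-priority copy.
    if command -v ionice >/dev/null 2>&1 && ionice -c3 true 2>/dev/null; then
        ionice -c3 nice -n 19 cp -pr -- "$1" "$2"
    else
        nice -n 19 cp -pr -- "$1" "$2"
    fi
}

# Example:
#   idle_copy origdir newdir
```

This will not make the copy faster; it only reduces its impact on other processes competing for the same disk.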
3

I suggest trying the plain du command without any switches. It prints each directory as it finishes, so you will eventually see which directory is slowing down the process. The cause might be a faulty disk or something else.
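
One way to do this more systematically is to time `du` on each top-level subdirectory separately. A minimal sketch, where `time_subdirs` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: for each immediate subdirectory of $1, print
#   <seconds taken> <size in KiB> <path>
# so that a slow subtree stands out. Hidden directories are skipped
# because the */ glob does not match names starting with a dot.
time_subdirs() {
    for d in "$1"/*/; do
        [ -d "$d" ] || continue
        start=$(date +%s)
        size=$(du -sxk "$d" 2>/dev/null | cut -f1)
        end=$(date +%s)
        printf '%s\t%s\t%s\n' "$((end - start))" "$size" "$d"
    done
}

# Example: slowest subdirectories first
#   time_subdirs /opt/foobar | sort -rn | head
```

A subdirectory that takes far longer than others of similar size is the place to look first.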

Király István
0

Just another trap I myself encountered

I forgot that I had a huge network drive mounted into a folder within my directory tree.

O despicable me!

Run

$ mount

to see if that's the case, then unmount it with

$ sudo umount /path/to/mount/point

Symbolic and hard links might also be an issue for this use case?
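
Unmounting is not always necessary: `du -x` skips directories that live on other filesystems, so `du -shx /path` stays off the network drive. To list what is mounted below a directory first, something like the following sketch could be used. `mounts_under` is a hypothetical helper name, and it assumes a Linux `/proc/mounts`.

```shell
#!/bin/sh
# Hypothetical helper: list mount points located below directory $1.
# Output columns: mount point, filesystem type, device.
# $2 optionally names a mounts table to read (default /proc/mounts).
# Limitation: /proc/mounts escapes spaces in paths as \040, so mount
# points containing spaces will not match.
mounts_under() {
    dir=${1%/}
    awk -v d="$dir/" 'index($2, d) == 1 { print $2 "\t" $3 "\t" $1 }' \
        "${2:-/proc/mounts}"
}

# Example:
#   mounts_under /opt/foobar
#   du -shx /opt/foobar    # -x: do not descend into other filesystems
```

Any `nfs` or `cifs` entry in the output is a candidate for the slowdown described above.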

grenix
0

Listing the directories and running du on each of them did the trick:

find . -mindepth 1 -maxdepth 1 -print0 | xargs -0 du -sh

jaya