2

I have a log file directory at /var/log/apache2/vhosts on a Debian dedicated server. Every day I split the main access_log into respective vhostname.com_access_log files in /var/log/apache2/vhosts

In the last week or so I have noticed the server has become unresponsive and every time this happens there seems to be multiple log rotate instances running.

After some investigation I decided to check the contents of /var/log/apache2/vhosts to see if there was some sort of problem. I can 'cd' into the directory but cannot list its contents. The 'ls' command does not die it just doesn't do anything.

After trying various different ways to list the contents of the file I decided to just delete it and create a new file. Running rm on the directory from the parent directory didn't seem to work either however: rm -rf vhosts. From a view of 'top' I can see 'rm' is running and using a small amount of cpu... It has now been running for about an hour.

Any ideas what is going on and how I can get rid of that directory?

FURTHER INFORMATION:

Seeing a lot of this type error in /var/log/messages don't know if it is related:

    Oct 15 12:22:39 sp5059b kernel: [2257339.255530] apache2       D a9b86504     0 32217   9125
    Oct 15 12:22:39 sp5059b kernel: [2257339.255533]        dea15620 00000086 00000000 a9b86504 0007f05b dea157ac c1fdbfc0 00000001 
    Oct 15 12:22:39 sp5059b kernel: [2257339.255538]        00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 
    Oct 15 12:22:39 sp5059b kernel: [2257339.255542]        f7246ec0 f7246ec8 f7246ec4 dea15620 c02b95c6 c2483e80 d7085e80 dea15620 
    Oct 15 12:22:39 sp5059b kernel: [2257339.255547] Call Trace:
    Oct 15 12:22:39 sp5059b kernel: [2257339.255553]  [<c02b95c6>] __mutex_lock_slowpath+0x50/0x7b
    Oct 15 12:22:39 sp5059b kernel: [2257339.255557]  [<c02b945c>] mutex_lock+0xa/0xb
    Oct 15 12:22:39 sp5059b kernel: [2257339.255561]  [<c015810c>] generic_file_aio_write+0x41/0xa9
    Oct 15 12:22:39 sp5059b kernel: [2257339.255565]  [<f892ef99>] ext3_file_write+0x19/0x83 [ext3]
    Oct 15 12:22:39 sp5059b kernel: [2257339.255575]  [<c017460e>] do_sync_write+0xbf/0x100
    Oct 15 12:22:39 sp5059b kernel: [2257339.255581]  [<c0131a98>] autoremove_wake_function+0x0/0x2d
    Oct 15 12:22:39 sp5059b kernel: [2257339.255585]  [<c0132368>] set_process_cpu_timer+0x27/0xae
    Oct 15 12:22:39 sp5059b kernel: [2257339.255590]  [<c0125be2>] do_setitimer+0x2aa/0x31a
    Oct 15 12:22:39 sp5059b kernel: [2257339.255594]  [<c017e0a0>] fasync_helper+0x3c/0xb7
    Oct 15 12:22:39 sp5059b kernel: [2257339.255597]  [<c01bafe5>] security_file_permission+0xc/0xd
    Oct 15 12:22:39 sp5059b kernel: [2257339.255601]  [<c017454f>] do_sync_write+0x0/0x100
    Oct 15 12:22:39 sp5059b kernel: [2257339.255605]  [<c0174d80>] vfs_write+0x83/0x120
    Oct 15 12:22:39 sp5059b kernel: [2257339.255608]  [<c0175352>] sys_write+0x3c/0x63
    Oct 15 12:22:39 sp5059b kernel: [2257339.255612]  [<c01038d2>] syscall_call+0x7/0xb
    Oct 15 12:22:39 sp5059b kernel: [2257339.255618]  =======================

And some more errors:

    Oct 15 12:22:39 sp5059b kernel: [2257339.255618]  =======================
    Oct 15 13:52:55 sp5059b kernel: [2262820.703285] INFO: task kswapd0:173 blocked for more than 120 seconds.
    Oct 15 13:52:55 sp5059b kernel: [2262820.703310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct 15 13:52:55 sp5059b kernel: [2262820.703342] kswapd0       D 9527f7ea     0   173      2
    Oct 15 13:52:55 sp5059b kernel: [2262820.703346]        f747c4e0 00000046 c036f160 9527f7ea 0007f529 f747c66c c1fdbfc0 00000001 
    Oct 15 13:52:55 sp5059b kernel: [2262820.703352]        00000000 00000001 0051b47e 00000000 00000000 00000000 00000246 c0131ba7 
    Oct 15 13:52:55 sp5059b kernel: [2262820.703357]        f77dc400 f7731dc4 f77dc450 00396fe5 f8899e28 f77dc414 00000000 f747c4e0 
    Oct 15 13:52:55 sp5059b kernel: [2262820.703362] Call Trace:
    Oct 15 13:52:55 sp5059b kernel: [2262820.703374]  [<c0131ba7>] prepare_to_wait+0x12/0x4d
    Oct 15 13:52:55 sp5059b kernel: [2262820.703383]  [<f8899e28>] log_wait_commit+0x8b/0xd1 [jbd]
    Oct 15 13:52:55 sp5059b kernel: [2262820.703393]  [<c0131a98>] autoremove_wake_function+0x0/0x2d
    Oct 15 13:52:55 sp5059b kernel: [2262820.703398]  [<f88975dd>] journal_try_to_free_buffers+0x123/0x13b [jbd]
    Oct 15 13:52:55 sp5059b kernel: [2262820.703407]  [<f89320f7>] ext3_releasepage+0x0/0x57 [ext3]
    Oct 15 13:52:55 sp5059b kernel: [2262820.703420]  [<c0156409>] try_to_release_page+0x33/0x45
    Oct 15 13:52:55 sp5059b kernel: [2262820.703424]  [<c015ee80>] shrink_page_list+0x3c7/0x4a8
    Oct 15 13:52:55 sp5059b kernel: [2262820.703431]  [<c015e349>] isolate_lru_pages+0x44/0x17f
    Oct 15 13:52:55 sp5059b kernel: [2262820.703436]  [<c015e349>] isolate_lru_pages+0x44/0x17f
    Oct 15 13:52:55 sp5059b kernel: [2262820.703441]  [<c015f04f>] shrink_inactive_list+0xee/0x2fd
    Oct 15 13:52:55 sp5059b kernel: [2262820.703450]  [<c015f30e>] shrink_zone+0xb0/0xcd
    Oct 15 13:52:55 sp5059b kernel: [2262820.703454]  [<c015fa64>] kswapd+0x27b/0x3ed
    Oct 15 13:52:55 sp5059b kernel: [2262820.703459]  [<c015e484>] isolate_pages_global+0x0/0x42
    Oct 15 13:52:55 sp5059b kernel: [2262820.703463]  [<c0131a98>] autoremove_wake_function+0x0/0x2d
    Oct 15 13:52:55 sp5059b kernel: [2262820.703468]  [<c015f7e9>] kswapd+0x0/0x3ed
    Oct 15 13:52:55 sp5059b kernel: [2262820.703471]  [<c01319d7>] kthread+0x38/0x5d
    Oct 15 13:52:55 sp5059b kernel: [2262820.703474]  [<c013199f>] kthread+0x0/0x5d
    Oct 15 13:52:55 sp5059b kernel: [2262820.703477]  [<c01044f7>] kernel_thread_helper+0x7/0x10
    Oct 15 13:52:55 sp5059b kernel: [2262820.703482]  =======================
    Oct 17 16:39:21 sp5059b kernel: [2448749.835890] INFO: task exim4:15225 blocked for more than 120 seconds.
    Oct 17 16:39:21 sp5059b kernel: [2448749.835915] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct 17 16:39:21 sp5059b kernel: [2448749.835947] exim4         D 0f35bdb1     0 15225  15223
    Oct 17 16:39:21 sp5059b kernel: [2448749.835951]        c24032c0 00000082 c1fdbfc0 0f35bdb1 00089b8e c240344c c1fdbfc0 00000001 
    Oct 17 16:39:21 sp5059b kernel: [2448749.835956]        00000000 00000001 000013fa 00000000 00000000 00000000 00000246 c0131ba7 
    Oct 17 16:39:21 sp5059b kernel: [2448749.835962]        f77dc400 e6185f20 f77dc450 0039e41c f8899e28 f77dc414 00000000 c24032c0 
    Oct 17 16:39:21 sp5059b kernel: [2448749.835967] Call Trace:
    Oct 17 16:39:21 sp5059b kernel: [2448749.835978]  [<c0131ba7>] prepare_to_wait+0x12/0x4d
    Oct 17 16:39:21 sp5059b kernel: [2448749.835986]  [<f8899e28>] log_wait_commit+0x8b/0xd1 [jbd]
    Oct 17 16:39:21 sp5059b kernel: [2448749.835996]  [<c0131a98>] autoremove_wake_function+0x0/0x2d
    Oct 17 16:39:21 sp5059b kernel: [2448749.836001]  [<f8896451>] journal_stop+0x12f/0x151 [jbd]
    Oct 17 16:39:21 sp5059b kernel: [2448749.836009]  [<f892f0b7>] ext3_sync_file+0x57/0x9c [ext3]
    Oct 17 16:39:21 sp5059b kernel: [2448749.836022]  [<c015744c>] filemap_fdatawrite+0x12/0x16
    Oct 17 16:39:21 sp5059b kernel: [2448749.836027]  [<c018f5b5>] do_fsync+0x41/0x83
    Oct 17 16:39:21 sp5059b kernel: [2448749.836031]  [<c018f614>] __do_fsync+0x1d/0x2b
    Oct 17 16:39:21 sp5059b kernel: [2448749.836034]  [<c01038d2>] syscall_call+0x7/0xb
    Oct 17 16:39:21 sp5059b kernel: [2448749.836041]  =======================
Scott Pack
  • 14,907
  • 10
  • 53
  • 83
user568829
  • 231
  • 1
  • 3
  • 8

4 Answers4

3

It sounds to me like you may have had an enormous number of files in the directory. I've seen very similar symptoms to what you describe with very large directories.

There may have been some sort of issue (possibly a configuration issue with logrotate) where you ended up with millions of files. Once you get into the millions, most filesystems start having issues. Just doing an ls can take an enormous amount of time, and may even die if you wait long enough. Commands like rm will similarly have issues due to the sheer number of files. Similarly, things like logrotate will appear to hang because doing simple operations on the directory will take a really long time to complete.

One way to verify this is to do an ls -l on the directory above (/var/log/apache2/ in this case), and view the size of the directory entry (not the contents). For example (from a server I have, where there are a few directories with many files):

$ ls -lh
rwxrwxr-x 2 ironport ironport  11M Apr 21 19:37 oma01syslog01/
drwxrwxr-x 2 ironport ironport  12M Apr 21 19:37 oma01syslog02/
drwxrwxr-x 2 ironport ironport 5.8M Mar 17 12:30 sat01syslog01/
drwxrwxr-x 2 ironport ironport 4.0K Jun 29  2011 swn01syslog01/

$ for DIR in * ; do echo -n "$DIR:  " ; find $DIR -type f -print | wc -l ; done
oma01syslog01:  204332
oma01syslog02:  195500
sat01syslog01:  70960
swn01syslog01:  0

Notice the size of the directory entry as compared to the number of files in it.

Your best bet is to try to view the files with something like find. It doesn't try to pull the whole list before displaying them, and can allow you to query and operate on a directory without the issues you see in ls and rm. Some useful examples are below.

Number of files:

find ./directory/name/here -type f -print | wc -l

View a sample of files in the directory:

find ./directory/name/here -type f -print | head -20 | xargs ls -lh

Remove all files matching a pattern from the directory:

find ./directory/name/here -type f -iname '*.log' -print | xargs rm -rf

Note: I just noticed your comment, where you showed an ls from one directory up:

drwxr-xr-x 2 root root 126406656 2011-10-17 18:51 vhosts

This confirms my suspicions. 126406656 bytes is 120MB. That's a pretty huge directory entry (from my example above, I had an 11MB directory entry with 200k files), and suggests that you had millions of files in that directory.

Christopher Cashell
  • 9,128
  • 2
  • 32
  • 44
  • 2
    If you use `ls -f` it should report the directory entries as they are found, without trying to sort. Used for precisely this kind of situation. – mpez0 Apr 22 '14 at 00:35
  • @mpez0 - Good point. I usually jump to `find` for the flexibility and filtering features that I typically want for handling large directories, but `ls -f` handles simple needs quite well, too. – Christopher Cashell Apr 23 '14 at 17:16
0

This should be on Server Fault, but:

the read bit on your directory seems off: use something like chmod a+rdirectoryname to correct it.

This is most likely to happen when the directory is created; does the log rotation script do that? It's wholly expected that multiple concurrent instances of this script will cause problems, but I'm surprised that it would create permission issues on the directory. In any case, you need a faster way to rotate, and possibly a mechanism to keep the script from starting when it's already running.

reinierpost
  • 412
  • 3
  • 9
  • hmm, that didnt seem to help. When I 'ls' in /var/log/apache2 I get tihs for the vhosts directory: drwxr-xr-x 2 root root 126406656 2011-10-17 18:51 vhosts. The log rotation script does not create the file /var/log/apache2/vhosts - that is a static directory. –  Oct 17 '11 at 08:22
  • OK ... you can check the `/proc` filesystem or use the `lsof` command to see which processes have which files open. I'd also check the logs in `/var/log` to see if your disk is dying. I'm really just guessing here. – reinierpost Oct 17 '11 at 08:25
  • Your `/var/log` is on a local disk, right? Mounting it from another machine might lead to problems like this. – reinierpost Oct 17 '11 at 08:26
  • yes - its on the local machine. –  Oct 17 '11 at 08:35
  • Filesystem Size Used Avail Use% Mounted on /dev/hda3 33G 26G 6.2G 81% / tmpfs 998M 0 998M 0% /lib/init/rw udev 10M 576K 9.5M 6% /dev tmpfs 998M 0 998M 0% /dev/shm /dev/hda1 99M 27M 67M 29% /boot –  Oct 17 '11 at 08:35
  • reinierpost thanks, I have posted the error I am seeing alot of /var/log/messages (in the original question) –  Oct 17 '11 at 08:40
  • Locking-related, is my next guess. Who is doing the locking? – reinierpost Oct 18 '11 at 08:34
0

Hung tasks are a very severe error. It's so severe there's a proc file called /proc/sys/kernel/hung_task_panic to automatically panic/reboot. It means a process has hung in the kernel for a very long time, this shouldn't happen under normal usage.

  • Check your memory usage/swap
  • Check your filesystem/disk
cdleonard
  • 123
  • 3
  • Memory Usage seems normal. It is a remote dedicated server so I have not been able to run fsck or anything similar. I am just backing up the entire system before I proceed with trying to fix the problem. I am trying to determine if it is a Kernel bug, a corrupt file system or a bad hard drive... Or something else. – user568829 Oct 17 '11 at 13:57
0

The problem seems to have resolved itself??!?

I noticed the load on the server had gone down from its peaking average of about 70 back to hovering around 1-3 (where it normally sits).

I ran: screen rm -rf vhosts

Came back about half an hour later and tried to resume screen and nothing. Yay! The directory had been deleted.

Really none the wiser as to what happened / is happening, but at least part of the problem is gone.

No more strange kernel errors either.

user568829
  • 231
  • 1
  • 3
  • 8
  • Perhaps you can monitor for two log rotation instances or prevent that from happening. – reinierpost Oct 18 '11 at 08:35
  • I think it was the strange directory that was causing problems with logrotate. I think logrotate was trying to process the data in the /var/log/apache2/vhosts directory and just getting stuck for more than 24 hours and then another logrotate would start, compounding the problem. I did notice when the server was 'sick' yesterday my backup process (which lists the files to screen as it backs up) was very very slow. – user568829 Oct 18 '11 at 11:33