
We have a CentOS 5.8 server with 64 GB of RAM that normally runs at around 700 MB of RAM in use. When I run tar -czf to create an archive of about 22 GB, the pagecache climbs to about 22 GB and then the kernel starts killing processes. Eventually the server becomes unresponsive, even to a newly plugged-in USB keyboard.

Fri Aug  5 11:00:01 CDT 2016
 11:00:01 up 1 day,  2:06,  3 users,  load average: 1.31, 1.71, 1.77
             total       used       free     shared    buffers     cached
Mem:         64237      22874      41363          0          0      22407
-/+ buffers/cache:        466      63771
Swap:         5279          0       5279
*****************************************************
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      534G  106G  401G  21% /
/dev/sda3              97M   13M   80M  14% /boot
tmpfs                  32G     0   32G   0% /dev/shm
*****************************************************
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 42355760   1020 22944876    0    0    33    19    7    6  0  0 100  0  0
 0  0      0 42355760   1020 22945204    0    0     0     0 1003 2933  0  0 100  0  0
 0  0      0 42355760   1020 22945220    0    0     0     0 1040 2963  0  0 100  0  0
 0  0      0 42355760   1020 22945220    0    0     0     0 1004 2843  0  0 100  0  0
 0  0      0 42355760   1020 22945220    0    0     0     0 1036 2915  0  0 100  0  0

And the processes:

root       993  0.0  0.0 S<s  00:00:00 /sbin/udevd -d
root      3578  0.0  0.0 S<Lsl 00:00:00 iscsiuio
root      3582  0.0  0.0 Ss   00:00:00 iscsid
root      3583  0.0  0.0 S<Ls 00:00:00 iscsid
root      4377  0.0  0.0 S<sl 00:00:00 auditd
root      4379  0.0  0.0 S<sl 00:00:00  \_ /sbin/audispd
root      4401  0.0  0.0 Ss   00:00:00 syslogd -m 0
root      4404  0.0  0.0 Ss   00:00:00 klogd -x
root      4540  0.0  0.0 Ss   00:00:00 irqbalance
rpc       4574  0.0  0.0 Ss   00:00:00 portmap
rpcuser   4655  0.0  0.0 Ss   00:00:00 rpc.statd
root      4683  0.0  0.0 Ss   00:00:00 rpc.idmapd
dbus      4705  0.0  0.0 Ss   00:00:00 dbus-daemon --system
root      4714  0.0  0.0 Ss   00:00:00 /usr/sbin/hcid
root      4720  0.0  0.0 Ss   00:00:00 /usr/sbin/sdpd
root      4786  0.0  0.0 S<   00:00:00 [krfcommd]
root      4823  0.0  0.0 Ssl  00:00:00 pcscd
root      4833  0.0  0.0 Ss   00:00:00 /usr/sbin/acpid
68        4846  0.0  0.0 Ss   00:00:01 hald
root      4847  0.0  0.0 S    00:00:00  \_ hald-runner
68        4855  0.0  0.0 S    00:00:00      \_ hald-addon-acpi: listening on acpid socket /var/run/acpid.socket
68        4862  0.0  0.0 S    00:00:00      \_ hald-addon-keyboard: listening on /dev/input/event0
68        4865  0.0  0.0 S    00:00:00      \_ hald-addon-keyboard: listening on /dev/input/event3
68        4868  0.0  0.0 S    00:00:00      \_ hald-addon-keyboard: listening on /dev/input/event2
root      4876  0.0  0.0 S    00:00:13      \_ hald-addon-storage: polling /dev/sr0
root      4878  0.0  0.0 S    00:00:00      \_ hald-addon-storage: polling /dev/sdb
root      4912  0.0  0.0 Ss   00:00:00 /usr/bin/hidd --server
root      4956  0.0  0.0 Ssl  00:00:00 automount
root      4973  0.0  0.0 Ss   00:00:00 ./hpiod
root      4978  0.0  0.0 S    00:00:00 /usr/bin/python ./hpssd.py
root      4991  0.0  0.0 Ss   00:00:00 /usr/sbin/sshd
root     12541  0.0  0.0 Ss   00:00:01  \_ sshd: root@pts/1
root     12543  0.0  0.0 Ss+  00:00:00  |   \_ -bash
root     13224  0.0  0.0 Ss   00:00:00  \_ sshd: metabk1 [priv]
metabk1  13226  0.0  0.0 S    00:00:00      \_ sshd: metabk1@pts/2
metabk1  13227  0.0  0.0 Ss+  00:00:00          \_ -bash
root      5000  0.0  0.0 Ss   00:00:00 cupsd
root      5023  0.0  0.0 Ss   00:00:06 sendmail: accepting connections
smmsp     5031  0.0  0.0 Ss   00:00:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
root      5041  0.0  0.0 Ss   00:00:00 gpm -m /dev/input/mice -t exps2
root      5050  0.0  0.0 Ss   00:00:00 crond
root     13284  0.0  0.0 S    00:00:00  \_ crond
root     13285  0.0  0.0 Ss   00:00:00      \_ /bin/sh -c /usr/local/bin/memmon.sh >> /var/log/memmon.txt
root     13286  0.0  0.0 S    00:00:00          \_ /bin/bash /usr/local/bin/memmon.sh
root     13292  0.0  0.0 R    00:00:00              \_ ps axfww -o ruser,pid,%cpu,%mem,stat,time,cmd
xfs       5079  0.0  0.0 Ss   00:00:00 xfs -droppriv -daemon
root      5096  0.0  0.0 Ss   00:00:00 /usr/sbin/atd
avahi     5135  0.0  0.0 Ss   00:00:00 avahi-daemon: running [RDVR.local]
avahi     5136  0.0  0.0 Ss   00:00:00  \_ avahi-daemon: chroot helper
root      5197  0.0  0.0 S    00:00:00 /usr/sbin/smartd -q never
root      5202  0.0  0.0 Ss+  00:00:00 /sbin/mingetty tty1
root      5203  0.0  0.0 Ss   00:00:00 login -- root
root     13025  0.0  0.0 Ss   00:00:00  \_ -bash
root     13060  0.7  0.0 S+   00:00:14      \_ top -M
root      5204  0.0  0.0 Ss+  00:00:00 /sbin/mingetty tty3
root      5205  0.0  0.0 Ss+  00:00:00 /sbin/mingetty tty4
root      5206  0.0  0.0 Ss+  00:00:00 /sbin/mingetty tty5
root      5208  0.0  0.0 Ss+  00:00:00 /sbin/mingetty tty6
root      5209  0.0  0.0 Ss   00:00:00 /usr/sbin/gdm-binary -nodaemon
root     13202  0.0  0.0 S    00:00:00  \_ /usr/sbin/gdm-binary -nodaemon
root     13203  0.6  0.0 Ss+  00:00:02      \_ /usr/bin/Xorg :0 -br -audit 0 -auth /var/gdm/:0.Xauth -nolisten tcp vt7
gdm      13216  0.1  0.0 Ss   00:00:00      \_ /usr/libexec/gdmgreeter
root      5251  0.0  0.0 Sl   00:00:00 /usr/libexec/gdm-rh-security-token-helper
root      5257  0.0  0.0 SN   00:00:00 /usr/bin/python -tt /usr/sbin/yum-updatesd
root     13217  0.1  0.0 SN   00:00:00  \_ /usr/bin/python -tt /usr/libexec/yum-updatesd-helper --check --dbus
root      5259  0.0  0.0 SN   00:00:00 /usr/libexec/gam_server
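Incidentally, the snapshots above look like the output of the cron-driven /usr/local/bin/memmon.sh visible in the process tree. A minimal sketch of what such a collector might contain (an assumption on my part; the real script may differ):

```shell
#!/bin/sh
# Hypothetical reconstruction of /usr/local/bin/memmon.sh, which appears to
# snapshot memory state periodically and append it to /var/log/memmon.txt
date
uptime
free -m                          # totals in MB, matching the output above
echo '*****************************************************'
df -h
echo '*****************************************************'
# vmstat prints one-second samples; guard in case it is not installed
command -v vmstat >/dev/null 2>&1 && vmstat 1 5
ps axfww -o ruser,pid,%cpu,%mem,stat,time,cmd
```

The crontab entry implied by the process list would then be something like `*/5 * * * * /usr/local/bin/memmon.sh >> /var/log/memmon.txt`.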

And finally, some logs:

DMESG:+Free pages:    42316788kB (42309540kB HighMem)
DMESG:+Active:28543 inactive:5737934 dirty:167940 writeback:0 unstable:0 free:10579197 slab:92371 mapped-file:5498 mapped-anon:21203 pagetables:727
DMESG:+DMA free:3588kB min:68kB low:84kB high:100kB active:12kB inactive:0kB present:16384kB pages_scanned:56981 all_unreclaimable? yes
DMESG:+lowmem_reserve[]: 0 0 880 65520
DMESG:+DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
DMESG:+lowmem_reserve[]: 0 0 880 65520
DMESG:+Normal free:3660kB min:3756kB low:4692kB high:5632kB active:424kB inactive:44kB present:901120kB pages_scanned:621593 all_unreclaimable? yes
DMESG:+lowmem_reserve[]: 0 0 0 517120
DMESG:+HighMem free:42309540kB min:512kB low:69552kB high:138592kB active:113840kB inactive:22951588kB present:66191360kB pages_scanned:0 all_unreclaimable? no
DMESG:+lowmem_reserve[]: 0 0 0 0
DMESG:+DMA: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 3588kB
DMESG:+DMA32: empty
DMESG:+Normal: 1*4kB 23*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 3660kB
DMESG:+HighMem: 1*4kB 0*8kB 0*16kB 1*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 0*2048kB 10329*4096kB = 42309540kB
DMESG:+5745312 pagecache pages
DMESG:+Swap cache: add 0, delete 0, find 0/0, race 0+0
DMESG:+Free swap  = 5406712kB
DMESG:+Total swap = 5406712kB
DMESG:+Free swap:       5406712kB
DMESG:+16777216 pages of RAM
DMESG:+16547840 pages of HIGHMEM
DMESG:+332415 reserved pages
DMESG:+5728571 pages shared
DMESG:+0 pages swap cached
DMESG:+167940 pages dirty
DMESG:+0 pages writeback
DMESG:+5498 pages mapped
DMESG:+92360 pages slab
DMESG:+727 pages pagetables
DMESG:+Out of memory: Killed process 12751, UID 503, (bash).
DMESG:+top invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
DMESG:+ [<c045b9c3>] out_of_memory+0x72/0x1a3
DMESG:+ [<c045cf46>] __alloc_pages+0x24e/0x2cf
DMESG:+ [<c04a8ef3>] proc_info_read+0x0/0x96
DMESG:+ [<c045cfee>] __get_free_pages+0x27/0x33
DMESG:+ [<c04a8f2b>] proc_info_read+0x38/0x96
DMESG:+ [<c04a8ef3>] proc_info_read+0x0/0x96
DMESG:+ [<c04770e5>] vfs_read+0x9f/0x141
DMESG:+ [<c047756b>] sys_read+0x3c/0x63
DMESG:+ [<c0404f4b>] syscall_call+0x7/0xb
DMESG:+ =======================
  • What was UID 503 doing in bash at that time? You could turn on auditd and some custom rules to log activities. Sar data can also be telling sometimes. – Aaron Aug 05 '16 at 16:52
  • Please show the output of `uname -r`. – Michael Hampton Aug 05 '16 at 18:05
  • @MichaelHampton `uname -r` gives `2.6.18-308.el5PAE`. @Aaron UID 503 is metabk1, which was doing nothing at the time but is usually the user running the tar -czf that causes the memory usage to jump. This was soon after the tar -czf had finished, so the user was sitting idle. I'm looking into your suggestions for auditd and sar; I haven't used them before. – KennyD Aug 05 '16 at 18:14
  • 1
    That explains things. I hadn't even looked at the kernel. This server is off the grid and rarely gets used until now. We are going to reinstall OS. Thanks for the input folks. – KennyD Aug 05 '16 at 19:36

1 Answer


You're running a 32-bit (PAE) kernel, and an out-of-date one at that. This means any single process is limited to somewhat less than 4 GB of address space, and on a machine with 64 GB of RAM the kernel's low-memory zone is easily exhausted, which is exactly what your dmesg output shows: the Normal zone is below its min watermark and marked all_unreclaimable. This system should have been installed with a 64-bit OS from the start. (And while you're at it, consider a newer distribution; CentOS 5 will go out of support soon.)
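A quick way to confirm this diagnosis is to inspect the kernel release string and word size. A sketch, using the kernel string reported in the comments (on a live box you would substitute `$(uname -r)` and `$(getconf LONG_BIT)`):

```shell
# Classify the running kernel. The sample values are the ones reported in
# the comments above; substitute live values from uname/getconf in practice.
kernel="2.6.18-308.el5PAE"   # live: kernel=$(uname -r)
bits=32                      # live: bits=$(getconf LONG_BIT)
case "$kernel" in
  *PAE*) echo "32-bit PAE kernel: can see up to 64GB of RAM, but each process"
         echo "is still capped near 4GB and kernel lowmem remains scarce" ;;
  *)     [ "$bits" = 64 ] && echo "64-bit kernel" || echo "plain 32-bit kernel" ;;
esac
```

The PAE suffix is what gives the game away here: PAE extends physical addressing on 32-bit x86, but it does nothing for per-process virtual address space or for the kernel's small Normal (lowmem) zone.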
