I'm having a weird issue. A few times a week, a website I'm running on Amazon Lightsail comes to a crashing halt and becomes IO bound (I can't usually use ssh when this happens) and the server runs out of memory. It's a lightly trafficked site with less than 1000 pages views a day and it's behind a varnish server. I can't run the top
command to see the issue because the server becomes IO bound. I think it's thrashing due to low memory. The server has 2GB of RAM which I think should be plenty. The site is not under heavy load when this happens. It just occurs kind of out of the blue. I was able to catch it today and run top
and see the CPU load average is in the 30s and 40s. These episodes can happen for about 30 min. to an hour and then just go away as quickly as they came.
I think the issue with a large mysql query but I'm not sure. Here is what top says whenever the binary is run:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1946 mysql 20 0 1502.5m 615.4m 27.5m S 0.3 30.8 0:35.33 mysqld.bin
The 1502.5 MB VIRT figure and 30.8% memory usage seems awfully high to me.
I had an episode today and looked in the syslog and found this:
3748 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901476] .php-fpm.bin invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
3749 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901481] .php-fpm.bin cpuset=/ mems_allowed=0
3750 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901486] CPU: 0 PID: 13816 Comm: .php-fpm.bin Not tainted 4.4.0-1114-aws #127-Ubuntu
3751 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901487] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
3752 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901489] 0000000000000286 b444799d441c3af0 ffff88003ec079e0 ffffffff8140accb
3753 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901491] ffff88003ec07b98 ffff880021027000 ffff88003ec07a50 ffffffff81216c47
3754 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901493] 0000000000000000 ffff88007a349f00 ffff88007b59e200 ffff88003ec07a38
3755 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901496] Call Trace:
3756 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901502] [<ffffffff8140accb>] dump_stack+0x6d/0x92
3757 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901507] [<ffffffff81216c47>] dump_header+0x5a/0x1c3
3758 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901512] [<ffffffff813a25f1>] ? apparmor_capable+0x131/0x1b0
3759 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901517] [<ffffffff8119b35b>] oom_kill_process+0x20b/0x3d0
3760 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901519] [<ffffffff8119b768>] out_of_memory+0x1f8/0x470
3761 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901523] [<ffffffff811a17d3>] __alloc_pages_slowpath.constprop.89+0x943/0xaf0
3762 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901526] [<ffffffff811a1c1f>] __alloc_pages_nodemask+0x29f/0x2b0
3763 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901529] [<ffffffff811ed00c>] alloc_pages_current+0x8c/0x110
3764 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901531] [<ffffffff8119748b>] __page_cache_alloc+0xab/0xc0
3765 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901534] [<ffffffff81199dc0>] filemap_fault+0x160/0x440
3766 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901536] [<ffffffff812b07b6>] ext4_filemap_fault+0x36/0x50
3767 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901539] [<ffffffff811c7537>] __do_fault+0x77/0x110
3768 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901542] [<ffffffff811cb4d9>] handle_mm_fault+0x1259/0x1b80
3769 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901546] [<ffffffff8106de84>] __do_page_fault+0x1a4/0x410
3770 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901548] [<ffffffff8106e112>] do_page_fault+0x22/0x30
3771 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901552] [<ffffffff81849b38>] page_fault+0x28/0x30
3772 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901568] Mem-Info:
3773 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] active_anon:463535 inactive_anon:6034 isolated_anon:0
3774 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] active_file:247 inactive_file:446 isolated_file:0
3775 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] unevictable:913 dirty:0 writeback:0 unstable:0
3776 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] slab_reclaimable:5581 slab_unreclaimable:7888
3777 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] mapped:22790 shmem:30428 pagetables:3630 bounce:0
3778 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] free:13737 free_pcp:59 free_cma:0
3779 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901576] Node 0 DMA free:8212kB min:352kB low:440kB high:528kB active_anon:6976kB inactive_anon:48kB active_file:16kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:372kB shmem:512kB slab_reclaimable:32kB slab_unreclaimable:148kB kernel_stack:0kB pagetables:176kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:876 all_unreclaimable? yes
3780 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901582] lowmem_reserve[]: 0 1967 1967 1967 1967
3781 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901584] Node 0 DMA32 free:46736kB min:44700kB low:55872kB high:67048kB active_anon:1847164kB inactive_anon:24088kB active_file:972kB inactive_file:1656kB unevictable:3652kB isolated(anon):0kB isolated(file):0kB present:2080768kB managed:2030696kB mlocked:3652kB dirty:0kB writeback:0kB mapped:90788kB shmem:121200kB slab_reclaimable:22292kB slab_unreclaimable:31404kB kernel_stack:5696kB pagetables:143 44kB unstable:0kB bounce:0kB free_pcp:236kB local_pcp:236kB free_cma:0kB writeback_tmp:0kB pages_scanned:20120 all_unreclaimable? yes
3782 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901589] lowmem_reserve[]: 0 0 0 0 0
3783 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901592] Node 0 DMA: 3*4kB (UM) 3*8kB (UM) 1*16kB (M) 5*32kB (UME) 3*64kB (UME) 3*128kB (UME) 3*256kB (UME) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 8212kB
3784 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901601] Node 0 DMA32: 366*4kB (UMEH) 809*8kB (UMEH) 789*16kB (UEH) 322*32kB (UMEH) 142*64kB (UMEH) 43*128kB (UMEH) 5*256kB (ME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46736kB
3785 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901610] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
3786 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901611] 31732 total pagecache pages
3787 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901612] 0 pages in swap cache
3788 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901614] Swap cache stats: add 0, delete 0, find 0/0
3789 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901614] Free swap = 0kB
3790 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901615] Total swap = 0kB
3791 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901616] 524189 pages RAM
3792 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901617] 0 pages HighMem/MovableOnly
3793 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901618] 12540 pages reserved
3794 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901619] 0 pages cma reserved
3795 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901620] 0 pages hwpoisoned
3796 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901621] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
3797 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901625] [ 428] 0 428 23692 154 17 3 0 0 lvmetad
3798 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901628] [ 458] 0 458 10902 860 22 3 0 -1000 systemd-udevd
3799 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901630] [ 469] 0 469 3050 1149 13 3 0 0 haveged
3800 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901632] [ 472] 0 472 9561 857 22 3 0 0 systemd-journal
3801 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901634] [ 689] 100 689 25080 104 18 3 0 0 systemd-timesyn
3802 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901636] [ 1043] 0 1043 4030 471 13 3 0 0 dhclient
3803 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901638] [ 1242] 0 1242 1098 310 8 3 0 0 acpid
3804 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901640] [ 1244] 0 1244 100456 3563 41 4 0 0 amazon-ssm-agen
3805 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901642] [ 1250] 0 1250 159000 581 32 3 0 0 lxcfs
3806 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901644] [ 1261] 0 1261 68965 201 37 3 0 0 accounts-daemon
3807 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901647] [ 1265] 107 1265 10721 131 26 3 0 -900 dbus-daemon
3808 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901649] [ 1273] 0 1273 1304 29 8 3 0 0 iscsid
3809 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901651] [ 1274] 0 1274 1429 877 9 3 0 -17 iscsid
3810 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901653] [ 1317] 104 1317 64097 439 28 3 0 0 rsyslogd
3811 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901655] [ 1320] 0 1320 6510 370 18 3 0 0 atd
3812 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901656] [ 1324] 0 1324 7251 326 19 3 0 0 cron
3813 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901658] [ 1326] 0 1326 7153 187 18 3 0 0 systemd-logind
3814 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901660] [ 1349] 0 1349 43655 2454 52 3 0 0 unattended-upgr
3815 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901662] [ 1356] 0 1356 3342 227 11 3 0 0 mdadm
3816 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901664] [ 1405] 0 1405 69271 205 39 3 0 0 polkitd
3817 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901666] [ 1522] 0 1522 26724 526 53 3 0 0 sendmail-mta
3818 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901668] [ 2479] 0 2479 3596 561 12 5 0 0 gonit
3819 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901670] [ 2519] 0 2519 16377 379 35 3 0 -1000 sshd
3820 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901672] [11494] 1000 11494 11296 465 27 3 0 0 systemd
3821 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901674] [11499] 1000 11499 15834 471 34 3 0 0 (sd-pam)
3822 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901676] [ 537] 0 537 75724 1259 62 3 0 0 .php-fpm.bin
3823 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901678] [ 5327] 0 5327 51057 8905 101 3 0 0 httpd.bin
3824 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901680] [17152] 1 17152 334748 40721 260 4 0 0 httpd.bin
3825 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901682] [ 990] 1 990 105420 25430 120 3 0 0 .php-fpm.bin
3826 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901684] [ 991] 1 991 158017 16922 227 3 0 0 .php-fpm.bin
3827 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901686] [ 3368] 1 3368 103879 24095 118 3 0 0 .php-fpm.bin
3828 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901687] [ 8528] 1 8528 104911 25139 119 3 0 0 .php-fpm.bin
3829 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901689] [13391] 1 13391 334623 43165 266 4 0 0 httpd.bin
3830 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901691] [13707] 0 13707 26724 598 52 3 0 0 sendmail-mta
3831 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901693] [13815] 1 13815 79148 14627 91 3 0 0 .php-fpm.bin
3832 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901695] [13816] 1 13816 79423 14907 91 3 0 0 .php-fpm.bin
3833 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901697] [13817] 1 13817 79423 14586 91 3 0 0 .php-fpm.bin
3834 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901699] [13818] 1 13818 79147 14479 91 3 0 0 .php-fpm.bin
3835 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901701] [13819] 1 13819 79423 14589 91 3 0 0 .php-fpm.bin
3836 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901703] [13823] 1 13823 79407 14762 91 3 0 0 .php-fpm.bin
3837 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901705] [13989] 0 13989 1125 230 8 3 0 0 ctl.sh
3838 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901707] [13992] 0 13992 1125 210 8 3 0 0 ctl.sh
3839 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901709] [13993] 0 13993 1125 237 8 3 0 0 ctl.sh
3840 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901711] [13994] 0 13994 1125 224 8 3 0 0 ctl.sh
3841 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901713] [13995] 0 13995 1125 254 7 3 0 0 ctl.sh
3842 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901714] [13996] 0 13996 1125 261 8 3 0 0 ctl.sh
3843 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901716] [14006] 0 14006 1125 396 8 3 0 0 mysqld_safe
3844 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901718] [14010] 0 14010 1125 387 7 3 0 0 mysqld_safe
3845 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901720] [14012] 0 14012 1125 384 8 3 0 0 mysqld_safe
3846 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901722] [14014] 0 14014 1125 388 8 3 0 0 mysqld_safe
3847 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901724] [14552] 0 14552 75229 3448 37 5 0 -900 snapd
3848 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901726] [16057] 0 16057 92857 70693 180 3 0 0 mysqld.bin
3849 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901728] [16066] 0 16066 92857 70677 179 3 0 0 mysqld.bin
3850 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901730] [16067] 0 16067 92857 70718 179 3 0 0 mysqld.bin
3851 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901732] [16068] 0 16068 107007 70665 180 3 0 0 mysqld.bin
3852 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901734] [16097] 1 16097 51090 1094 73 3 0 0 httpd.bin
3853 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901736] [16110] 1 16110 51057 1191 73 3 0 0 httpd.bin
3854 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901738] [16111] 1 16111 51057 1061 73 3 0 0 httpd.bin
3855 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901740] [16136] 0 16136 9756 46 20 3 0 0 sshd
3856 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901741] [16155] 0 16155 1821 83 8 3 0 0 sleep
3857 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901743] [16156] 0 16156 1821 35 10 3 0 0 sleep
3858 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901745] [16157] 0 16157 1821 37 9 3 0 0 sleep
3859 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901747] [16158] 0 16158 1821 52 8 3 0 0 sleep
3860 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901749] [16159] 0 16159 1821 41 9 3 0 0 sleep
3861 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901751] [16160] 0 16160 1821 36 9 3 0 0 sleep
3862 Oct 7 17:41:54 ip-172-26-1-34 kernel: [1128654.901753] Out of memory: Kill process 16067 (mysqld.bin) score 134 or sacrifice child
I'm not sure how to interpret this. Can someone help steer me in the right direction as far as how to track down what is bringing this machine to its knees?