0

I'm having a weird issue. A few times a week, a website I'm running on Amazon Lightsail comes to a crashing halt and becomes IO bound (I can't usually use ssh when this happens) and the server runs out of memory. It's a lightly trafficked site with less than 1000 pages views a day and it's behind a varnish server. I can't run the top command to see the issue because the server becomes IO bound. I think it's thrashing due to low memory. The server has 2GB of RAM which I think should be plenty. The site is not under heavy load when this happens. It just occurs kind of out of the blue. I was able to catch it today and run top and see the CPU load average is in the 30s and 40s. These episodes can happen for about 30 min. to an hour and then just go away as quickly as they came.

I think the issue with a large mysql query but I'm not sure. Here is what top says whenever the binary is run:

PID  USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+    COMMAND
1946 mysql     20   0 1502.5m 615.4m  27.5m S  0.3 30.8   0:35.33 mysqld.bin

The 1502.5 MB VIRT figure and 30.8% memory usage seems awfully high to me.

I had an episode today and looked in the syslog and found this:

3748 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901476] .php-fpm.bin invoked oom-killer: gfp_mask=0x24201ca, order=0, oom_score_adj=0
3749 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901481] .php-fpm.bin cpuset=/ mems_allowed=0
3750 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901486] CPU: 0 PID: 13816 Comm: .php-fpm.bin Not tainted 4.4.0-1114-aws #127-Ubuntu
3751 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901487] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
3752 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901489]  0000000000000286 b444799d441c3af0 ffff88003ec079e0 ffffffff8140accb
3753 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901491]  ffff88003ec07b98 ffff880021027000 ffff88003ec07a50 ffffffff81216c47
3754 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901493]  0000000000000000 ffff88007a349f00 ffff88007b59e200 ffff88003ec07a38
3755 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901496] Call Trace:
3756 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901502]  [<ffffffff8140accb>] dump_stack+0x6d/0x92
3757 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901507]  [<ffffffff81216c47>] dump_header+0x5a/0x1c3
3758 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901512]  [<ffffffff813a25f1>] ? apparmor_capable+0x131/0x1b0
3759 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901517]  [<ffffffff8119b35b>] oom_kill_process+0x20b/0x3d0
3760 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901519]  [<ffffffff8119b768>] out_of_memory+0x1f8/0x470
3761 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901523]  [<ffffffff811a17d3>] __alloc_pages_slowpath.constprop.89+0x943/0xaf0
3762 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901526]  [<ffffffff811a1c1f>] __alloc_pages_nodemask+0x29f/0x2b0
3763 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901529]  [<ffffffff811ed00c>] alloc_pages_current+0x8c/0x110
3764 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901531]  [<ffffffff8119748b>] __page_cache_alloc+0xab/0xc0
3765 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901534]  [<ffffffff81199dc0>] filemap_fault+0x160/0x440
3766 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901536]  [<ffffffff812b07b6>] ext4_filemap_fault+0x36/0x50
3767 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901539]  [<ffffffff811c7537>] __do_fault+0x77/0x110
3768 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901542]  [<ffffffff811cb4d9>] handle_mm_fault+0x1259/0x1b80
3769 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901546]  [<ffffffff8106de84>] __do_page_fault+0x1a4/0x410
3770 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901548]  [<ffffffff8106e112>] do_page_fault+0x22/0x30
3771 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901552]  [<ffffffff81849b38>] page_fault+0x28/0x30
3772 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901568] Mem-Info:
3773 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572] active_anon:463535 inactive_anon:6034 isolated_anon:0
3774 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572]  active_file:247 inactive_file:446 isolated_file:0
3775 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572]  unevictable:913 dirty:0 writeback:0 unstable:0
3776 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572]  slab_reclaimable:5581 slab_unreclaimable:7888
3777 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572]  mapped:22790 shmem:30428 pagetables:3630 bounce:0
3778 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901572]  free:13737 free_pcp:59 free_cma:0
3779 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901576] Node 0 DMA free:8212kB min:352kB low:440kB high:528kB active_anon:6976kB inactive_anon:48kB active_file:16kB inactive_file:128kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:372kB shmem:512kB slab_reclaimable:32kB slab_unreclaimable:148kB kernel_stack:0kB pagetables:176kB unstable:0kB bounce:0kB free_pcp:0kB      local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:876 all_unreclaimable? yes
3780 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901582] lowmem_reserve[]: 0 1967 1967 1967 1967
3781 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901584] Node 0 DMA32 free:46736kB min:44700kB low:55872kB high:67048kB active_anon:1847164kB inactive_anon:24088kB active_file:972kB inactive_file:1656kB unevictable:3652kB isolated(anon):0kB isolated(file):0kB present:2080768kB managed:2030696kB mlocked:3652kB dirty:0kB writeback:0kB mapped:90788kB shmem:121200kB slab_reclaimable:22292kB slab_unreclaimable:31404kB kernel_stack:5696kB pagetables:143     44kB unstable:0kB bounce:0kB free_pcp:236kB local_pcp:236kB free_cma:0kB writeback_tmp:0kB pages_scanned:20120 all_unreclaimable? yes
3782 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901589] lowmem_reserve[]: 0 0 0 0 0
3783 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901592] Node 0 DMA: 3*4kB (UM) 3*8kB (UM) 1*16kB (M) 5*32kB (UME) 3*64kB (UME) 3*128kB (UME) 3*256kB (UME) 3*512kB (UME) 1*1024kB (E) 2*2048kB (ME) 0*4096kB = 8212kB
3784 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901601] Node 0 DMA32: 366*4kB (UMEH) 809*8kB (UMEH) 789*16kB (UEH) 322*32kB (UMEH) 142*64kB (UMEH) 43*128kB (UMEH) 5*256kB (ME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46736kB
3785 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901610] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
3786 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901611] 31732 total pagecache pages
3787 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901612] 0 pages in swap cache
3788 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901614] Swap cache stats: add 0, delete 0, find 0/0
3789 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901614] Free swap  = 0kB
3790 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901615] Total swap = 0kB
3791 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901616] 524189 pages RAM
3792 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901617] 0 pages HighMem/MovableOnly
3793 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901618] 12540 pages reserved
3794 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901619] 0 pages cma reserved
3795 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901620] 0 pages hwpoisoned
3796 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901621] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
3797 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901625] [  428]     0   428    23692      154      17       3        0             0 lvmetad
3798 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901628] [  458]     0   458    10902      860      22       3        0         -1000 systemd-udevd
3799 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901630] [  469]     0   469     3050     1149      13       3        0             0 haveged
3800 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901632] [  472]     0   472     9561      857      22       3        0             0 systemd-journal
3801 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901634] [  689]   100   689    25080      104      18       3        0             0 systemd-timesyn
3802 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901636] [ 1043]     0  1043     4030      471      13       3        0             0 dhclient
3803 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901638] [ 1242]     0  1242     1098      310       8       3        0             0 acpid
3804 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901640] [ 1244]     0  1244   100456     3563      41       4        0             0 amazon-ssm-agen
3805 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901642] [ 1250]     0  1250   159000      581      32       3        0             0 lxcfs
3806 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901644] [ 1261]     0  1261    68965      201      37       3        0             0 accounts-daemon
3807 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901647] [ 1265]   107  1265    10721      131      26       3        0          -900 dbus-daemon
3808 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901649] [ 1273]     0  1273     1304       29       8       3        0             0 iscsid
3809 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901651] [ 1274]     0  1274     1429      877       9       3        0           -17 iscsid
3810 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901653] [ 1317]   104  1317    64097      439      28       3        0             0 rsyslogd
3811 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901655] [ 1320]     0  1320     6510      370      18       3        0             0 atd
3812 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901656] [ 1324]     0  1324     7251      326      19       3        0             0 cron
3813 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901658] [ 1326]     0  1326     7153      187      18       3        0             0 systemd-logind
3814 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901660] [ 1349]     0  1349    43655     2454      52       3        0             0 unattended-upgr
3815 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901662] [ 1356]     0  1356     3342      227      11       3        0             0 mdadm
3816 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901664] [ 1405]     0  1405    69271      205      39       3        0             0 polkitd
3817 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901666] [ 1522]     0  1522    26724      526      53       3        0             0 sendmail-mta
3818 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901668] [ 2479]     0  2479     3596      561      12       5        0             0 gonit
3819 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901670] [ 2519]     0  2519    16377      379      35       3        0         -1000 sshd
3820 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901672] [11494]  1000 11494    11296      465      27       3        0             0 systemd
3821 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901674] [11499]  1000 11499    15834      471      34       3        0             0 (sd-pam)
3822 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901676] [  537]     0   537    75724     1259      62       3        0             0 .php-fpm.bin
3823 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901678] [ 5327]     0  5327    51057     8905     101       3        0             0 httpd.bin
3824 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901680] [17152]     1 17152   334748    40721     260       4        0             0 httpd.bin
3825 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901682] [  990]     1   990   105420    25430     120       3        0             0 .php-fpm.bin
3826 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901684] [  991]     1   991   158017    16922     227       3        0             0 .php-fpm.bin
3827 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901686] [ 3368]     1  3368   103879    24095     118       3        0             0 .php-fpm.bin
3828 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901687] [ 8528]     1  8528   104911    25139     119       3        0             0 .php-fpm.bin
3829 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901689] [13391]     1 13391   334623    43165     266       4        0             0 httpd.bin
3830 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901691] [13707]     0 13707    26724      598      52       3        0             0 sendmail-mta
3831 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901693] [13815]     1 13815    79148    14627      91       3        0             0 .php-fpm.bin
3832 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901695] [13816]     1 13816    79423    14907      91       3        0             0 .php-fpm.bin
3833 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901697] [13817]     1 13817    79423    14586      91       3        0             0 .php-fpm.bin
3834 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901699] [13818]     1 13818    79147    14479      91       3        0             0 .php-fpm.bin
3835 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901701] [13819]     1 13819    79423    14589      91       3        0             0 .php-fpm.bin
3836 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901703] [13823]     1 13823    79407    14762      91       3        0             0 .php-fpm.bin
3837 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901705] [13989]     0 13989     1125      230       8       3        0             0 ctl.sh
3838 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901707] [13992]     0 13992     1125      210       8       3        0             0 ctl.sh
3839 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901709] [13993]     0 13993     1125      237       8       3        0             0 ctl.sh
3840 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901711] [13994]     0 13994     1125      224       8       3        0             0 ctl.sh
3841 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901713] [13995]     0 13995     1125      254       7       3        0             0 ctl.sh
3842 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901714] [13996]     0 13996     1125      261       8       3        0             0 ctl.sh
3843 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901716] [14006]     0 14006     1125      396       8       3        0             0 mysqld_safe
3844 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901718] [14010]     0 14010     1125      387       7       3        0             0 mysqld_safe
3845 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901720] [14012]     0 14012     1125      384       8       3        0             0 mysqld_safe
3846 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901722] [14014]     0 14014     1125      388       8       3        0             0 mysqld_safe
3847 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901724] [14552]     0 14552    75229     3448      37       5        0          -900 snapd
3848 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901726] [16057]     0 16057    92857    70693     180       3        0             0 mysqld.bin
3849 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901728] [16066]     0 16066    92857    70677     179       3        0             0 mysqld.bin
3850 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901730] [16067]     0 16067    92857    70718     179       3        0             0 mysqld.bin
3851 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901732] [16068]     0 16068   107007    70665     180       3        0             0 mysqld.bin
3852 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901734] [16097]     1 16097    51090     1094      73       3        0             0 httpd.bin
3853 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901736] [16110]     1 16110    51057     1191      73       3        0             0 httpd.bin
3854 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901738] [16111]     1 16111    51057     1061      73       3        0             0 httpd.bin
3855 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901740] [16136]     0 16136     9756       46      20       3        0             0 sshd
3856 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901741] [16155]     0 16155     1821       83       8       3        0             0 sleep
3857 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901743] [16156]     0 16156     1821       35      10       3        0             0 sleep
3858 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901745] [16157]     0 16157     1821       37       9       3        0             0 sleep
3859 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901747] [16158]     0 16158     1821       52       8       3        0             0 sleep
3860 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901749] [16159]     0 16159     1821       41       9       3        0             0 sleep
3861 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901751] [16160]     0 16160     1821       36       9       3        0             0 sleep
3862 Oct  7 17:41:54 ip-172-26-1-34 kernel: [1128654.901753] Out of memory: Kill process 16067 (mysqld.bin) score 134 or sacrifice child

I'm not sure how to interpret this. Can someone help steer me in the right direction as far as how to track down what is bringing this machine to its knees?

StevieD
  • 514
  • 8
  • 24
  • 1
    You ran out of memory and the kernel killed MySQL. I haven't used MySQL in a long time but I wonder if it's normal for it to have 4 mysqld running? Do you have some cron or other to start it? – Ginnungagap Oct 08 '20 at 06:08
  • This is a stack built by bitnami. So I'm not sure how they have it configured. I assumed those daemons only spawned as needed and then disappeared. Not sure. – StevieD Oct 08 '20 at 10:32
  • If your site has so little traffic, what is using all the memory in your instance? – Michael Hampton Oct 08 '20 at 15:27
  • It's definitely mysql. It's a fairly large database, about 3 GB, so perhaps that is the issue. I did some testing and when I set up the virtual server with its own database server, it performed well. So I am just going to go with a smaller front end and and a database server on the back end. – StevieD Oct 08 '20 at 20:45
  • Hi. I think I am seeing the same https://github.com/bitnami/vms/issues/927. Did you find anything else? thanks! – Mariano Martinez Peck May 21 '23 at 21:36

1 Answers1

0

Also, it appears that you don't have a swap file? So the first thing I would do is add a swapfile. (very temporary solution, that may buy you some time) The real question is when did this start happening and why? Use sar -f /var/log/sa/saXX

to look for past events and see if you can correlate the times. I believe you may be right, you may have a sqlquery running that is perhaps looking over the whole tablespace. Were any SQL queries introduced since this started happening?

Mark Scheck
  • 211
  • 1
  • 5
  • 1
    I forgot to mention this was a virtual machine built by Bitnami. I was asking in the bitnami forums the other day why the `free` command reported 0 bytes on swap on amazon lightsail. I didn't get a good answer. I'm deeply confused by this. But it's doubtful bitnami would overlook this so I assume there is a good reason. – StevieD Oct 08 '20 at 10:29
  • Ahhh a VM which I believe since the VM is a single file, a swap file really does nothing. Though I would have to research this – Mark Scheck Oct 08 '20 at 13:02