We have a server running Linux 5.15, and we've had multiple verified incidents where a process is killed by the OOM killer and the whole system then becomes unreachable over the network, for both inbound and outbound traffic. This is a recent syslog trail for the event:
Mar 8 05:16:01 ip-10-110-10-133 kernel: [203986.004138] amazon-cloudwat invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004146] CPU: 3 PID: 1627 Comm: amazon-cloudwat Not tainted 5.15.0-1031-aws #35~20.04.1-Ubuntu
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004149] Hardware name: Amazon EC2 r6i.2xlarge/, BIOS 1.0 10/16/2017
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004150] Call Trace:
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004152] <TASK>
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004155] dump_stack_lvl+0x4a/0x63
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004162] dump_stack+0x10/0x16
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004164] dump_header+0x53/0x225
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004168] oom_kill_process.cold+0xb/0x10
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004171] out_of_memory+0x1dc/0x530
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004174] __alloc_pages_slowpath.constprop.0+0xd32/0xe30
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004179] ? __alloc_pages_slowpath.constprop.0+0xdb6/0xe30
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004182] __alloc_pages+0x2cc/0x310
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004185] alloc_pages+0x90/0x120
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004187] __page_cache_alloc+0x87/0xc0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004191] pagecache_get_page+0x150/0x530
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004193] ? page_cache_ra_unbounded+0x16a/0x220
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004196] filemap_fault+0x527/0xb60
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004198] ? filemap_map_pages+0x138/0x640
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004201] __do_fault+0x3d/0x120
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004205] do_fault+0x1f9/0x420
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004207] __handle_mm_fault+0x62c/0x840
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004210] handle_mm_fault+0xd8/0x2c0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004211] do_user_addr_fault+0x1c2/0x660
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004215] exc_page_fault+0x77/0x170
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004218] asm_exc_page_fault+0x27/0x30
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004221] RIP: 0033:0x44c1a0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004235] Code: Unable to access opcode bytes at RIP 0x44c176.
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004235] RSP: 002b:000000c001705de8 EFLAGS: 00010246
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004238] RAX: 000000c001705f78 RBX: 000000c001705e8c RCX: 0000000000000000
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004240] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000000
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004241] RBP: 000000c001705fb8 R08: 0000000000000001 R09: 000000c00063bb30
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004242] R10: 000000c001705f00 R11: 000000c000b715c0 R12: 000000c000d73080
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004243] R13: ffffffffffffffff R14: 000000c000bec820 R15: 0000000000000000
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004246] </TASK>
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004247] Mem-Info:
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] active_anon:202 inactive_anon:15952689 isolated_anon:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] active_file:146 inactive_file:0 isolated_file:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] unevictable:6279 dirty:3 writeback:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] slab_reclaimable:9465 slab_unreclaimable:13692
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] mapped:68073 shmem:256 pagetables:33680 bounce:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] kernel_misc_reclaimable:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004248] free:91381 free_pcp:1688 free_cma:0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004253] Node 0 active_anon:808kB inactive_anon:63810756kB active_file:584kB inactive_file:0kB unevictable:25116kB isolated(anon):0kB isolated(file):0kB mapped:272292kB dirty:12kB writeback:0kB shmem:1024kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:5536kB pagetables:134720kB all_unreclaimable? no
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004257] Node 0 DMA free:11264kB min:16kB low:28kB high:40kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004262] lowmem_reserve[]: 0 2991 63273 63273 63273
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004266] Node 0 DMA32 free:244188kB min:3188kB low:6244kB high:9300kB reserved_highatomic:0KB active_anon:0kB inactive_anon:2804896kB active_file:0kB inactive_file:544kB unevictable:0kB writepending:0kB present:3129252kB managed:3063716kB mlocked:0kB bounce:0kB free_pcp:684kB local_pcp:112kB free_cma:0kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004270] lowmem_reserve[]: 0 0 60281 60281 60281
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004273] Node 0 Normal free:110072kB min:115576kB low:177292kB high:239008kB reserved_highatomic:2048KB active_anon:808kB inactive_anon:61005860kB active_file:1364kB inactive_file:388kB unevictable:25116kB writepending:12kB present:62898176kB managed:61728412kB mlocked:18340kB bounce:0kB free_pcp:6084kB local_pcp:928kB free_cma:0kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004278] lowmem_reserve[]: 0 0 0 0 0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004281] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11264kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004292] Node 0 DMA32: 188*4kB (UME) 127*8kB (UME) 135*16kB (UME) 69*32kB (UME) 33*64kB (UME) 12*128kB (UME) 8*256kB (UME) 2*512kB (UE) 2*1024kB (UM) 2*2048kB (ME) 55*4096kB (M) = 244280kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004306] Node 0 Normal: 9199*4kB (UME) 5601*8kB (UME) 1449*16kB (UMEH) 193*32kB (UMEH) 7*64kB (MH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 111412kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004318] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004320] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004321] 4359 total pagecache pages
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004322] 0 pages in swap cache
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004323] Swap cache stats: add 0, delete 0, find 0/0
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004324] Free swap = 0kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004325] Total swap = 0kB
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004326] 16510855 pages RAM
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004326] 0 pages HighMem/MovableOnly
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004327] 308983 pages reserved
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004328] 0 pages hwpoisoned
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004328] Tasks state (memory values in pages):
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004329] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004337] [ 211] 0 211 106563 1109 811008 0 -250 systemd-journal
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004340] [ 248] 0 248 2247 979 61440 0 -1000 systemd-udevd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004343] [ 340] 0 340 53652 4488 94208 0 -1000 multipathd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004346] [ 360] 0 360 701 29 45056 0 0 falcond
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004348] [ 361] 0 361 696911 147512 1540096 0 0 falcon-sensor-b
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004350] [ 370] 0 370 2841 375 45056 0 -1000 auditd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004352] [ 441] 100 441 6680 1009 77824 0 0 systemd-network
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004355] [ 446] 101 446 6001 1658 86016 0 0 systemd-resolve
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004357] [ 534] 0 534 60348 354 102400 0 0 accounts-daemon
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004359] [ 535] 0 535 637 165 40960 0 0 acpid
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004361] [ 536] 0 536 200300 13159 385024 0 0 amazon-cloudwat
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004363] [ 540] 103 540 1920 952 53248 0 -900 dbus-daemon
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004365] [ 568] 0 568 20476 612 61440 0 0 irqbalance
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004367] [ 570] 113 570 3256 342 53248 0 0 chronyd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004369] [ 577] 0 577 7494 2846 98304 0 0 networkd-dispat
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004370] [ 580] 113 580 1210 439 53248 0 0 chronyd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004372] [ 603] 0 603 2168 585 57344 0 0 cron
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004375] [ 609] 0 609 59107 779 94208 0 0 polkitd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004377] [ 617] 104 617 56125 866 86016 0 0 rsyslogd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004379] [ 621] 0 621 12231 6333 126976 0 0 salt-minion
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004381] [ 628] 0 628 4307 1002 69632 0 0 systemd-logind
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004382] [ 632] 0 632 98669 829 131072 0 0 udisksd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004384] [ 636] 0 636 951 499 49152 0 0 atd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004386] [ 677] 0 677 1840 447 53248 0 0 agetty
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004388] [ 680] 0 680 4561 533 61440 0 0 wrapper
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004389] [ 700] 0 700 3047 932 61440 0 -1000 sshd
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004391] [ 735] 0 735 60152 1054 102400 0 0 ModemManager
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004393] [ 736] 0 736 1459 385 49152 0 0 agetty
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004395] [ 737] 0 737 27031 2719 110592 0 0 unattended-upgr
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004397] [ 857] 0 857 1461116 107789 1564672 0 0 java
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004399] [ 928] 0 928 248205 12742 299008 0 0 salt-minion
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004401] [ 1041] 0 1041 31500 6490 143360 0 0 salt-minion
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004403] [ 1082] 0 1082 9519 580 69632 0 0 master
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004405] [ 1084] 112 1084 9670 566 65536 0 0 qmgr
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004407] [ 3599] 112 3599 10536 724 69632 0 0 tlsmgr
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004409] [ 97274] 0 97274 307044 307 163840 0 0 newrelic-infra-
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004411] [ 97282] 0 97282 440719 4508 294912 0 0 newrelic-infra
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004413] [ 251487] 112 251487 9585 121 65536 0 0 pickup
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004416] [ 257287] 0 257287 2553 624 57344 0 0 cron
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004418] [ 257288] 0 257288 2553 624 57344 0 0 cron
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004419] [ 257289] 0 257289 2553 623 57344 0 0 cron
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004421] [ 257290] 0 257290 2553 623 57344 0 0 cron
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004423] [ 257292] 3001 257292 2189 118 49152 0 0 bash
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004425] [ 257293] 3001 257293 2189 115 57344 0 0 bash
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004427] [ 257294] 3001 257294 2189 119 49152 0 0 bash
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004429] [ 257296] 3001 257296 2189 118 57344 0 0 bash
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004431] [ 257307] 3001 257307 656 29 40960 0 0 run-one
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004433] [ 257308] 3001 257308 656 29 40960 0 0 run-one
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004434] [ 257309] 3001 257309 656 29 40960 0 0 run-one
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004436] [ 257310] 3001 257310 656 29 40960 0 0 run-one
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004438] [ 257344] 3001 257344 1859 24 53248 0 0 flock
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004440] [ 257346] 3001 257346 7911996 7731555 62894080 0 0 python
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004442] [ 257347] 3001 257347 1859 24 61440 0 0 flock
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004443] [ 257348] 3001 257348 1859 24 49152 0 0 flock
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004445] [ 257349] 3001 257349 338380 158223 2191360 0 0 python
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004447] [ 257351] 3001 257351 365119 184973 2453504 0 0 python
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004449] [ 257355] 3001 257355 1859 25 53248 0 0 flock
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004451] [ 257356] 3001 257356 7883801 7703167 62664704 0 0 python
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004452] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cron.service,task=python,pid=257346,uid=3001
Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004468] Out of memory: Killed process 257346 (python) total-vm:31647984kB, anon-rss:30926220kB, file-rss:0kB, shmem-rss:0kB, UID:3001 pgtables:61420kB oom_score_adj:0
Mar 8 05:17:01 ip-10-110-10-133 CRON[258623]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Mar 8 05:19:31 ip-10-110-10-133 postfix/smtp[258801]: D826A400AE: to=<system@example.com>, relay=email-smtp.us-east-1.amazonaws.com[52.20.44.183]:587, delay=0.39, delays=0.01/0.02/0.13/0.23, dsn=2.0.0, status=sent (250 Ok 01000186bfa906f8-4a6fed17-2ceb-423d-8476-e161c40962b0-000000)
Mar 8 05:23:03 ip-10-110-10-133 newrelic-infra-service[97282]: time="2023-03-08T05:23:03Z" level=warning msg="commands poll failed" component=CommandChannelService error="command request submission failed: Get \"https://infrastructure-command-api.newrelic.com/agent_commands/v1/commands\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
Mar 8 05:27:24 ip-10-110-10-133 newrelic-infra-service[97282]: time="2023-03-08T05:27:24Z" level=warning msg="error occurred while updating the system fingerprint" component=Agent error="unable to fetch AWS metadata: Get \"http://169.254.169.254/latest/dynamic/instance-identity/document\": dial tcp 169.254.169.254:80: i/o timeout"
Mar 8 05:29:29 ip-10-110-10-133 newrelic-infra-service[97282]: time="2023-03-08T05:27:24Z" level=warning msg="commands poll failed" component=CommandChannelService error="command request submission failed: Get \"https://infrastructure-command-api.newrelic.com/agent_commands/v1/commands\": dial tcp 162.247.242.49:443: i/o timeout (Client.Timeout exceeded while awaiting headers)"
Mar 8 05:34:38 ip-10-110-10-133 systemd[1]: Starting Ubuntu Advantage Timer for running repeated jobs...
Mar 8 06:01:58 ip-10-110-10-133 systemd[1]: collector.service: Main process exited, code=exited, status=1/FAILURE
Mar 8 06:07:08 ip-10-110-10-133 systemd[1]: collector.service: Failed with result 'exit-code'.
Mar 8 06:09:58 ip-10-110-10-133 systemd-networkd[441]: ens5: Could not set DHCPv4 address: Connection timed out
Mar 8 06:12:04 ip-10-110-10-133 systemd-networkd[441]: ens5: Failed
Right after the OOM kill, many services start failing, and the errors suggest the problem is not limited to DNS (for example, connections to a raw IP address such as 169.254.169.254 time out). Why would this be the case? Shouldn't the killed process be the only one affected by an OOM kill?
Also puzzling: Postfix's smtp client logs a successful email relay a few minutes after the incident. I'm not sure whether this is a red herring, but all other services report network errors after the OOM. A reboot, of course, fixes everything.
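For reference, the "Tasks state" table in the dump can be tallied with a short script to see which processes hold the memory. This is only a minimal sketch: it assumes 4 KiB pages and the column layout shown in the log above, and the names `parse_tasks`, `TASK_RE` and `PAGE_KB` are hypothetical helpers, not part of any tool. As a sanity check, the killed process's rss of 7731555 pages times 4 kB gives 30926220 kB, which matches the anon-rss value on the "Killed process" line.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages, the x86_64 default

# Matches one row of the "Tasks state" table:
# [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
# The kernel timestamp "[203986.004440]" contains a dot, so it is not matched.
TASK_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)\s*$"
)

def parse_tasks(lines):
    """Return one dict per task row found in the given syslog lines."""
    tasks = []
    for line in lines:
        m = TASK_RE.search(line)
        if not m:
            continue  # header, Mem-Info and non-kernel lines are skipped
        pid, _uid, _tgid, _vm, rss, _pgt, _swap, adj = map(int, m.groups()[:8])
        tasks.append(
            {"pid": pid, "rss_kb": rss * PAGE_KB, "adj": adj, "name": m.group(9)}
        )
    return tasks

# Three rows copied from the dump above.
SAMPLE = [
    "Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004389] [ 700] 0 700 3047 932 61440 0 -1000 sshd",
    "Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004440] [ 257346] 3001 257346 7911996 7731555 62894080 0 0 python",
    "Mar 8 05:16:02 ip-10-110-10-133 kernel: [203986.004451] [ 257356] 3001 257356 7883801 7703167 62664704 0 0 python",
]

tasks = parse_tasks(SAMPLE)
top = sorted(tasks, key=lambda t: t["rss_kb"], reverse=True)
for t in top:
    print(f'{t["pid"]:>7} {t["name"]:<16} rss={t["rss_kb"]} kB adj={t["adj"]}')
```

Feeding the whole syslog file to `parse_tasks` instead of `SAMPLE` should list every task in the table, sorted by resident memory.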