I have a process which slowly consumes more RAM until it eventually hits its cgroup limit and is OOM killed, and I'm trying to figure out why.

Oddly, Go's runtime seems to think not much RAM is used, whereas the OS seems to think a lot is used.

Specifically, looking at runtime.MemStats (via the expvar package) I see:

"Alloc":51491072,
"TotalAlloc":143474637424,
"Sys":438053112,
"Lookups":0,
"Mallocs":10230571,
"Frees":10195515,
"HeapAlloc":51491072,
"HeapSys":388464640,
"HeapIdle":333824000,
"HeapInuse":54640640,
"HeapReleased":0,
"HeapObjects":35056,
"StackInuse":14188544,
"StackSys":14188544,
"MSpanInuse":223056,
"MSpanSys":376832,
"MCacheInuse":166656,
"MCacheSys":180224,
"BuckHashSys":2111104,
"GCSys":13234176,
"OtherSys":19497592

But from the OS perspective:

$ ps auxwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       178  0.0  0.0   3996  3372 pts/0    Ss   17:33   0:00 bash
root       246  0.0  0.0   7636  2828 pts/0    R+   17:59   0:00  \_ ps auxwf
root         1  166  2.8 11636248 5509288 ?    Ssl  17:24  57:15 app server -api-public

So, the OS reports an RSS of 5380 MiB, but the Sys field in MemStats shows only 417 MiB. My understanding is that these fields should be approximately the same, since Sys measures the total bytes of memory the Go runtime has obtained from the OS.

GC is running, as confirmed by setting GODEBUG=gctrace=1,madvdontneed=1. For example, I see output like:

 gc 6882 @2271.137s 0%: 0.037+2.2+0.087 ms clock, 3.5+0.78/37/26+8.4 ms cpu, 71->72->63 MB, 78 MB goal, 96 P

The numbers vary a bit depending on the process, but they are all <100 MB (the three heap figures in the trace are the heap size at GC start, at GC end, and the live heap), whereas the OS is reporting >1 GB (and growing, until the eventual OOM).

madvdontneed=1 was a shot in the dark and seems to make no difference. I wouldn't expect the madvise parameters to be relevant anyway: there doesn't seem to be any need to return memory to the kernel, since the Go runtime doesn't think it's using much memory in the first place.

What could explain this discrepancy? Am I not correctly understanding the semantics of these fields? Are there mechanisms that would result in the growth of RSS (and an eventual OOM kill) but not increase MemStats.Sys?

Phil Frost
  • The `HeapReleased: 0` is unusual here, since it has obviously allocated much more than that. Are you on the latest version? I would see if using `gctrace=1` shows any useful info, especially around the scavenger process. You could also try setting `madvdontneed=1` to see if it helps. I would expect the OS to reclaim memory marked with `MADV_FREE` before invoking OOM, but who knows. – JimB Oct 30 '20 at 18:24
  • I'm on go1.12.17, and have been thinking along similar lines. But if my understanding of `Sys` is correct, whatever is allocating the memory must not be the go allocator, right? And if `HeapReleased` is zero, I wouldn't think `madvdontneed=1` would do anything. – Phil Frost Oct 30 '20 at 18:39
  • Yeah, that released stat is unusual, and the CPU at 166% might be suspicious too. What does the CPU look like when the app is idle? (Note that go1.12 is no longer supported, if this is an actively developed application.) – JimB Oct 30 '20 at 18:43
  • @JimB This is a webserver process where we allocate 4 vCPUs and target 60% utilization with autoscaling. So 166% CPU isn't unusual. Yeah, 1.12 is old. You know how it is... :( – Phil Frost Oct 30 '20 at 18:46
  • My only guess right now is that you have a busy loop tying up the runtime preventing the scavenger from running, which would show up as 100% cpu usage even when idle. – JimB Oct 30 '20 at 18:48
  • @JimB gctrace confirms gc is running, see edit – Phil Frost Oct 30 '20 at 20:59
  • GC doesn't release memory, the scavenger does, which is a separate process. Another possibility here is that you're using cgo and the memory is allocated in C (see the sketch after these comments). – JimB Oct 30 '20 at 21:33
  • Hitting the same problem. Were you able to get to the bottom of this? @PhilFrost – DntFrgtDSemiCln Feb 07 '22 at 22:49
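
To illustrate JimB's cgo point above, here is a hypothetical sketch (not code from the actual server): memory obtained through the C allocator grows RSS once the pages are touched, but is invisible to runtime.MemStats, so Sys stays flat.

    package main

    /*
    #include <stdlib.h>
    */
    import "C"

    import (
        "fmt"
        "runtime"
        "unsafe"
    )

    func main() {
        var before, after runtime.MemStats
        runtime.ReadMemStats(&before)

        // 1 GiB from the C allocator: counts towards RSS once touched,
        // but never appears in MemStats.Sys or HeapSys.
        p := C.malloc(C.size_t(1 << 30))
        defer C.free(p)

        // Touch the pages so the kernel actually backs them with RAM.
        buf := (*[1 << 30]byte)(unsafe.Pointer(p))
        for i := 0; i < len(buf); i += 4096 {
            buf[i] = 1
        }

        runtime.ReadMemStats(&after)
        fmt.Printf("Sys before: %d MiB, after: %d MiB\n", before.Sys>>20, after.Sys>>20)
    }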

1 Answer

This is because memory released with MADV_FREE (which Go used by default from 1.12 through 1.15) still counts towards RSS, even though it's available for reuse by the system. This can cause the Linux OOM killer to select a process with a large amount of MADV_FREE memory for killing, since its RSS is artificially high.

This question is very similar to Why Golang MADV_FREE leads to OOM sometimes?
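
One way to confirm this (a sketch, assuming a Linux kernel new enough to report LazyFree in /proc/<pid>/smaps_rollup, roughly 4.14+): MADV_FREE'd pages stay counted in Rss until the kernel reclaims them, so a large LazyFree value for the process points at exactly this situation.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        // Pass a PID as the first argument, or default to this process.
        pid := "self"
        if len(os.Args) > 1 {
            pid = os.Args[1]
        }

        f, err := os.Open("/proc/" + pid + "/smaps_rollup")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            // Both values are reported in kB; LazyFree pages are still
            // included in Rss until the kernel actually reclaims them.
            if strings.HasPrefix(line, "Rss:") || strings.HasPrefix(line, "LazyFree:") {
                fmt.Println(line)
            }
        }
        if err := sc.Err(); err != nil {
            panic(err)
        }
    }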

jesup