
I am trying to figure out why the resident memory for one version of a program ("new") is much higher (5x) than for another version of the same program ("baseline"). The program runs on a Linux cluster with E5-2698 v3 CPUs and is written in C++. The baseline is a multiprocess program and the new one is a multithreaded program; both fundamentally run the same algorithm and computation and operate on the same input data. In both, there are as many processes or threads as cores (64), with threads pinned to CPUs. I've done a fair amount of heap profiling using both Valgrind Massif and Heaptrack, and they show that the heap allocations are the same (as they should be). The RSS for both the baseline and the new version of the program is larger than the LLC.

The machine has 64 logical cores (hyperthreads). For both versions, I straced the relevant processes and found some interesting results. Here's the strace command I used:

strace -k -p <pid> -e trace=mmap,munmap,brk

Here are some details about the two versions:

Baseline Version:

  • 64 processes
  • RES is around 13 MiB per process
  • using hugepages (2MB)
  • no malloc/free-related syscalls were made from the strace call listed above (more on this below)

top output (screenshot: "Baseline top")

New Version

  • 2 processes
  • 32 threads per process
  • RES is around 2 GiB per process
  • using hugepages (2MB)
  • this version does a fair amount of memcpy calls on large buffers (25 MB) with the default memcpy implementation (which, I think, is supposed to use non-temporal stores, but I haven't verified this)
  • in release and profile builds, many mmap and munmap calls were generated. Curiously, none were generated in debug mode (more on that below).

top output, same columns as baseline (screenshot: "New top")

Assuming I'm reading this right, the new version has 5x higher RSS in aggregate (across the entire node) and significantly more page faults (as measured using perf stat) when compared to the baseline version. When I run perf record/report on the page-faults event, it shows that all of the page faults are coming from a memset in the program. However, the baseline version has that memset as well, and there are no page faults due to it (as verified using perf record -e page-faults). One idea is that some other memory pressure is, for some reason, causing the memset to page-fault.

So, my question is: how can I understand where this large increase in resident memory is coming from? Are there performance monitor counters (i.e., perf events) that can help shed light on this? Or, is there a heaptrack- or massif-like tool that will let me see what data actually makes up the RES footprint?

One of the most interesting things I noticed while poking around is the inconsistency of the mmap and munmap calls as mentioned above. The baseline version didn't generate any of those; the profile and release builds (basically, -march=native and -O3) of the new version DID issue those syscalls but the debug build of the new version DID NOT make calls to mmap and munmap (over tens of seconds of stracing). Note that the application is basically mallocing an array, doing compute, and then freeing that array -- all in an outer loop that runs many times.

It might seem that the allocator can easily reuse the allocated buffer from the previous outer-loop iteration in some cases but not others -- although I don't understand how these mechanisms work or how to influence them. I believe allocators have a notion of a time window after which application memory is returned to the OS. One guess is that in the optimized code (release builds), vectorized instructions are used for the computation, making it much faster. That may change the timing of the program such that the memory is returned to the OS; although I don't see why this doesn't happen in the baseline. Maybe the threading is influencing this?

(As a shot-in-the-dark comment, I'll also say that I tried the jemalloc allocator, both with default settings and with changed settings, and I got a 30% slowdown with the new version but no change on the baseline. I was a bit surprised here, as my previous experience with jemalloc was that it tends to produce some speedup with multithreaded programs. I'm adding this comment in case it triggers some other thoughts.)

  • Are you sure the baseline version isn't optimizing malloc+memset into `calloc` which leaves pages untouched? Does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload? Or maybe just different allocation / free is making your allocator hand pages back to the OS instead of keeping them in a free list, resulting in a page fault after each allocation. Maybe `strace` for `mmap` / `munmap` or `brk` system calls. – Peter Cordes May 11 '20 at 19:33
  • I think your questions are on the right track, although I don't understand some of the things you said. The baseline is implemented as using many processes and no threads; the new one has many threads within the same process. There may be some optimization that is now possible with the threaded version. I am using hugepages (2MB) although I don't really understand how that is affected by threads, etc. I will also try strace as well. I do see perf top showing that the new version is spending time in `syscall_return_via_sysret` so there is likely some syscalls happening. – Kulluk007 May 11 '20 at 20:00
  • Put details like that in your question; that's a huge change. There are some ways to find out which pages are resident (`mincore()`) but working out any meaning to that is going to depend on what kind of change you made! – Peter Cordes May 11 '20 at 20:10
  • @PeterCordes you're right -- I added additional details and found some interesting differences due to `strace` as you suggested using. Please see the updated post. – Kulluk007 May 12 '20 at 01:44
  • Ok, well that's your source of pagefaults. The `strace` test is pretty definitive, and a backtrace of munmap calls could identify the guilty `free` calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold. – Peter Cordes May 12 '20 at 01:57

1 Answer


In general: GCC can optimize malloc+memset into calloc, which leaves pages untouched. If you only actually touch a few pages of a large allocation, losing that optimization could account for a big difference in page faults.

Or does the change between versions maybe let the system use transparent hugepages differently, in a way that happens to not be good for your workload?

Or maybe just different allocation / free is making your allocator hand pages back to the OS instead of keeping them in a free list. Lazy allocation means you get a soft page fault on the first access to a page after getting it from the kernel. Use strace to look for mmap / munmap or brk system calls.


In your specific case, your strace testing confirms that your change led to malloc / free handing pages back to the OS instead of keeping them on a free list.

This fully explains the extra page faults. A backtrace of munmap calls could identify the guilty free calls. To fix it, see https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html / http://man7.org/linux/man-pages/man3/mallopt.3.html, especially M_MMAP_THRESHOLD (perhaps raise it to get glibc malloc not to use mmap for your arrays?). I haven't played with the parameters before. The man page mentions something about a dynamic mmap threshold.


It doesn't explain the extra RSS; are you sure you aren't accidentally allocating 5x the space? If you aren't, perhaps better alignment of the allocation lets the kernel use transparent hugepages where it didn't before, maybe leading to wasting up to 1.99 MiB at the end of an array instead of just under 4k? Or maybe Linux wouldn't use a hugepage if you only allocated the first couple of 4k pages past a 2M boundary.

If you're getting the page faults in memset, I assume these arrays aren't sparse and that you are touching every element.


I believe allocators have a notion of a time window after which application memory is returned to the OS

It would be possible for an allocator to check the current time every time you call free, but that's expensive so it's unlikely. It's also very unlikely that they use a signal handler or separate thread to do a periodic check of free-list size.

I think glibc just uses a size-based heuristic that it evaluates on every free. As I said, the man page mentions something about heuristics.

IMO, actually tuning malloc (or finding a malloc implementation that's better suited to your situation) should probably be a separate question.

  • Thank you - this is great. I was able to remove the extra syscalls via an application change although `M_MMAP_THRESHOLD` is something I will use if necessary. – Kulluk007 May 13 '20 at 01:25
  • Regarding the allocator time window comment: I got this notion from jemalloc -- see https://github.com/jemalloc/jemalloc/blob/dev/TUNING.md and specifically `dirty_decay_ms` and `muzzy_decay_ms`. I haven't seen anything about the glibc malloc having this, though. – Kulluk007 May 13 '20 at 01:28
  • An unsolved issue is the RSS overhead -- top still shows the same large RSS. I'm also using https://github.com/brendangregg/wss (wss.pl) to verify that my working set size, even after fixing the page fault issues, is still large in the new version. When you mention alignment issues to enable THP optimizations -- what are the best practices here? Should I be using `posix_memalign` with a 2MB alignment for these 26MB allocations? – Kulluk007 May 13 '20 at 01:32
  • @Kulluk007: If your code repeatedly uses all of the 26MB allocations then yes, `posix_memalign` or `aligned_alloc` to allocate 2M-aligned regions is probably good. IDK how they communicate this to mmap, or if they just over-allocate. Virtual address space is cheap (especially untouched pages) so it's fine. Then use `madvise(MADV_HUGEPAGE)` on your allocation to tell the kernel to prefer hugepages. Especially if you have `/sys/kernel/mm/transparent_hugepage/defrag` set to `defer+madvise` so madvise hints the kernel to spend time defragging to get 2M chunks of contiguous physical mem. – Peter Cordes May 13 '20 at 01:46
  • I've tried the `posix_memalign` and `memalign` suggestions but that doesn't change the amount of resident memory that the new version is using. I'm wondering if there are any direct ways to profile resident memory -- sort of like a valgrind massif for resident memory that maps back to the original allocation in the code. It seems like some tool is buildable given `mincore` and some bookkeeping. In any case, I'm curious if `mincore` is the best approach for assessing where the resident is coming from or if there are others to consider. – Kulluk007 May 18 '20 at 15:11
  • I didn't expect that alignment would help significantly with RSS, just performance from reduced dTLB misses. IDK, there might be tools for tracking residency of allocations. If you're expecting over-allocation but for some of it to be never touched (and thus never resident), you could look at `/proc/<pid>/smaps` to see the Private_Dirty amount for each mapping. For anonymous mappings, written memory stays dirty indefinitely, I think. (Or maybe getting paged out to a swap file would count as clean, in which case it would be residency.) – Peter Cordes May 18 '20 at 17:14