
My tools are Linux, gcc and pthreads. When my program calls new/delete from several threads, and there is contention for the heap, 'arenas' are created (see http://www.bozemanpass.com/info/linux/malloc/Linux_Heap_Contention.html for reference). My program runs 24x7, and arenas are still occasionally being created after 2 weeks. I think there may eventually be as many arenas as threads. ps(1) shows alarming memory consumption, but I suspect that only a small portion of it is actually mapped.

What is the 'overhead' for an empty arena? (How much more memory per arena is used than if all allocation was confined to the traditional heap? )

Is there any way to force the creation in advance of n arenas? Is there any way to force the destruction of empty arenas?

trincot
rleir

5 Answers


struct malloc_state (aka mstate, the arena descriptor) has size:

  • glibc-2.2: (256+18)*4 bytes =~ 1 KB for 32-bit mode and ~2 KB for 64-bit mode.
  • glibc-2.3: (256+256/32+11+NFASTBINS)*4 =~ 1.1-1.2 KB in 32-bit mode and 2.4-2.5 KB for 64-bit mode.

See struct malloc_state in the glibc-x.x.x/malloc/malloc.c file.

osgx
    Don't you have to round it up to the next MMU paging block size? Thanks for the answer! – rleir Feb 05 '10 at 18:00
  • It is the internal arena descriptor. Each arena descriptor is placed in an mmap-ed segment. A limit of 65k mmaps maximum is hardcoded. Each mmap takes some resources from the OS kernel (a VMA). – osgx Feb 06 '10 at 22:32
  • All arena descriptors are in a circularly linked list beginning from main_arena. Every new arena is placed at the beginning of an mmap-ed region, at an offset of sizeof(heap_info) = 4*sizeof(void*) = 16 or 32 bytes. The heap (mmap-ed segment) is aligned, has a size between HEAP_MIN_SIZE and HEAP_MAX_SIZE, and has the native alignment of mmap calls (= page = 4k). The rest of the heap (after heap_info and mstate) is used for malloc_chunks (malloc-ed data). – osgx Feb 06 '10 at 22:54
  • Sorry, HEAP_MIN_SIZE = 32*1024 (32 KB) and HEAP_MAX_SIZE = 1024*1024 (1 MB). – osgx Feb 06 '10 at 22:55
  • HEAP_MAX_SIZE = 1 MB is the maximum size of an arena, so there will be a LOT of arenas in a big program. – osgx Feb 06 '10 at 23:32

Destruction of arenas... I don't know yet, but there is this text (briefly: it says NO to the possibility of destroying/trimming memory) from the analysis at http://www.citi.umich.edu/techreports/reports/citi-tr-00-5.pdf, from 2000 (a bit outdated). Please name your glibc version.

Ptmalloc maintains a linked list of subheaps. To reduce lock contention, ptmalloc searches for the first unlocked subheap and grabs memory from it to fulfill a malloc() request. If ptmalloc doesn't find an unlocked heap, it creates a new one. This is a simple way to grow the number of subheaps as appropriate without adding complicated schemes for hashing on thread or processor ID, or maintaining workload statistics. However, there is no facility to shrink the subheap list and nothing stops the heap list from growing without bound.
osgx
  • There is code for heap (aka arena) trimming (heap_trim), but it works only for a completely free arena. – osgx Feb 06 '10 at 23:13
  • Such a "simple way" of growing the subheap count will lead to continuous creation of arenas (subheaps). The arena count can also grow because of heap fragmentation. – osgx Feb 16 '10 at 23:31

From malloc.c (glibc 2.3.5), line 1546:

/*
  -------------------- Internal data structures --------------------
   All internal state is held in an instance of malloc_state defined
   below. 
 ...
   Beware of lots of tricks that minimize the total bookkeeping space
   requirements. **The result is a little over 1K bytes** (for 4byte
   pointers and size_t.)
*/

This matches the result I got for 32-bit mode: a little over 1K bytes.

osgx

Consider using TCMalloc from google-perftools. It is simply better suited for threaded and long-living applications, and it is very FAST. Take a look at http://goog-perftools.sourceforge.net/doc/tcmalloc.html, especially at the graphs (higher is better). TCMalloc is twice as fast as ptmalloc.

osgx

In our application the main cost of multiple arenas has been "dark" memory: memory allocated from the OS that we don't have any references to.

The pattern you can see is

Thread X goes to alloc, hits a collision, creates a new arena.
Thread X makes some large allocations.
Thread X makes some small allocation(s).
Thread X stops allocating.

Large allocations are freed, but the whole arena up to the high-water mark of the last currently active allocation is still using up VMEM, and other threads won't use this arena unless they hit contention in the main arena.

Basically it's a contributor to "memory fragmentation", since there are multiple places memory can be available, but needing to grow an arena is not a reason to look in the other arenas. At least I think that's the cause; the point is that your application can end up with a bigger VM footprint than you think it should have. This mostly hits you if you have limited swap, since, as you say, most of this ends up paged out.

Our (memory hungry) application can have 10s of percent of memory "wasted" in this way, and it can really bite in some situations.

I'm not sure why you would want to create empty arenas. If allocations and frees happen in the same thread as each other, then I think that over time they will all tend to end up in that thread's own arena with no contention. You may see some small blips while you get there, so maybe that's a reason.

Peter
  • Thanks for this. I would like to select this answer as 'best' tied with osgx's answers. – rleir Dec 26 '11 at 11:44