1

My application makes a lot of allocations of exactly 24 bytes, but I am using a third party library that requires the allocator to provide a minimum of 16-byte alignment.

So, if I compile jemalloc configured for 8-byte alignment (--with-lg-quantum=3), I get a 24-byte allocation but my third party library fails.

If I compile jemalloc configured for 16-byte alignment (--with-lg-quantum=4), my malloc((size_t)24) calls allocate 32 bytes. This is a 33.3% increase in memory usage. However, I need the regular malloc((size_t)24) calls to return 16-byte aligned blocks (and therefore 32 bytes) so my third party library works.
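
To make the arithmetic concrete, here is a minimal sketch under a simplified model in which the allocator rounds every small request up to a multiple of its quantum. jemalloc's real size classes are more elaborate (for instance the 8-byte class mentioned in the comments), and the helper names are hypothetical, not part of jemalloc:

```c
#include <stddef.h>

/* Round size up to the next multiple of quantum.
   Assumption: quantum is a power of two (8 for lg-quantum=3, 16 for lg-quantum=4). */
static size_t round_up_to_quantum(size_t size, size_t quantum) {
    return (size + quantum - 1) & ~(quantum - 1);
}

/* Extra memory consumed by the rounding, as a percentage of the requested size. */
static double overhead_percent(size_t size, size_t quantum) {
    size_t rounded = round_up_to_quantum(size, quantum);
    return 100.0 * (double)(rounded - size) / (double)size;
}
```

Under this model a 24-byte request stays at 24 bytes with a quantum of 8, and becomes 32 bytes with a quantum of 16, which is where the 33.3% figure comes from.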

How can I allocate 24-byte blocks (8-byte aligned) from my application so that memory is used efficiently?

I tried aligned_alloc(8, 24), but it still allocates 32 bytes, 16-byte aligned.

Nicol Bolas
LeoMurillo
  • 4
    Use a custom allocator for that, which only serves 24-byte 8-aligned allocations. Problem solved. (Naturally, that assumes you can ensure all those allocations and deallocations without fail use your custom allocator instead of the general one.) – Deduplicator Apr 06 '20 at 23:17
  • 2
    So, how can you get 24-bit alignment when 24 is not divisible by 16? Yes, you'll have a waste of 8 bytes, but I don't see any other choice. You can't allocate 24 bytes on a 16-byte boundary without having any padding. – Thomas Matthews Apr 06 '20 at 23:21
  • @ThomasMatthews I don't need the 24-byte to be 16-byte aligned. jemalloc has an 8-byte allocation size class even when configured for 16-byte alignment. I'm trying to find how to do the same with 24-bytes, but having regular 24-byte malloc requests still returning 16-byte aligned, so it doesn't break the 3rd party lib. – LeoMurillo Apr 06 '20 at 23:27
  • @LeoMurillo "*jemalloc has an 8-byte allocation size class even when configured for 16-byte alignment*" I don't know jemalloc, but any 16-byte aligned allocation will *have* to be 8-byte aligned, because any number which is divisible by 2^4 is by definition divisible by 2^3. You can't get that behavior with 24 because 24 is not divisible by 16. – Nicol Bolas Apr 06 '20 at 23:35

2 Answers

8

If you are making a lot of allocations of exactly 24 bytes, and the memory efficiency of those allocations is a concern, you shouldn't be using malloc directly at all. You should be using a pool allocator with a size of 24 bytes.

Pool allocators allocate a large chunk of memory from the heap, then divide it up into fixed-size blocks of the size you specify. This way, you not only avoid alignment overhead, you also avoid the overhead of the information used by the heap to keep track of free blocks of data. You also help avoid fragmentation caused by such tiny allocations. Freeing those allocations is quite fast (as is making them). And so forth.

With a pool allocator, you ought to be able to allocate exactly what you need, without disturbing the library you're working with. You can allocate a slab of memory for 10,000 24-byte blocks, and the only overhead will be the bookkeeping needed to keep track of free blocks (which can mostly use the free blocks themselves if you're clever).
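
As a rough illustration of the idea, here is a minimal single-slab sketch. It assumes a fixed capacity chosen up front and a single thread, and all names (`slab_pool`, `slab_pool_alloc`, and so on) are hypothetical. Freed blocks are linked through their own storage, so the free list costs no extra memory:

```c
#include <stddef.h>
#include <stdlib.h>

#define OBJ_SIZE 24   /* must be a multiple of sizeof(void *) */

/* A freed block is reused to hold the link to the next freed block. */
typedef struct free_node { struct free_node *next; } free_node;

typedef struct {
    unsigned char *slab;   /* one large allocation from the general heap */
    size_t capacity;       /* number of blocks the slab can hold */
    size_t used;           /* blocks carved from the slab so far */
    free_node *free_list;  /* blocks that were returned to the pool */
} slab_pool;

/* Returns 1 on success, 0 if the slab could not be allocated. */
static int slab_pool_init(slab_pool *pool, size_t capacity) {
    pool->slab = malloc(capacity * OBJ_SIZE);
    pool->capacity = capacity;
    pool->used = 0;
    pool->free_list = NULL;
    return pool->slab != NULL;
}

static void *slab_pool_alloc(slab_pool *pool) {
    if (pool->free_list) {               /* reuse a freed block first */
        free_node *node = pool->free_list;
        pool->free_list = node->next;
        return node;
    }
    if (pool->used < pool->capacity)     /* otherwise carve a fresh block */
        return pool->slab + OBJ_SIZE * pool->used++;
    return NULL;                         /* exhausted: a real pool would add a slab */
}

static void slab_pool_free(slab_pool *pool, void *ptr) {
    free_node *node = ptr;
    node->next = pool->free_list;
    pool->free_list = node;
}

static void slab_pool_destroy(slab_pool *pool) {
    free(pool->slab);
    pool->slab = NULL;
}
```

Consecutive blocks are exactly 24 bytes apart, so they are 8-byte aligned but only every other block is 16-byte aligned. That is fine as long as these blocks are never handed to code that requires 16-byte alignment.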

Nicol Bolas
  • The space overhead is higher when performance is a consideration. Searching 10000 bits is a huge time waster, even if you're checking 32 bits at a time. But, yes, a pool allocator is the answer. – user3386109 Apr 07 '20 at 00:07
  • Thanks, I'll look into this. My first feeling is I am transforming an efficiency problem into a fragmentation problem. I'd need some defrag logic. I don't know how many 24-blocks I will need up front, and they come and go. – LeoMurillo Apr 07 '20 at 00:13
  • 1
    How does this achieve 16-byte alignment for each of the 24-byte blocks as the OP requires? – Avi Berger Apr 07 '20 at 01:59
  • @AviBerger: Because that's not what the OP asked for. The OP wants *their* allocations of 24-bytes to only take up 24-bytes (with 8-byte alignment), but the OP wants this other library to be able to have the 16-byte aligned allocations that it requires. Therefore, the OP should use their own allocator distinct from the standard one the library uses. – Nicol Bolas Apr 07 '20 at 02:33
  • You could be right, in which case you have a good answer. I asked the wrong question. I should have asked if the OP was passing any of those allocated blocks to the third party library. I rashly assumed he was. Rereading the question more carefully now, it isn't clear to me whether or not that is the case. – Avi Berger Apr 07 '20 at 03:26
  • 1
    @NicolBolas: *You can allocate a slab of memory for 10,000 24-byte blocks*: if he does that, every other object will be misaligned. *the only overhead will be the 10,000 bits needed to keep track of which blocks are free and which blocks are not.* A bitmap is not even needed: a counter and a free list consume less space and are simpler to use. – chqrlie Apr 07 '20 at 14:45
  • @AviBerger: I am not passing my 24-byte contiguous blocks (8-byte aligned) to the third party lib. I need malloc(24) to return 16-byte aligned, so the 3rd party lib works. But I want to be able to allocate my app's millions of 24-byte chunks 8-byte aligned for efficiency – LeoMurillo Apr 07 '20 at 16:34
  • 1
    @LeoMurillo: Good. You need to do the sort of thing in NicolBolas' answer. If you are multi-threaded, you need to factor in how this interacts with your threading model. You've tagged both C and C++, so we're not sure what you're using. If C++, you might check out the Boost Pool library to see if it would be useful or give you implementation ideas. – Avi Berger Apr 07 '20 at 17:38
2

Your application allocates a lot of 24-byte objects and you want to combine these objectives:

  • align each 24-byte object on a 16-byte boundary, presumably to read its contents with SIMD instructions
  • use as little memory as possible, ideally just 24 bytes per object.

These objectives are incompatible: if the objects require a 16-byte alignment, they must be at least 32 bytes apart, regardless of how you allocate memory.
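
The compiler applies the same rule to types: the size of an object is always a multiple of its alignment. A small sketch, assuming a C11 compiler (the type names are made up for illustration):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Three 8-byte fields: 24 bytes, 8-byte aligned on common 64-bit ABIs. */
typedef struct { uint64_t a, b, c; } plain24;

/* Same payload, but forced onto a 16-byte boundary. */
typedef struct { alignas(16) uint64_t a; uint64_t b, c; } aligned24;

/* sizeof must be a multiple of the alignment so that array elements
   stay aligned, hence 8 bytes of tail padding are added. */
_Static_assert(sizeof(plain24) == 24, "no padding needed for 8-byte alignment");
_Static_assert(sizeof(aligned24) == 32, "padded up to a multiple of 16");
```

An array of `aligned24` therefore uses 32 bytes per element, no matter which allocator provides the memory.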

The C library malloc() probably enforces 16-byte alignment on your system already (it is a common requirement on 64-bit systems for SIMD compatible data), but could use the 8-byte slack at the end of the block for its own bookkeeping data. jemalloc certainly does. So the overhead is not wasted but inherent to the allocation algorithm.

Allocating objects in pools does not help with the packing, because of the alignment constraint. It might be more efficient, but modern malloc() implementations are remarkably efficient and some do use thread-based pools (for example tcmalloc).

Designing your own allocation scheme is tricky and error prone, and linking a custom malloc() implementation is not trivial either, as it may cause problems with the C library functions' own use of malloc(). I would strongly advise against these approaches unless you are very proficient in C and have a good understanding of your system.

There is one possible direction to improve packing: if you also allocate many 8-byte objects, you could interlace them in combined pools of 32-byte chunks, using the first 24 bytes for a 24-byte object aligned on a 16-byte boundary and the 8 remaining bytes for a separate 8-byte object aligned on an 8-byte boundary.
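
A possible layout for such a combined chunk, as a sketch only (the type and field names are hypothetical, and a C11 compiler is assumed):

```c
#include <stdalign.h>
#include <stddef.h>

/* One 32-byte chunk shared by two unrelated objects. */
typedef struct {
    alignas(16) unsigned char big[24];  /* the 24-byte object, 16-byte aligned */
    alignas(8)  unsigned char small[8]; /* a separate 8-byte object in the slack */
} chunk32;

/* Recover the enclosing chunk from a pointer to its 8-byte part,
   which a combined pool needs when either half is freed. */
static chunk32 *chunk_from_small(void *small_ptr) {
    return (chunk32 *)((unsigned char *)small_ptr - offsetof(chunk32, small));
}
```

The chunk can only go back to the pool once both halves have been released, so the pool needs some per-chunk state to track that.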

Another approach would be to split the storage of your 24-byte objects into an array of 16-byte parts and another array of 8-byte parts using the same index to access the parts of the same logical object. If you know the maximum number of such objects to allocate, it is a workable solution. You would use index values instead of pointers to access the parts. This may require substantial modifications of your code.
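
A minimal sketch of this split layout, assuming the maximum number of objects is known at compile time (`MAX_OBJECTS` and every other name here is hypothetical). Free slots are chained through the 8-byte parts, so no extra bookkeeping array is needed:

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OBJECTS 1024          /* assumed upper bound, known in advance */
#define NO_OBJECT   UINT32_MAX    /* returned when no slot is available */

typedef struct { alignas(16) unsigned char bytes[16]; } part16;
typedef struct { uint64_t value; } part8;

static part16 parts16[MAX_OBJECTS];  /* 16-byte aligned halves */
static part8  parts8[MAX_OBJECTS];   /* 8-byte halves, same index */

static uint32_t free_head = NO_OBJECT;  /* most recently freed index */
static uint32_t high_water;             /* indices below this were handed out */

/* Returns the index of a logical 24-byte object, or NO_OBJECT. */
static uint32_t split_alloc(void) {
    if (free_head != NO_OBJECT) {
        uint32_t idx = free_head;
        free_head = (uint32_t)parts8[idx].value;  /* next free index */
        return idx;
    }
    return high_water < MAX_OBJECTS ? high_water++ : NO_OBJECT;
}

/* Releases an index; its 8-byte part is reused as the free-list link. */
static void split_free(uint32_t idx) {
    parts8[idx].value = free_head;
    free_head = idx;
}
```

The two arrays together use exactly 24 bytes per object, and `parts16[i]` is always 16-byte aligned. The price is that callers hold an index rather than a pointer and touch two separate memory regions per object.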

Memory is quite cheap and abundant on current systems. Unless you target existing deployed embedded systems, specifying more RAM for your application is a simple and effective approach.

Here is a pool allocator for 24-byte objects with very small overhead. Try and see if you use less memory with it and get better performance:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

typedef struct pool_link_t {
    struct pool_link_t *next;   // generic link for the free list
} pool_link_t;

typedef struct pool_page_t {
    struct pool_block_t *head;   // pointer to the block at the start of each page
} pool_page_t;

typedef struct pool_block_t pool_block_t;
struct pool_block_t {
    pool_block_t *head;         // at the start of each page
    pool_block_t *next, *prev;  // pool_block linkage
    size_t block_size;          // mmapped size
    size_t avail_page_count;    // number of unused pages
    size_t avail_count;         // number of unused objects
    size_t free_count;          // length of free list
    pool_link_t *free_list;     // free list
    pool_link_t *avail_ptr;     // pointer to unused object area
};

#define PAGE_SIZE        0x1000     // system dependent
#define POOL_BLOCK_SIZE  0x100000   // must be a multiple of PAGE_SIZE
#define POOL_OBJ_SIZE    24         // must be a multiple of sizeof(void*)

static pool_block_t dummy_arena = {
    &dummy_arena, &dummy_arena, &dummy_arena, 0, 0, 0, 0, NULL, NULL,
};
static pool_block_t *pool24 = &dummy_arena;

void *malloc24(void) {
    pool_block_t *p, *startp;
    for (startp = p = pool24;; pool24 = p = p->next) {
        if (p->free_count) {
            pool_link_t *link = p->free_list;
            p->free_list = link->next;
            p->free_count--;
            return link;
        }
        if (p->avail_count) {
            void *ptr = p->avail_ptr;
            p->avail_ptr += POOL_OBJ_SIZE / sizeof(pool_link_t);  // advance by one object
            if (--p->avail_count == 0) {
                if (p->avail_page_count) {  // prep the next page of the block
                    pool_page_t *page = (void *)((unsigned char *)p + POOL_BLOCK_SIZE - p->avail_page_count * PAGE_SIZE);
                    page->head = p;
                    p->avail_ptr = (void *)(page + 1);
                    p->avail_count = (PAGE_SIZE - sizeof(pool_block_t*)) / POOL_OBJ_SIZE;
                    p->avail_page_count--;
                }
            }
            return ptr;
        }
        if (p->next == startp) {
            pool_block_t *np = mmap(NULL, POOL_BLOCK_SIZE,
                                    PROT_READ | PROT_WRITE,
                                    MAP_ANON | MAP_PRIVATE, -1, 0);
            if (np == MAP_FAILED)
                return NULL;
            np->head = np;
            np->block_size = POOL_BLOCK_SIZE;
            // prep the first page of the block
            np->avail_page_count = POOL_BLOCK_SIZE / PAGE_SIZE - 1;
            np->avail_count = (PAGE_SIZE - sizeof(pool_block_t)) / POOL_OBJ_SIZE;
            np->avail_ptr = (void *)(np + 1);
            np->free_count = 0;
            np->free_list = NULL;
            // link the block in the arena
            np->prev = p;
            np->next = p->next;
            p->next = np->next->prev = np;
        }
    }
}

void free24(void *p) {
    pool_link_t *lp;
    if ((lp = p) != NULL) {
        pool_block_t *np = (void *)((uintptr_t)p & ~(PAGE_SIZE - 1));
        np = np->head;
        lp->next = np->free_list;
        np->free_list = lp;
        np->free_count++;
    }
}

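void trim_arena24(void) {
    pool_block_t *p, *next;
    pool24 = &dummy_arena;
    // walk every block; save the successor first because p may be unmapped
    for (p = dummy_arena.next; p != &dummy_arena; p = next) {
        next = p->next;
        // objects handed out so far: capacity of all prepared pages
        // minus the objects not yet carved from the current page
        size_t used_pages = POOL_BLOCK_SIZE / PAGE_SIZE - 1 - p->avail_page_count;
        size_t handed_out = (PAGE_SIZE - sizeof(pool_block_t)) / POOL_OBJ_SIZE +
            (PAGE_SIZE - sizeof(pool_block_t*)) / POOL_OBJ_SIZE * used_pages -
            p->avail_count;
        if (p->free_count == handed_out) {  // every object is back on the free list
            p->prev->next = p->next;
            p->next->prev = p->prev;
            munmap(p, p->block_size);
        }
    }
}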

void free_arena24(void) {
    pool_block_t *p;
    pool24 = &dummy_arena;
    while ((p = dummy_arena.next) != &dummy_arena) {
        dummy_arena.next = p->next;
        p->next->prev = p->prev;
        munmap(p, p->block_size);
    }
}

#define TRACE(s)  //s
#define TEST_COUNT (16 << 20)
static void *ptr[TEST_COUNT];

#ifdef BENCH_REF
#define malloc24()  malloc(24)
#define free24(p)   free(p)
#endif

int main(void) {
    int i;

    TRACE(printf("testing %d\n", TEST_COUNT));
    for (i = 0; i < TEST_COUNT; i++) {
        ptr[i] = malloc24();
        TRACE(printf("%d: malloc24() -> %p\n", i, ptr[i]));
    }
    for (i = 0; i < TEST_COUNT; i++) {
        int n = rand() % TEST_COUNT;
        if (ptr[n]) {
            TRACE(printf("%d: free24(%p)\n", n, ptr[n]));
            free24(ptr[n]);
            ptr[n] = NULL;
        }
    }
    for (i = 0; i < TEST_COUNT; i++) {
        if (!ptr[i]) {
            ptr[i] = malloc24();
            TRACE(printf("%d: malloc24() -> %p\n", i, ptr[i]));
        }
    }
    for (i = 0; i < TEST_COUNT; i++) {
        TRACE(printf("%d: free24(%p)\n", i, ptr[i]));
        free24(ptr[i]);
        ptr[i] = NULL;
    }
    TRACE(printf("trim_arena24()\n"));
    trim_arena24();
    if (pool24 != &dummy_arena) printf("pool24 != &dummy_arena\n");
    if (pool24->next != pool24) printf("pool24->next != pool24\n");
    if (pool24->prev != pool24) printf("pool24->prev != pool24\n");
    TRACE(printf("free_arena24()\n"));
    free_arena24();
    TRACE(printf("done\n"));
    return 0;
}
chqrlie
  • *Memory is quite cheap and abundant on current systems... specifying more RAM for your application is a simple and effective approach.* I never thought I would make such a statement, I have always strived to use memory and CPU sparingly, it must be a side effect of mandated confinement :) – chqrlie Apr 07 '20 at 14:47
  • Thanks chqrlie, I do need regular malloc(24) calls to return 16-byte aligned for the 3rd party lib to function. But I want to be able to allocate millions of 24-byte blocks (8-byte aligned) for my app's own use. If I set up jemalloc as 8-byte minimum alignment, I do get a 24-byte size class and my 24-byte allocations are contiguous. I'm trying to find how, if possible, I can do the same when I use 16-byte minimum alignment through some function such as posix_memalign() or aligned_alloc(). So far no luck. – LeoMurillo Apr 07 '20 at 16:39
  • @LeoMurillo: can you update the question with information about your target architecture? – chqrlie Apr 07 '20 at 17:17
  • @LeoMurillo: if only a subset of the objects need 16-byte alignment, you could try a pool based approach with 2 freelists, one for the 16-byte aligned objects and one for the other objects. The main problem is de-allocation: if you have global freelists, you won't be able to release memory to the system until all objects have been freed. If you use different freelists for each block, that's feasible but you need a quick way to locate the base of the pool block. – chqrlie Apr 07 '20 at 17:23
  • 1
    You can play tricks with the low order bits of the pointer values to locate the beginning of the page, where you would store a pointer to the beginning of the block of pages where you have the counters, the freelists and links to the previous and next block of pages. The blocks of pages would be mmapped, hence page aligned. – chqrlie Apr 07 '20 at 17:24
  • @LeoMurillo: I updated the answer with a sample pool allocator. 24-byte objects are allocated with `malloc24()` and must be freed with `free24()`. The pool can be trimmed with `trim_arena24()` and freed with `free_arena24()`. Caveat: It does not support multi-threading. – chqrlie Apr 07 '20 at 20:53
  • Thank you so much! I'll try this – LeoMurillo Apr 07 '20 at 22:17