3

I am doing C code, and I have several (millions) of malloc's each requesting for 20-30 bytes of memory.

As result the overhead of both GNU C Malloc and Jemalloc go to 40-50%. DL Malloc works better, but still ~30% overhead.

Is there a way to do malloc without any alignment / padding? I understand this will be slower and probably on some CPU's C need to "reconstruct" data from different words, but I am ready to trade speed for memory usage.

Instead of malloc I can use memory pool too, as long as it can reuse memory after free().

Nick
  • 9,962
  • 4
  • 42
  • 80
  • Is there a hard bound on the size of the blocks you are allocating, and how much higher is it than the average size? – Pascal Cuoq Jan 23 '15 at 12:29
  • no, but I plan to do some logic and to pass everything over say 100 bytes to standard malloc – Nick Jan 23 '15 at 12:48
  • Does your workload mix allocations and deallocations? Do you use threads? You could easily implement a simple pool allocator for each desired size, but deallocations become slow with many sizes. Specifying the size at deallocation time (`free(ptr, size)`) helps; then you only need to scan size-compatible pools, not all pools. If you don't deallocate individual elements but entire pools at a time, you could make allocation very fast and thread-safe using atomic built-ins. – Nominal Animal Jan 23 '15 at 14:49
  • no threads. generally when software is ready there will be dealocations. however mentioned overhead of 50% is without any dealocation and only reason is 16 bit padding. for example malloc of 17 bytes is rounded to 32 + 8 metadata = 40 bytes. – Nick Jan 23 '15 at 15:52

2 Answers2

4

malloc() et al. are required by the C standard to provide sufficiently aligned pointers for any data type, so to reduce the allocation overheads, you need to implement your own allocator (or use some existing one).

One possibility would be to have a pool chain for each possible allocation size. Within each pool, you can use a bit map to track which items are allocated and which free. The overhead is just over one bit per item, but you may end up having lots of pool chains; this tends to make free() slow, because it has to search for the correct pool.

A better possibility, I think, is to create a pool chain for small allocations. Within each pool, the small chunks form a linked list, with a single unsigned char keeping track of the length and allocation status. This yields unaligned pointers, but at a overhead of just over one char.

For example:

#include <stdlib.h>
#include <string.h>
#include <limits.h>
#include <errno.h>
#include <stdio.h>

#define SMALL_POOL_SIZE 131072
#define SMALL_LIMIT     ((UCHAR_MAX + 1U) / 2U)

struct small_pool {
    struct small_pool *next;
    unsigned int       size;
    unsigned int       used;
    unsigned char      data[];
};

static struct small_pool *list = NULL;

Within the data[] member, the first char is 1 to SMALL_LIMIT-1 for a free chunk of that size, or SMALL_LIMIT+1 or larger for an used chunk of 1 char or more. Zero and SMALL_LIMIT indicate errors. A space-aggressive allocator could be implemented for example as

void *small_alloc(const size_t size)
{
    struct small_pool *pool;

    if (size < 1 || size >= SMALL_LIMIT) {
        errno = EINVAL;
        return NULL;
    }

    pool = list;

    while (pool != NULL) {

        /* Unused space in the pool? */
        if (pool->used + size < pool->size) {
            unsigned char *const ptr = pool->data + pool->used;

            /* Grab it. */
            pool->used += size + 1U;

            /* Create new slot. */
            (*ptr) = size + SMALL_LIMIT;

            /* Done. */
            return ptr + 1U;
        }

        /* Check the free slots in the pool. */
        {
            unsigned char *const end = pool->data + pool->used;
            unsigned char       *ptr = pool->data;
            unsigned char    big_len = SMALL_LIMIT;
            unsigned char   *big_ptr = NULL;

            while (ptr < end)
                if (*ptr == 0U || *ptr == SMALL_LIMIT) {
                    /* Invalid pointer */
                    errno = EDOM;
                    return NULL;
                } else
                if (*ptr > SMALL_LIMIT) {
                    /* Used slot, skip */
                    ptr += (*ptr) - SMALL_LIMIT + 1U;
                    continue;
                } else {
                if (*ptr < size) {
                    /* Slot is too small, skip it */
                    ptr += (*ptr) + 1U;
                    continue;
                } else
                if (*ptr == size) {
                    /* Perfect slot; grab it. */
                    (*ptr) = size + SMALL_LIMIT;
                    return ptr + 1U;
                } else
                    /* Remember smallest of the large enough slots */
                    if (*ptr < big_len) {
                        big_len = *ptr;
                        big_ptr = ptr;
                    }
                    ptr += (*ptr) + 1U;
                }

            if (big_ptr != NULL) {
                (*big_ptr) = big_len + SMALL_LIMIT;
                return big_ptr + 1;
            }
        }

        /* Check the next pool. */
        pool = pool->next;
    }

    /* Need a new pool. */
    pool = malloc(SMALL_POOL_SIZE);
    if (pool == NULL) {
        errno = ENOMEM;
        return NULL;
    }

    /* Initialize pool; use the initial slot for the new allocation. */
    pool->used = size + 1;
    pool->size = SMALL_POOL_SIZE - sizeof (struct small_pool);
    pool->data[0] = size + SMALL_LIMIT;

    /* Prepend this pool to the pool chain. */
    pool->next = list;
    list = pool;

    /* Return the slot we used. */
    return pool->data + 1;
}

It has a simple strategy: If there is unused trailing space in the pool, use it. Otherwise, scan through the pool to find the unused slots. If there is a perfectly-sized slot, use it; otherwise, use the smallest sufficiently large unused slot.

There are many improvements possible. For example, you could keep full pools in a separate list, to avoid scanning through them. It might also be a good idea to move the pool where a free slot was found in, to the start of the pool list.

The deallocation is more complicated. If we have relatively few deallocations amidst allocations, and don't worry about freeing entire pools, the deallocation could be as simple as

int small_free(void *const item)
{
    if (item == NULL)
        return 0;
    else {
        struct small_pool *pool = list;

        while (pool != NULL && !((unsigned char *)item > pool->data && (unsigned char *)item < pool->data + pool->used))
            pool = pool->next;

        if (pool != NULL) {
            unsigned char *const ptr = (unsigned char *)item - 1;

            if (*ptr > SMALL_LIMIT)
                (*ptr) -= SMALL_LIMIT;

            return 0;
        }

        return ENOENT;    
    }
}

assuming you need the function to return ENOENT in case the allocation is not actually a small one. If it is important to verify the pointer to be deallocated is valid, say for debugging, then

int small_free(void *const item)
{
    if (item == NULL)
        return 0;
    else {
        struct small_pool *pool = list;

        while (pool != NULL && !((unsigned char *)item > pool->data && (unsigned char *)item < pool->data + pool->used))
            pool = pool->next;

        if (pool != NULL) {
            unsigned char *const end = pool->data + pool->used;
            unsigned char       *ptr = pool->data;

            while (ptr < end)
                if (*ptr == 0U || *ptr == SMALL_LIMIT)
                    return EDOM;
                else
                if (*ptr < SMALL_LIMIT) {
                    size_t len = (*ptr) + 1U;

                    /* Coalesce consecutive slots, if possible. */
                    while (len + ptr[len] < SMALL_LIMIT) {
                        (*ptr) = len + ptr[len];
                        len = (*ptr) + 1U;
                    }

                    ptr += len;

                } else {
                    const size_t len = (*ptr) + 1U - SMALL_LIMIT;

                    /* Within the current slot.. probably should just
                     * compare item to ptr+1 instead. */
                    if ((unsigned char *)item > ptr && (unsigned char *)item < ptr + len) {
                        *ptr = len - 1U;
                        return 0;
                    }

                    ptr += len;
                }
        }

        return ENOENT;    
    }
}

Even this latter version does not trim ->used when the last chunk(s) in a pool are freed, nor does it free completely unused pools either. In other words, the above deallocation is only a rough example.

Speed-wise, the above seem to be at least an order of magnitude slower than GLIBC malloc()/free() on my machine. Here is a simple test to check for linear allocation - semirandom deallocation pattern:

/* Make sure this is prime wrt. 727 */
#define POINTERS 1000000

int main(void)
{
    void **array;
    size_t i;

    fprintf(stderr, "Allocating an array of %lu pointers: ", (unsigned long)POINTERS);
    fflush(stderr);
    array = malloc((size_t)POINTERS * sizeof array[0]);
    if (array == NULL) {
        fprintf(stderr, "Failed.\n");
        return EXIT_FAILURE;
    }
    fprintf(stderr, "Done.\n\n");
    fprintf(stderr, "Allocating pointers in varying sizes.. ");
    fflush(stderr);
    for (i = 0; i < POINTERS; i++) {
        const size_t n = 1 + ((i * 727) % (SMALL_LIMIT - 1));
        if (!(array[i] = small_alloc(n))) {
            if (errno == EDOM)
                fprintf(stderr, "Failed at %lu; corrupted list.\n", (unsigned long)i + 1UL);
            else
                fprintf(stderr, "Failed at %lu: %s.\n", (unsigned long)i + 1UL, strerror(errno));
            return EXIT_FAILURE;
        }
    }

    fprintf(stderr, "Done.\n\n");
    fprintf(stderr, "Deallocating pointers in a mixed order.. ");
    fflush(stderr);

    for (i = 0; i < POINTERS; i++) {
        const size_t p = (i * 727) % POINTERS;
        if (small_free(array[p])) {
            if (errno == EDOM)
                fprintf(stderr, "Failed at %lu: corrupted list.\n", (unsigned long)i + 1UL);
            else
                fprintf(stderr, "Failed at %lu: %s.\n", (unsigned long)i + 1UL, strerror(errno));
            return EXIT_FAILURE;
        }
    }

    fprintf(stderr, "Done.\n\n");
    fprintf(stderr, "Deallocating the pointer array.. ");
    fflush(stderr);

    free(array);

    fprintf(stderr, "Done.\n\n");
    fflush(stderr);

    return EXIT_SUCCESS;
}

The true strength of a pool-based allocator, in my opinion, is that you can release entire pools at once. With that in mind, perhaps your workload could use a specialized allocator when building your structures, with a compactor phase (capable of adjusting pointers) run at least at the end, and possibly during the construction too if a sufficient number of deletions have been done. That approach would allow you to release the temporarily needed memory back to the OS. (Without compaction, most pools are likely to have at least one allocation left, making it impossible to release it.)

I don't think this is a good answer at all, but a better answer would require more details, specifically about the data structures stored, and the allocation/deallocation pattern that occurs in practice. In the absence of those, I hope this gives at least some ideas on how to proceed.

Nominal Animal
  • 38,216
  • 5
  • 59
  • 86
1

It's not necessarily slower. If the blocks are fixed sizes (or a limited range of sizes) or you in fact allocate and unallocate in a logical sequence (FIFO / FILO) you can often improve performance and memory management by handling a 'pool'.

There is a boost library that implements which may or may not suit your needs.

http://www.boost.org/doc/libs/1_57_0/libs/pool/doc/html/boost_pool/pool/interfaces.html

Alternatively do it yourself - allocate big chunks of memory and carve it up yourself.

Please note it may not be just slower but may simply fail. Many aligned machines will 'go haywire' when asked to load out of alignment addresses For example, actually load the rounded down address which will contain random data before the 'correct' value. There's definitely no obligation on compilers to 'reconstruct' values and I would say it's more common that they don't than do.

So for portability you may need to use memcpy() to move out of alignment data into alignment before use. That doesn't necessarily have all the overhead you think as some compilers inline memcpy().

So pool managed allocation can often be faster (potentially much faster) but memcpy() may be slower (though maybe not so much slower).

It's swings and roundabouts.

Persixty
  • 8,165
  • 2
  • 13
  • 35