
While benchmarking small-size-class allocations in jemalloc-5.2.0, I found that allocating 4096 bytes takes significantly longer than allocating other small size classes. Does jemalloc do anything special for 4096-byte allocations, or is there some other explanation?

Test results:

Run on (32 X 3400 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 256 KiB (x16)
  L3 Unified 20480 KiB (x2)
Load Average: 15.72, 14.21, 14.26
-----------------------------------------------------------------------------------------
Benchmark                                               Time             CPU   Iterations
-----------------------------------------------------------------------------------------
BM_SomeFunction/1792/iterations:500/threads:24      0.095 ms         2.12 ms        12000
BM_SomeFunction/1856/iterations:500/threads:24      0.175 ms         4.10 ms        12000
BM_SomeFunction/1920/iterations:500/threads:24      0.178 ms         4.13 ms        12000
BM_SomeFunction/1984/iterations:500/threads:24      0.177 ms         4.14 ms        12000
BM_SomeFunction/2048/iterations:500/threads:24      0.181 ms         4.18 ms        12000
BM_SomeFunction/2048/iterations:500/threads:24      0.177 ms         4.16 ms        12000
BM_SomeFunction/2176/iterations:500/threads:24      0.116 ms         2.67 ms        12000
BM_SomeFunction/2304/iterations:500/threads:24      0.113 ms         2.64 ms        12000
BM_SomeFunction/2432/iterations:500/threads:24      0.118 ms         2.75 ms        12000
BM_SomeFunction/2560/iterations:500/threads:24      0.113 ms         2.65 ms        12000
BM_SomeFunction/2560/iterations:500/threads:24      0.114 ms         2.68 ms        12000
BM_SomeFunction/2688/iterations:500/threads:24      0.133 ms         3.13 ms        12000
BM_SomeFunction/2816/iterations:500/threads:24      0.132 ms         3.08 ms        12000
BM_SomeFunction/2944/iterations:500/threads:24      0.131 ms         3.09 ms        12000
BM_SomeFunction/3072/iterations:500/threads:24      0.132 ms         3.10 ms        12000
BM_SomeFunction/3072/iterations:500/threads:24      0.132 ms         3.11 ms        12000
BM_SomeFunction/3200/iterations:500/threads:24      0.117 ms         2.72 ms        12000
BM_SomeFunction/3328/iterations:500/threads:24      0.113 ms         2.66 ms        12000
BM_SomeFunction/3456/iterations:500/threads:24      0.111 ms         2.61 ms        12000
BM_SomeFunction/3584/iterations:500/threads:24      0.112 ms         2.63 ms        12000
BM_SomeFunction/3584/iterations:500/threads:24      0.112 ms         2.63 ms        12000
BM_SomeFunction/3712/iterations:500/threads:24      0.271 ms         6.35 ms        12000
BM_SomeFunction/3840/iterations:500/threads:24      0.270 ms         6.35 ms        12000
BM_SomeFunction/3968/iterations:500/threads:24      0.274 ms         6.42 ms        12000
BM_SomeFunction/4096/iterations:500/threads:24      0.276 ms         6.49 ms        12000
BM_SomeFunction/4096/iterations:500/threads:24      0.273 ms         6.41 ms        12000
BM_SomeFunction/4352/iterations:500/threads:24      0.151 ms         3.53 ms        12000
BM_SomeFunction/4608/iterations:500/threads:24      0.146 ms         3.45 ms        12000
BM_SomeFunction/4864/iterations:500/threads:24      0.142 ms         3.36 ms        12000
BM_SomeFunction/5120/iterations:500/threads:24      0.144 ms         3.40 ms        12000
BM_SomeFunction/5120/iterations:500/threads:24      0.146 ms         3.40 ms        12000
BM_SomeFunction/5376/iterations:500/threads:24      0.196 ms         4.57 ms        12000
BM_SomeFunction/5632/iterations:500/threads:24      0.187 ms         4.39 ms        12000
BM_SomeFunction/5888/iterations:500/threads:24      0.191 ms         4.47 ms        12000
BM_SomeFunction/6144/iterations:500/threads:24      0.188 ms         4.39 ms        12000

How to read the report:

BM_SomeFunction/1792/iterations:500/threads:24      0.095 ms         2.12 ms        12000

means that one iteration (a batch of 10,000 allocations of 1792 bytes each, see the test code below) consumed 2.12 ms of CPU time.

Test code

#include <vector>

#include "benchmark/benchmark.h"
#include "jemalloc/jemalloc.h"

static const size_t kBatchSize = 10000;

static void alloc_mem_n(size_t size) {
    std::vector<char*> kVec(kBatchSize, nullptr);
    for (size_t i = 0; i < kBatchSize; ++i) {
        auto p = new char[size];
        p[0] = static_cast<char>(i);  // touch the block so the allocation is not optimized away
        benchmark::ClobberMemory();
        kVec[i] = p;
    }
    for (auto& p : kVec) {
        delete[] p;  // array form must match new char[size]
        p = nullptr;
    }
}

static void BM_SomeFunction(benchmark::State& state) {
    for (auto _ : state) {
        alloc_mem_n(state.range(0));
    }
}


BENCHMARK(BM_SomeFunction)
    ->Unit(benchmark::kMillisecond)
    ->Iterations(500)
    ->Threads(24)
    ->DenseRange(1792, 2048, 64)
    ->DenseRange(2048, 2560, 128)
    ->DenseRange(2560, 3072, 128)
    ->DenseRange(3072, 3584, 128)
    ->DenseRange(3584, 4096, 128)
    ->DenseRange(4096, 5120, 256)
    ->DenseRange(5120, 6144, 256);

BENCHMARK_MAIN();
HsuehYH
  • I guess that to allocate 4096 bytes, malloc will allocate a new memory page (which is typically 4 KiB) into virtual memory. This is more flexible (with no fragmentation) but certainly slower. – prapin Apr 18 '21 at 09:46
  • If my guess is correct, you should see the same performance for multiples of 4096 bytes like 8192 or 16384. – prapin Apr 18 '21 at 09:49
  • Allocating multiples of 4 KiB (8 KiB, 12 KiB) does show a similar pattern, but allocating 4 KiB is slower than allocating 5 KiB, which does not make sense. – HsuehYH Apr 19 '21 at 06:33
  • It can make sense. `jemalloc` certainly manages a heap that is preallocated and probably a few MB big. To allocate some random size, it will try to find a hole inside it big enough, mark it as allocated and return. This is in essence how an allocator works. But I am pretty sure that `jemalloc` does something completely different for multiples of 4 KiB. In these cases, the best way to avoid fragmentation is to allocate *new pages* from the OS *outside* of the main heap. As this implies setting up CPU hardware, it can well be slower than a regular hole lookup. – prapin Apr 19 '21 at 20:25

1 Answer


jemalloc serves small size classes from slabs, and the size of a slab is the least common multiple of the size class and the page size.

In this case, allocated_size == page_size == slab_size == 4096, so a slab can satisfy exactly one 4096-byte allocation. Every allocation therefore has to provision a fresh slab (and every free retires one), which is why this size class is so much slower than its neighbors.

Zek Li