I'm relatively familiar with how virtual memory works. All process memory is divided into pages, and every page of virtual memory maps to a page in real memory, to a page in the swap file, or it can be a new page, meaning no physical page has been allocated for it yet. The OS maps new pages to real memory on demand: not when an application asks for memory with malloc, but only when the application actually accesses a page of the allocated memory for the first time. But I still have questions.
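To make the on-demand part concrete, here is a minimal sketch (my own addition, assuming Linux/POSIX; it uses getrusage to count minor page faults before and after first touching a buffer, and the 64 MB size is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Returns the number of minor page faults taken by this process so far. */
static long minor_faults(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void) {
    size_t size = 64 * 1024 * 1024;  /* arbitrary test size */
    long base = minor_faults();

    char *p = malloc(size);          /* reserves virtual pages only */
    long after_malloc = minor_faults();

    memset(p, 1, size);              /* first touch faults the pages in */
    long after_touch = minor_faults();

    printf("faults during malloc: %ld\n", after_malloc - base);
    printf("faults during first touch: %ld\n", after_touch - after_malloc);
    free(p);
    return 0;
}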
I first ran into the cost of this while profiling my app with the Linux perf tool: about 20% of the time was taken by the kernel functions clear_page_orig, __do_page_fault and get_page_from_freelist. This was much more than I expected for this task, so I did some research.
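For reference, a profile like that can be collected roughly like this (the exact flags are my assumption, not the original invocation):

$ perf record -g ./a.out
$ perf report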
Let's start with a small example:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define SIZE 1 * 1024 * 1024

int main(int argc, char *argv[]) {
    int i;
    int sum = 0;
    int *p = (int *) malloc(SIZE);
    for (i = 0; i < 10000; i++) {
        memset(p, 0, SIZE);
        sum += p[512];
    }
    free(p);
    printf("sum %d\n", sum);
    return 0;
}
Let's assume that memset just stands in for some memory-bound processing. In this case, we allocate a small chunk of memory once and reuse it again and again. I'll run this program like this:
$ gcc -O1 ./mem.c && time ./a.out
-O1 is required because clang with -O2 entirely eliminates the loop and calculates the value instantly.
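If you want to keep -O2, one alternative (a sketch using a GCC/clang inline-asm barrier, which is my addition and not part of the original test) is to tell the compiler the buffer escapes, so the loop cannot be folded away:

for (i = 0; i < 10000; i++) {
    memset(p, 0, SIZE);
    /* An empty asm with a "memory" clobber makes the compiler assume the
       buffer may be read or written here, so the memset cannot be elided. */
    __asm__ __volatile__("" : : "r"(p) : "memory");
    sum += p[512];
}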
The results are: user: 0.520s, sys: 0.008s. According to perf, 99% of this time is in memset from libc. So in this case the write performance is about 20 GB/s (10,000 iterations × 1 MB ≈ 10 GB written in ~0.5 s), which is more than the theoretical 12.5 GB/s bandwidth of my memory. It looks like this is due to the L3 CPU cache.
Let's change the test and start allocating memory inside the loop (I won't repeat the identical parts of the code):
#define SIZE 1 * 1024 * 1024
for (i = 0; i < 10000; i++) {
    int *p = (int *) malloc(SIZE);
    memset(p, 0, SIZE);
    free(p);
}
The result is exactly the same. I believe that free doesn't actually release the memory to the OS; it just puts it on a free list within the process, and malloc on the next iteration gets exactly the same memory block back. That is why there is no noticeable difference.
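One way to check that hypothesis is to print the address malloc returns on each iteration; if it is the same every time, the block is being reused (a quick sketch, not part of the original test):

for (i = 0; i < 10; i++) {
    int *p = (int *) malloc(SIZE);
    printf("%p\n", (void *) p);  /* the same address each time means reuse */
    memset(p, 0, SIZE);
    free(p);
}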
Let's start increasing SIZE from 1 megabyte. Execution time grows little by little and saturates near 10 megabytes (for me there is no difference between 10 and 20 megabytes).
#define SIZE 10 * 1024 * 1024
for (i = 0; i < 1000; i++) {
    int *p = (int *) malloc(SIZE);
    memset(p, 0, SIZE);
    free(p);
}
Time shows: user: 1.184s, sys: 0.004s. perf still reports that 99% of the time is in memset, but throughput is now about 8.3 GB/s. At that point, I understand what is going on, more or less.
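That saturation point should roughly match the size of the L3 cache; on Linux it can be checked with getconf (assuming your glibc exposes the cache variables):

$ getconf LEVEL3_CACHE_SIZE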
If we continue to increase the memory block size, at some point (for me at 35 MB) execution time increases dramatically: user: 0.724s, sys: 3.300s.
#define SIZE 40 * 1024 * 1024
for (i = 0; i < 250; i++) {
    int *p = (int *) malloc(SIZE);
    memset(p, 0, SIZE);
    free(p);
}
According to perf, memset now consumes only 18% of the time.
Obviously, the memory is now allocated from the OS and freed back on each step. As I mentioned before, the OS must clear each allocated page before use. So 27.3% in clear_page_orig doesn't look extraordinary: it is just 4 s × 0.273 ≈ 1.1 s spent clearing memory, the same as we got in the third example. memset took 17.9%, which comes to ≈ 700 ms; that is plausible because after clear_page_orig the memory is already in the L3 cache (as in the first and second examples).
What I can't understand is why the last case is two times slower than one memset to memory plus one memset within the L3 cache. Can I do anything about it?
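One glibc-specific knob that looks relevant, though whether it actually helps here is my assumption rather than something I've verified: glibc serves large allocations via mmap and hands them straight back to the OS in free, and the threshold for that is dynamic with a default cap of 32 MB on 64-bit (if I read the glibc sources right), which would explain why the cliff appears near 35 MB. A sketch of pinning the thresholds higher (the 64 MB value is arbitrary):

#include <malloc.h>  /* glibc-specific: mallopt, M_MMAP_THRESHOLD, M_TRIM_THRESHOLD */

/* Keep blocks up to 64 MB on malloc's internal free lists instead of
   handing them back to the OS on every free. */
mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024);
mallopt(M_TRIM_THRESHOLD, 64 * 1024 * 1024);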
The results are reproducible (with small differences) on native Mac OS, on Ubuntu under VMware, and on an Amazon c4.large instance.
Also, I think there is room for optimization on two levels:
- at the OS level: if the OS knows that it is returning a page to the same application that it previously belonged to, it could skip clearing it (see the sketch after this list);
- at the CPU level: if the CPU knows that a page used to be free, it could skip clearing it in memory; it could clear it in the cache only, and move it to memory only after some processing has happened in the cache.
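As far as I know, Linux actually has a flag in the spirit of the first idea, but it is ignored on ordinary kernels for security reasons (handing out uncleared pages could leak another process's data). A sketch (the flag is honored only on kernels built with CONFIG_MMAP_ALLOW_UNINITIALIZED, typically embedded no-MMU systems):

#include <sys/mman.h>

#ifndef MAP_UNINITIALIZED            /* the libc headers may not define it */
#define MAP_UNINITIALIZED 0x4000000  /* value from <linux/mman.h> */
#endif

/* Ask for anonymous pages without the kernel zeroing them first; on a
   stock kernel the flag is silently ignored and pages arrive cleared. */
void *p = mmap(NULL, 40 * 1024 * 1024, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_UNINITIALIZED, -1, 0);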