Poor memcpy performance in user space for mmap'ed physical memory in Linux

Question

Of 192GB RAM installed on my computer, I have 188GB RAM above 4GB (at hardware address 0x100000000) reserved by the Linux kernel at boot time (mem=4G memmap=188G$4G). A data acquisition kernel modules accumulates data into this large area used as a ring buffer using DMA. A user space application mmap's this ring buffer into user space, then copies blocks from the ring buffer at the current location for processing once they are ready.

Copying these 16MB blocks from the mmap'ed area using memcpy does not perform as I expected. It appears that the performance depends on the size of the memory reserved at boot time (and later mmap'ed into user space). http://www.wurmsdobler.org/files/resmem.zip contains the source code for a kernel module which does implements the mmap file operation:

module_param(resmem_hwaddr, ulong, S_IRUSR);
module_param(resmem_length, ulong, S_IRUSR);
//...
static int resmem_mmap(struct file *filp, struct vm_area_struct *vma) {
remap_pfn_range(vma, vma->vm_start,
    resmem_hwaddr >> PAGE_SHIFT,
    resmem_length, vma->vm_page_prot);
return 0;
}

and a test application, which does in essence (with the checks removed):

#define BLOCKSIZE ((size_t)16*1024*1024)
int resMemFd = ::open(RESMEM_DEV, O_RDWR | O_SYNC);
unsigned long resMemLength = 0;
::ioctl(resMemFd, RESMEM_IOC_LENGTH, &resMemLength);
void* resMemBase = ::mmap(0, resMemLength, PROT_READ | PROT_WRITE, MAP_SHARED, resMemFd, 4096);
char* source = ((char*)resMemBase) + RESMEM_HEADER_SIZE;    
char* destination = new char[BLOCKSIZE];
struct timeval start, end;
gettimeofday(&start, NULL);
memcpy(destination, source, BLOCKSIZE);
gettimeofday(&end, NULL);
float time = (end.tv_sec - start.tv_sec)*1000.0f + (end.tv_usec - start.tv_usec)/1000.0f;
    std::cout << "memcpy from mmap'ed to malloc'ed: " << time << "ms (" << BLOCKSIZE/1000.0f/time << "MB/s)" << std::endl;

I have carried out memcpy tests of a 16MB data block for the different sizes of reserved RAM (resmem_length) on Ubuntu 10.04.4, Linux 2.6.32, on a SuperMicro 1026GT-TF-FM109:

|      |         1GB           |          4GB           |         16GB           |        64GB            |        128GB            |         188GB
|run 1 | 9.274ms (1809.06MB/s) | 11.503ms (1458.51MB/s) | 11.333ms (1480.39MB/s) |  9.326ms (1798.97MB/s) | 213.892ms (  78.43MB/s) | 206.476ms (  81.25MB/s)
|run 2 | 4.255ms (3942.94MB/s) |  4.249ms (3948.51MB/s) |  4.257ms (3941.09MB/s) |  4.298ms (3903.49MB/s) | 208.269ms (  80.55MB/s) | 200.627ms (  83.62MB/s)

My observations are:

From the first to the second run, memcpy from mmap'ed to malloc'ed seems to benefit that the contents might already be cached somewhere.
There is a significant performance degradation from >64GB, which can be noticed both when using a memcpy.

I would like to understand why that so is. Perhaps somebody in the Linux kernel developers group thought: 64GB should be enough for anybody (does this ring a bell?)

Kind regards, peter

score 2 · Answer 1 · answered Apr 24 '12 at 14:48

Based on feedback from SuperMicro, the performance degradation is due to NUMA, non-uniform memory access. The SuperMicro 1026GT-TF-FM109 uses the X8DTG-DF motherboard with one Intel 5520 Tylersburg chipset at its heart, connected to two Intel Xeon E5620 CPUs, each of which has 96GB RAM attached.

If I lock my application to CPU0, I can observe different memcpy speeds depending on what memory area was reserved and consequently mmap'ed. If the reserved memory area is off-CPU, then mmap struggles for some time to do its work, and any subsequent memcpy to and from the "remote" area consumes more time (data block size = 16MB):

resmem=64G$4G   (inside CPU0 realm):   3949MB/s  
resmem=64G$96G  (outside CPU0 realm):    82MB/s  
resmem=64G$128G (outside CPU0 realm):  3948MB/s
resmem=92G$4G   (inside CPU0 realm):   3966MB/s            
resmem=92G$100G (outside CPU0 realm):    57MB/s

It nearly makes sense. Only the third case, 64G$128, which means the uppermost 64GB also yield good results. This contradicts somehow the theory.

Regards, peter

If each NUMA domain has 96 GiB of RAM then the "64g$128G" test would be half in the first NUMA domain and half in the second NUMA domain. The last test ("92G$100G") is likely to be excessively dangerous - firmware typically reserves some RAM for various things and RAM at the top may be used for things like SMM state save areas and not free/usable RAM that random software can trash. — Brendan, May 22 '15 at 01:11

score 1 · Answer 2 · answered Apr 19 '12 at 22:35

1

Your CPU probably doesn't have enough cache to deal with it efficiently. Either use lower memory, or get a CPU with a bigger cache.

answered Apr 19 '12 at 22:35

Ignacio Vazquez-Abrams

776,304
153
1,341
1,358

Hello Ignacio, you may be right. The computer is fitted with two Intel Xeon E5620 2.4GHz quad core with 12MB L3 cache each and 1066MHz memory speed. – PeterW Apr 20 '12 at 09:05
My simplistic view is that for the first read operation the contents in RAM would be cached, and the second request would be served directly from cache, as long as the amount fits into the cache. I would have thought that the amount of data transferred would have an effect on the memcpy, in my case < 12MB, but not the total size of memory fitted or where in RAM the data is. – PeterW Apr 20 '12 at 09:12
Further tests have shown that the same performance degradation shows for smaller data blocks, e.g. 1MB. I only seems to depend on the amount of memory reserved at boot time, i.e. not more that 64GB. – PeterW Apr 20 '12 at 09:25

Poor memcpy performance in user space for mmap'ed physical memory in Linux

2 Answers2

Linked