
Playing around with new hardware, I wrote a piece of C code to test RAM speed and disk speed. Essentially it's 3 lines that write 5 GB into RAM and then write it out to a file, around which I set some timers:

long long int AMOUNT = 5*1024*1024*1024l;
FILE *file_handle = fopen("test.bin", "wb");
char *handle = malloc(AMOUNT);
memset(handle, 0, AMOUNT);              /* fill 5 GB of RAM */
fwrite(handle, AMOUNT, 1, file_handle); /* write it out to disk */

Then I tried it with dd using a ramdisk (tmpfs):

mount -t tmpfs /mnt/temp1 /mnt/temp1
dd if=/dev/zero of=/mnt/temp1/test bs=1M

and, back to disk:

dd if=/mnt/temp1/test of=/home/user/test bs=1M

In the table below are my results; I also included the speed reported by memtest 7.5. I don't understand the 9.0 figure, nor the big difference between memtest and the other numbers... Can anyone explain this?

Results

Niels
  • You can blame software and OS overhead. I tried dd on my machine and got 4.4 GB/s, which is many times slower than the CPU is capable of (multithreaded memcpy). – Anty Oct 05 '18 at 12:39
  • @Lundin I don't see where the asker wasn't "humble" or where they tossed "blame" around. – nullp0tr Oct 05 '18 at 14:09
  • @nullp0tr Sorry, that was directed to Anty in the comment above mine. – Lundin Oct 05 '18 at 14:10
  • I'm still wondering why the malloc is so much faster on the 6600U with 2133 MT/s memory – Niels Oct 05 '18 at 17:58
  • Differences in compiler were ruled out... INT instead of LONG LONG INT was ruled out... just don't get it – Niels Oct 05 '18 at 18:00
  • So, apparently, when using a different kernel (4.15.0-29 instead of 4.8.0-58) on the 6600U machine, the malloc speed is also almost 3 Gb/sec, like on the 7500T, which already had 4.15.0-36. So, it's something in the kernel. – Niels Oct 05 '18 at 20:44

2 Answers


There are a lot of factors at play, and I'm not able to list most of them, let alone explain them all. But here's a little glimpse of some of the things happening in the background:

Virtual Memory

On most modern user systems you don't actually have direct access to RAM. There are multiple layers of indirection, one of them being virtual memory. Virtual memory is memory that your process accesses as if it were normal contiguous RAM, but which the underlying layers actually translate to the proper address in physical RAM. So trying to use a virtual address as if it were a physical one would almost certainly not give you the data you were looking for.

Virtual memory also has layers. Modern processors include native support for virtual memory, and it is often controlled by an MMU near or on the same die as the processor.

A lot of OSs also have their own layer of virtual memory, which they then translate either to the MMU-managed virtual memory on the processor or directly to physical RAM.

Just as an example of how far the rabbit hole goes: Linux actually has lazy memory allocation. When you first allocate memory, it is not communicated to the CPU but only recorded in a kernel data structure. When you later access that memory, the CPU generates a page fault. The kernel's page fault handler then checks whether the memory was lazily allocated and, if so, actually allocates it.
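
A rough way to see this in action (just a sketch, assuming a Linux system with glibc; the 1 GiB size and the clock_gettime timing are only for illustration) is to time the malloc call separately from the first write to the memory:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double seconds(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    size_t amount = 1024ULL * 1024 * 1024;   /* 1 GiB */
    struct timespec t0, t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    char *buf = malloc(amount);              /* typically just reserves address space */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (buf == NULL)
        return 1;
    memset(buf, 0, amount);                  /* first touch: page faults map in real RAM */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    printf("malloc: %f s, first touch (memset): %f s\n",
           seconds(t0, t1), seconds(t1, t2));
    free(buf);
    return 0;
}

On a typical Linux desktop the malloc line finishes in microseconds, while the memset takes a noticeable fraction of a second, because the actual page allocation only happens on first access.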

Kernel Space vs User Space

Userspace programs aren't allowed to modify physical memory directly; on *nixes they make system calls to have the kernel do it for them. A system call changes the operating mode of the CPU and is often a relatively slow operation.
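
As a very small illustration of that cost (a sketch, not a proper benchmark; it assumes a POSIX system and uses /dev/null as a sink), you can time a loop of 1-byte write() calls, each of which enters the kernel once:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return 1;

    char c = 0;
    struct timespec t0, t1;
    const int n = 100000;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
        if (write(fd, &c, 1) != 1)           /* one user/kernel mode switch per call */
            return 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / n;
    printf("about %.0f ns per write() call\n", ns);

    close(fd);
    return 0;
}

Compare that to a plain function call or a cached memory access, which are on the order of a nanosecond or less.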

Library Functions

Library functions like malloc have to do a lot of bookkeeping to make sure that when you call free on a pointer, only that allocation is freed. But they also allocate in bulk: malloc on *nixes calls the mmap syscall to allocate a page, and subsequent malloc calls continue to use that page until more is needed.
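
To get a feel for that (a small sketch; the 64 MiB size is arbitrary, and the exact behaviour and mmap threshold are glibc-specific), you can print the addresses malloc hands out: small allocations come out of the same region, while a large one usually gets its own mapping:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Small allocations are carved out of pages malloc already holds,
       so their addresses are close together. */
    for (int i = 0; i < 4; i++)
        printf("small #%d: %p\n", i, malloc(32));

    /* A large allocation is typically served by a separate mmap call,
       so its address lies in a different part of the address space. */
    printf("large:    %p\n", malloc(64 * 1024 * 1024));

    return 0;   /* memory deliberately not freed; this is only a demo */
}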

How does this relate to this question?

The above is only a glimpse of the things happening when you're working with memory. How you allocate the memory, in what quantity, and with what flags you pass to the system can change a lot of things, and may explain the discrepancies between the results.

Suggestion

Try running strace on those processes to see where they spend most of their time!
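
For example (assuming the test program is compiled to ./a.out, which is just a placeholder name here; -c prints a per-syscall time summary, and the memory trace class shows the mmap/brk calls made on malloc's behalf):

strace -c ./a.out
strace -e trace=memory ./a.out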

nullp0tr

Your whole experiment falls flat on its face, because you haven't realized how expressions and type promotions work in C.

5*1024*1024*1024l consists of 4 operands, each with type int. What type you store the result in is irrelevant to how the calculation is carried out: the multiplications are done on int, since the operands of each * are of type int.

On a mainstream 32- or 64-bit two's complement system, an int can hold values up to 2^31 - 1 = 2147483647, roughly 2.1 billion. So this expression overflows, and you invoke an undefined behaviour bug. What happens from there on in your program isn't meaningful to discuss.

Change all integer constants to have a ULL suffix, then start over.
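
For example (a minimal standalone sketch; since * associates left-to-right, a suffix on the leftmost constant is already enough to make the whole chain evaluate as unsigned long long):

#include <stdio.h>

int main(void)
{
    unsigned long long amount = 5ULL * 1024 * 1024 * 1024;  /* no intermediate result is ever an int */

    printf("%llu\n", amount);   /* prints 5368709120, i.e. 5 GiB */
    return 0;
}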

Lundin
  • The last one has an 'l' suffix, which made the overflow error disappear, so I think I'm good. – Niels Oct 05 '18 at 13:57
  • @Niels Aah, that's why we write `L` and not `l`, the latter looks like the digit `1` (one) in some fonts like Courier. Anyway, `*` associates left-to-right, so the calculations before `1024L` are carried out on `int`, and only the final calculation, 5.24 million * 1024, is done on `long`. Which is most often 4 bytes too, and not 8 bytes. So the bug is still there and the suffix didn't fix it. – Lundin Oct 05 '18 at 14:14
  • The speed differences using malloc stay the same when using 1024*1024*1024, which is within max_value(int) – Niels Oct 05 '18 at 14:21