
I'm finding myself bandwidth-constrained in a parallel computing application, and I've profiled the program during execution. The critical data is, as expected, in one contiguous block, but the memory dumps show it always ends up entirely or mostly on one stick of RAM. That would be fine if we had 60 GB/s RAM, but we don't.

Someone has to have solved the problem of multi-channel allocations. It's way too common a problem in HPC.

patrickjp93
  • There's nothing in the 1400 pages that make up the current C++ standard that refers to memory sticks or to allocating memory from an individual RAM stick. If anything, this would be operating-system-specific functionality, but I regret to inform you that there is no single uniform API for this kind of functionality that works on every operating system in the world. It should've been obvious to you that without specifying the operating system, your question cannot be answered. – Sam Varshavchik Oct 13 '16 at 22:36
  • 2
    And how do you suppose the C++ compiler is going to know at compilation time how many sticks of RAM the machine executing the result is going to have? – Adrian Colomitchi Oct 13 '16 at 22:36
  • What kind of hardware layout do you have -- One processor? Multiple processors? Multiple threads? NUMA? And what's your operating system? – Cameron Oct 13 '16 at 22:36
  • Adrian, I'm already querying the system for maximum thread count and memory layout (std::concurrency and std::experimental provide this functionality in OS-agnostic libraries). Sam, by your logic there should be no library that works the same way on every OS. We know that is not the case. I am on Linux, but cut the cynicism here. Cameron, let's start with 1 multicore processor on its own dedicated memory channels under Linux. The same techniques that work here will work for NUMA later. I know access patterns well enough. Let's focus on allocation. – patrickjp93 Oct 13 '16 at 22:52
  • 1
    Physical RAM allocation is under the control of the OS, the application has no control over that. If the system has any swapping occurring it might even change over time. You're assuming that there is a library which gives access to OS level controls for this, but I'd be very surprised if that's the case. – Mark Ransom Oct 13 '16 at 22:57
  • Mark, there has to be, because otherwise you couldn't write OS kernel code at all. – patrickjp93 Oct 13 '16 at 23:01
  • Patrick, you missed what Mark said. The OS can do this. The OS does not necessarily expose the ability for programs to do this. – user4581301 Oct 13 '16 at 23:25
  • I didn't miss it. I'm merely saying since C/C++ code can do this for the OS, there must be a way to allocate memory without system calls. I don't care how hairy it is as long as it exists. I suppose I could go to the trouble of wrapping raw assembly, but this should be a solved problem by now. – patrickjp93 Oct 13 '16 at 23:59
  • 2
    Kernel code runs at a lower level than application code. A kernel has direct access to the hardware. In modern computers, the OS blocks user applications from directly accessing hardware. – Remy Lebeau Oct 14 '16 at 00:48
  • That's true for buses to drives, networking interfaces, PCIe devices, and chipsets, but it literally can't be true for memory. If I launch a program that does nothing but increment a register and LEA using it, I can iterate over raw memory and load it into registers. It'll do it just fine with direct memory access until it runs into the secure area with the OS data and an exception is thrown. If what you claimed was true, the CPU would have to make a system call every time it wanted to load a new cache line, and this simply is not the case. The question is how to tell the OS memory is off-limits – patrickjp93 Oct 14 '16 at 01:01
  • Expanding on the previous... The only question is how to tell the OS at that point that those two directly stored memory regions are in use by the program at hand to ensure there is maximum bandwidth potential. – patrickjp93 Oct 14 '16 at 01:02
  • "...LEA using it, I can iterate over raw memory and load it into registers." Not really. Say `malloc` returns the address `0xbaff1ed0`. You can load/store at **what you think** is addr `0xbaff1ed0`, but due to virtual memory and the MMU, the actual physical memory could be anywhere in memory at offset `0x00000ed0` from any 4 KB boundary (assuming your virtual memory pages are 4 KB). And it can move around. Normally it wouldn't, but if it gets temporarily swapped out to disk, when it's paged back in it will likely be in a different physical location, and your application won't even know. – phonetagger Oct 14 '16 at 17:55
  • I just said I'm not using `malloc`. `malloc` makes a system call, and if you look at the assembly it generates, you will see that. If I just write inline assembly that iterates over memory, it completely circumvents the operating system, at least until it runs into kernel space and the CPU throws an exception. Now, I can write bytes wherever as well, but if I can't know whether the memory is allocated to another process, or can't tell the OS this memory is now allocated to my program, it's not useful to me. In a real HPC environment I can shut off paging completely, and that has benefits. – patrickjp93 Oct 15 '16 at 04:50
  • 2
    The OS provides the mapping from physical memory to the *address space* of an application. You may *think* you're incrementing through physical memory, but that's not the case at all - you're incrementing through the mapped address space. Switching to assembly doesn't change anything. Unless the processor is in kernel mode, it simply doesn't have access to the mapping registers. – Mark Ransom Oct 16 '16 at 03:26
  • Patrick? JP? Whoever you are... when commenters comment under your post, you get notified. When you respond to them, it's customary to start out with, for example in response to me, "@phonetagger" (without the quotes). Then Stack Overflow notifies me that someone responded to a comment of mine. Otherwise the only way I'd know is to go check everywhere I've left a comment. As far as using or not using `malloc` goes, I'm sorry I included that in my example. It doesn't matter how you initially got a valid address, the issue is the same. You cannot circumvent the OS merely by using assembly code. – phonetagger Oct 17 '16 at 13:31
  • Multi-channel memory is not simply multiple sticks of RAM. It depends on the memory controller. If you have a dual-channel controller, then even when you plug in 16 sticks, you still get dual-channel bandwidth. Usually multi-channel memory uses the "word-interleaved" method, a "word" being 64 bits for desktop memory. Really, memory allocated at a lower address does not mean it lives only in the first stick. – user3528438 Oct 17 '16 at 14:10

0 Answers