
I want to allocate about 40 GB of RAM. My first try was:

#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];  // 3GB/s  40GB / 13.7 s
    unsigned long long i = 0;
    const clock_t begintime = clock(); 
    for (i = 0; i < ARRAYSIZE; ++i){
        myBuff[i] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}

The memory write speed was about 3 GB/s, which was not satisfactory for my high-performance system.

So I tried Intel Cilk Plus as below:

    /*
    nworkers =  5;  8.5 s ==> 4.7 GB/s
    nworkers =  8;  8.2 s ==> 4.8 GB/s
    nworkers =  10; 9   s ==> 4.5 GB/s
    nworkers =  32; 15  s ==> 2.6 GB/s
    */

#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "32")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}

The results are commented above the code. As can be seen, there is a speed-up for nworkers = 8, but the larger nworkers gets, the slower the fill becomes. I thought it might be due to locking between threads, so I tried the scalable allocator provided by Intel TBB:

#include "tbb\task_scheduler_init.h"
#include "tbb\blocked_range.h"
#include "tbb\parallel_for.h"
#include "tbb\scalable_allocator.h"
#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>
// No retry loop because we assume that scalable_malloc does
// all it takes to allocate the memory, so calling it repeatedly
// will not improve the situation at all
//
// No use of std::new_handler because it cannot be done in portable
// and thread-safe way (see sidebar)
//
// We throw std::bad_alloc() when scalable_malloc returns NULL
//(we return NULL if it is a no-throw implementation)

void* operator new (size_t size) throw (std::bad_alloc)
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    throw std::bad_alloc();
}

void* operator new[](size_t size) throw (std::bad_alloc)
{
    return operator new (size);
}

void* operator new (size_t size, const std::nothrow_t&) throw ()
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    return NULL;
}

void* operator new[](size_t size, const std::nothrow_t&) throw ()
{
    return operator new (size, std::nothrow);
}

void operator delete (void* ptr) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr) throw ()
{
    operator delete (ptr);
}

void operator delete (void* ptr, const std::nothrow_t&) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr, const std::nothrow_t&) throw ()
{
    operator delete (ptr, std::nothrow);
}



int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    tbb::task_scheduler_init tbb_init;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "10")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;

    std::cin.get();
    delete [] myBuff;
    return 0;
}

(The above code is adapted from the Intel TBB book by James Reinders, O'Reilly.) But the results are almost identical to the previous try. I set the TBB_VERSION environment variable to verify that I am really using scalable_malloc; the resulting information is in this picture (nworkers = 32):

https://www.dropbox.com/s/y1vril3f19mkf66/TBB_Info.png?dl=0

I would like to know what is wrong with my code. I expect the memory write speed to be at least about 40 GB/s.
How should I use the scalable allocator correctly?
Can somebody please present a simple, verified example of using the scalable allocator from Intel TBB?

Environment: Intel Xeon CPU E5-2690 0 @ 2.90 GHz (2 processors), 224 GB RAM (2 * 7 * 16 GB) DDR3 1600 MHz, Windows Server 2008 R2 Datacenter, Microsoft Visual Studio 2013 and Intel C++ Compiler 2017.

IndustProg
  • You say that the performance isn't satisfactory. What makes you think that you should be able to write at least 40 GB/s? – Sean Oct 02 '17 at 10:05
  • According to system configuration. of course, The memory write speed is about 50 GB/s after initializing. – IndustProg Oct 02 '17 at 10:11
  • You seem to use it correctly, except that you never free the allocated memory. But your question is actually muddled, because you switch from correct allocator usage to memory write speed expectations all of a sudden. And what is worse, you are trying to measure it by running a pointless array-filling loop that would definitely be [eliminated by the compiler in release mode](https://godbolt.org/g/Sh2gCP). – user7860670 Oct 02 '17 at 10:32
  • @gnts B can stand for bytes or bits. If you are measuring 5 gigabits on a system speced for 50 gigabytes, you are at 80% of theoretical capacity, for example. I am unable to determine what units Intel uses in their marketing. – Yakk - Adam Nevraumont Oct 02 '17 at 10:53
  • If you had enabled the optimizer, the compiler would have thrown away the loop in your first example. Enable the optimizer! –  Oct 02 '17 at 11:08
  • @VTT, Thanks, I edited my code to free allocated memory. I think when I use scalable memory allocation correctly, I will get benefits from multi-threading. What method do benchmark tools such as memTest86 employ to report memory bandwidth? it reports high memory bandwidth. What is your comment when I get about 50 GB/s bandwidth after initializing array. I know about first touch in Windows. But please pay attention to memory bandwidth. – IndustProg Oct 02 '17 at 12:41
  • @VTT, Another thing I want to mention: I get about 7 GB/s bandwidth on my personal computer with the same method. It is logical to expect higher bandwidth on a higher-end system. – IndustProg Oct 02 '17 at 12:48
  • @VTT, Sorry, I used SiSoftware Sandra. So what method do benchmark tools such as Sandra employ to report memory bandwidth? it reports high memory bandwidth. – IndustProg Oct 02 '17 at 12:57
  • @Yakk, My theoretical RAM bandwidth is `1,600,000,000 * 64 * 4 = 51,200,000,000 bits per second`. Isn't it reasonable to have bandwidth about 40 gigabytes per second? – IndustProg Oct 02 '17 at 13:01
  • @GntS 51 giga BITS is 6.4 giga BYTEs – Yakk - Adam Nevraumont Oct 02 '17 at 13:08
  • @Yakk, I am really sorry, My theoretical RAM bandwidth is `1,600,000,000 * 64 * 4 = 409,600,000,000 bits per second = 51,200,000,000 BYTEs per second` – IndustProg Oct 02 '17 at 13:18
  • @GntS Just make absolutely certain that is the case. I find quoted bandwidth sometimes uses one and sometimes the other. – Yakk - Adam Nevraumont Oct 02 '17 at 13:21
  • Next question: what are your optimization flags? Because a really naive unoptimized `myBuff[j] = 0` first reads `myBuff[j]` into a register, then assigns it zero, then writes it out. And it does so with 16 bit chunks at a time. An optimized one might write 128 bytes (or more) without ever reading. That could make a massive difference. Heck because you are writing more than a page, it could just mark the entire page as cleared and write a few bytes to memory for every 4k of RAM cleared. – Yakk - Adam Nevraumont Oct 02 '17 at 13:23
  • First of all, you should decide what are you actually testing. If it is allocator then there is no point to measure memory write speed, if it is memory write speed then there is no point to measure memory allocation performance, if it is both then let it be both. For reference you can measure performance of `VirtualAlloc` with `MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES` + `VirtualLock` that will allocate and zero out RAM. To measure write speed it would be better to use something like `_mm512_stream_si512` instead of naïve loop. – user7860670 Oct 02 '17 at 13:35
  • @VTT, I want to have some space on my RAM at maximum speed. As you know Windows is first touch in allocating memory that means the physical RAM is occupied when you write for the first time to your array and not when you allocate memory. So firstly I allocate memory in a scalable way using `INTEL TBB` and then try to write to it in parallel way by using `cilk_for`. My question is what is the fastest way to have some space on physical RAM? – IndustProg Oct 02 '17 at 13:52
  • TBB calls its malloc scalable not because the memory it is allocating scales magically better but because the allocation itself is scalable while your test is all about bandwidth of the memory itself – Anton Oct 02 '17 at 14:57
  • @Anton, Can you say please what part of my code should change? Did you mean that I should use `unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];` in a parallel way not only `myBuff[j] = 0;`? – IndustProg Oct 03 '17 at 12:07
  • Please change the title of the question to something like 'how to achieve maximum memory bandwidth', since it doesn't relate to TBB – Anton Oct 03 '17 at 12:22
  • @Anton, So you meant I used scalable allocator from intel tbb correctly? And other issues which are NOT related to intel tbb cause poor performance? – IndustProg Oct 03 '17 at 12:35
  • You are allocating memory once in main thread (certainly not in parallel as you claim elsewhere) in the most code-blowing way. Technically correct but makes no sense – Anton Oct 03 '17 at 12:50
  • @Anton, The reason that I chose this title was because I thought I did NOT employ scalable allocator from INTEL TBB correctly! And I understand from your comment that it is possible. – IndustProg Oct 03 '17 at 12:56

3 Answers


What to expect

From wikipedia: "DDR3-xxx denotes data transfer rate, and describes DDR chips, whereas PC3-xxxx denotes theoretical bandwidth (with the last two digits truncated), and is used to describe assembled DIMMs. Bandwidth is calculated by taking transfers per second and multiplying by eight. This is because DDR3 memory modules transfer data on a bus that is 64 data bits wide, and since a byte comprises 8 bits, this equates to 8 bytes of data per transfer."

So a single DDR3-1600 module can deliver at most 1600 * 8 = 12800 MB/s. Since your system has 4 channels per processor, you should be able to reach:

12800 * 4 = 51200 MB/s = 51.2 GB/s, which is what the CPU specifications state.

And

You have two CPUs and 8 channels in total: working in parallel, you should be able to reach double that. However, your system is a NUMA system - memory placement matters in this case...

But

You can put more than one memory module per channel. When you put more modules in a channel you reduce the available timings - a PC-1600 could behave as a PC-1333 or less, for example - and this is usually reported in the motherboard specifications. Example here.

You have seven modules, so your channels aren't populated equally... your bandwidth is limited by the slowest channel. It's recommended that channels are populated equally.

If you are downclocked to 1333, you can expect 1333 * 8 = 10666 MB/s per channel, i.e.:

42 GB/s per CPU

However

Channels are interleaved in the address space, so you are using all of them when zeroing large blocks of memory. You hit performance problems only when accessing memory with a strided access pattern.

Memory allocation is not memory access

TBB's scalable allocator lets MANY threads perform memory allocations efficiently: there is no global lock when allocating, so allocations aren't blocked or throttled by other threads' activity - which is what often happens with OS allocators.

In your example you are hardly allocating at all - a single allocation, from the main thread - and you are trying to measure maximum memory bandwidth. Memory access speed won't change when you switch allocators.

Reading the comments I see that you want to optimize the memory access.

Optimize the memory access

Replace the zeroing loop with a single call to memset() and let the compiler optimize/inline it - /O2 should be enough for that.

Rationale

The Intel compiler replaces many library calls (memset, memcpy, ...) with optimized intrinsics/inlined calls. In this context - zeroing a large block of RAM - inlining doesn't matter, but using the optimized intrinsics is very important: they use optimized streaming instructions (SSE4.2 / AVX).

However, the basic libc memset will outperform any hand-written loop - on Linux at least.

Sigi
  • `In your example you are not using many allocations at all, just one main thread.` What did you mean? I used `Cilk_for` to initialize my array. So multi threads are employed to write to memory. Another thing to be mentioned, I get maximum memory bandwidth when I write to my array for second time (about 72 GB/s). IOW after that I really have some space on physical RAM (first touch concept), I get the maximum memory bandwidth. But at first time as mentioned before, slow executing! – IndustProg Oct 03 '17 at 04:02
  • `In your example you are not using many allocations at all, just one main thread.` So what is your suggestion to not use only one main thread in allocating? Thanks. – IndustProg Oct 03 '17 at 04:08
  • give a try to the 1st code replacing the for loop with memset(). And find out at which exact frequency your memory modules are timed to find the real bandwidth you are supposed to obtain. – Sigi Oct 03 '17 at 12:24
  • I gave a try to the first code and used `memset()` instead of `for` loop. Time for 40 GB was 12.52s which means 3.3 GB/s. No significant speed up! – IndustProg Oct 04 '17 at 10:15

I can at least tell you why you don't get more than 25 GB/s.

Your CPU has a max RAM bandwidth of 51.2 GB/s according to Intel. DDR3-1600 has a max bandwidth of 25.6 GB/s according to Wikipedia.

That means that at least 2 RAM channels must be used to get more than 25 GB/s - and used almost constantly if you want to approach 40-50 GB/s.

For this, you would have to know how the OS splits memory addresses across the RAM slots, and parallelize the loop in such a way that the simultaneous memory accesses actually hit two RAM addresses that can be accessed in parallel. If the parallel accesses target nearby addresses at the 'same' time, they are likely to be on the same RAM stick and use only one channel, limiting the rate to a theoretical 25 GB/s. You might even need something that splits the allocation into chunks at separate addresses in multiple RAM slots, depending on how RAM addresses are distributed over the slots.

bartoli
  • I think four channels are used. I use `scalable_allocator` from Intel TBB to allocate memory in parallel. Did you mean it is not sufficient? Should I do something with my OS? – IndustProg Oct 02 '17 at 12:52
  • Only if you can guarantee that you program will constantly access memory of at least 2 channels. It will depend how your hardware splits physical ram into memory address and how your program allocates/accesses your buffer. Having multiple ram channels does not magically double one ram module's speed. You get the bonus only if all channels can be accessed at once – bartoli Oct 02 '17 at 14:27
  • memory channels are accessed interleaved: they are all used when accessing large chunks of memory sequentially. This is a hardware architecture detail, transparent to the OS activity. – Sigi Oct 02 '17 at 14:53

(continuation from comments)

Here is a performance test of some built-in functions, for reference. It measures the time required to reserve (by calling VirtualAlloc) and to bring into physical RAM (by calling VirtualLock) a 40 GB block of memory.

#include <sdkddkver.h>
#include <Windows.h>

#include <intrin.h>

#include <array>
#include <iostream>
#include <memory>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

void
Handle_Error(const ::LPCWSTR psz_what)
{
    const auto error_code{::GetLastError()};
    ::std::array<::WCHAR, 512> buffer;
    const auto format_result
    (
        ::FormatMessageW
        (
            FORMAT_MESSAGE_FROM_SYSTEM
        ,   nullptr
        ,   error_code
        ,   0
        ,   buffer.data()
        ,   static_cast<::DWORD>(buffer.size())
        ,   nullptr
        )
    );
    const auto formatted{0 != format_result};
    if(!formatted)
    {
        const auto & default_message{L"no description"};
        ::memcpy(buffer.data(), default_message, sizeof(default_message));
    }
    buffer.back() = L'\0'; // just in case
    _setmode(_fileno(stdout), _O_U16TEXT);
    ::std::wcout << psz_what << ", error # " << error_code << ": " << buffer.data() << ::std::endl;
    system("pause");
    exit(-1);
}

void
Enable_Previllege(const ::LPCWSTR psz_name)
{
    ::TOKEN_PRIVILEGES tkp{};
    if(FALSE == ::LookupPrivilegeValueW(nullptr, psz_name, ::std::addressof(tkp.Privileges[0].Luid)))
    {
        Handle_Error(L"LookupPrivilegeValueW call failed");
    }
    const auto this_process_handle(::GetCurrentProcess()); // Returns pseudo handle (HANDLE)-1, no need to call CloseHandle
    ::HANDLE token_handle{};
    if(FALSE == ::OpenProcessToken(this_process_handle, TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, ::std::addressof(token_handle)))
    {
        Handle_Error(L"OpenProcessToken call failed");
    }
    if(NULL == token_handle)
    {
        Handle_Error(L"OpenProcessToken call returned invalid token handle");
    }
    tkp.PrivilegeCount = 1;
    tkp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if(FALSE == ::AdjustTokenPrivileges(token_handle, FALSE, ::std::addressof(tkp), 0, nullptr, nullptr))
    {
        Handle_Error(L"AdjustTokenPrivileges call failed");
    }
    if(FALSE == ::CloseHandle(token_handle))
    {
        Handle_Error(L"CloseHandle call failed");
    }
}

int main()
{
    constexpr const auto bytes_count{::SIZE_T{40} * ::SIZE_T{1024} * ::SIZE_T{1024} * ::SIZE_T{1024}};
    //  Make sure we can adjust the working set size and lock memory.
    Enable_Previllege(SE_INCREASE_QUOTA_NAME);
    Enable_Previllege(SE_LOCK_MEMORY_NAME);
    //  Make sure our working set is sufficient to hold that block + some little extra.
    constexpr const ::SIZE_T working_set_bytes_count{bytes_count + ::SIZE_T{4 * 1024 * 1024}};
    if(FALSE == ::SetProcessWorkingSetSize(::GetCurrentProcess(), working_set_bytes_count, working_set_bytes_count))
    {
        Handle_Error(L"SetProcessWorkingSetSize call failed");
    }
    //  Start timer.
    ::LARGE_INTEGER start_time;
    if(FALSE == ::QueryPerformanceCounter(::std::addressof(start_time)))
    {
        Handle_Error(L"QueryPerformanceCounter call failed");
    }
    //  Run test.
    const ::SIZE_T min_large_page_bytes_count{::GetLargePageMinimum()}; // if 0 then not supported
    const ::DWORD allocation_flags
    {
        (0u != min_large_page_bytes_count)
        ?
        ::DWORD{MEM_COMMIT | MEM_RESERVE} // | MEM_LARGE_PAGES} // need to enable large pages support for current user first
        :
        ::DWORD{MEM_COMMIT | MEM_RESERVE}
    };
    if((0u != min_large_page_bytes_count) && (0u != (bytes_count % min_large_page_bytes_count)))
    {
        Handle_Error(L"bytes_count value is not suitable for large pages");
    }
    constexpr const ::DWORD protection_flags{PAGE_READWRITE};
    const auto p{::VirtualAlloc(nullptr, bytes_count, allocation_flags, protection_flags)};
    if(!p)
    {
        Handle_Error(L"VirtualAlloc call failed");
    }
    if(FALSE == ::VirtualLock(p, bytes_count))
    {
        Handle_Error(L"VirtualLock call failed");
    }
    //  Stop timer.
    ::LARGE_INTEGER finish_time;
    if(FALSE == ::QueryPerformanceCounter(::std::addressof(finish_time)))
    {
        Handle_Error(L"QueryPerformanceCounter call failed");
    }
    //  Cleanup.
    if(FALSE == ::VirtualUnlock(p, bytes_count))
    {
        Handle_Error(L"VirtualUnlock call failed");
    }
    if(FALSE == ::VirtualFree(p, 0, MEM_RELEASE))
    {
        Handle_Error(L"VirtualFree call failed");
    }
    //  Report results.
    ::LARGE_INTEGER freq;
    if(FALSE == ::QueryPerformanceFrequency(::std::addressof(freq)))
    {
        Handle_Error(L"QueryPerformanceFrequency call failed");
    }
    const auto elapsed_time_ms{((finish_time.QuadPart - start_time.QuadPart) * ::LONGLONG{1000u}) / freq.QuadPart};
    const auto rate_mbytesps{(bytes_count * ::SIZE_T{1000}) / static_cast<::SIZE_T>(elapsed_time_ms)};
    _setmode(_fileno(stdout), _O_U16TEXT);
    ::std::wcout << elapsed_time_ms << " ms " << rate_mbytesps << " MB/s " << ::std::endl;
    system("pause");
    return 0;
}

On my system, Windows 10 Pro, Xeon E3 1245 V5 @ 3.5GHz, 64 GB DDR4 (4x16), it outputs:

8188 ms 5245441250 MB/s

This code seems to utilize just a single core. The maximum from the CPU specs is 34.1 GB/s. Your first code snippet takes ~11.5 seconds here (in release mode VS does not omit the loop).

Enabling large pages will probably improve this a bit. Also note that pages locked with VirtualLock cannot be paged out to swap, unlike the scenario where you zero them manually. Large pages cannot be paged out at all.

user7860670
  • I ran similar code with `VirtualAllocExNuma`, but when I used `MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES` instead of `MEM_COMMIT | MEM_RESERVE` I got the error `1314` returned by `GetLastError` function which means: `A required privilege is not held by the client`. – IndustProg Oct 04 '17 at 12:22
  • By the way, what did you mean by `5245441250 MB/s` ? – IndustProg Oct 04 '17 at 12:31
  • @GntS I just forgot to divide by megabyte, so it is actually B/s. Error 1314 most likely means that you haven't enabled large pages for the current user. – user7860670 Oct 04 '17 at 18:12
  • How can I enable large pages? – IndustProg Oct 05 '17 at 04:23
  • @GntS [See this question](https://stackoverflow.com/questions/42354504/enable-large-pages-in-windows-programmatically). – user7860670 Oct 05 '17 at 04:31