I am experimenting with NUMA on a machine that has 4 Opteron 6272 processors, running CentOS. There are 8 NUMA nodes, each with 16GB of memory.
Here is the small test program I'm running:
#include <cstddef>
#include <iostream>
#include <pthread.h>
#include <sched.h>
#include <numa.h>

// Pin the calling thread to one core so all allocations and first touches happen there.
void pin_to_core(size_t core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

int main()
{
    pin_to_core( 0 );

    size_t bufSize = 100;
    for( int i = 0; i < 131000; ++i )
    {
        // Every 10 iterations, report how much memory is free on each NUMA node.
        if( !(i % 10) )
        {
            std::cout << i << std::endl;
            long long free = 0;
            for( unsigned j = 0; j < 8; ++j )
            {
                numa_node_size64( j, &free );
                std::cout << "Free on node " << j << ": " << free << std::endl;
            }
        }

        // Allocate a small buffer on NUMA node 5, touch every byte, and leak it.
        char* buf = (char*)numa_alloc_onnode( bufSize, 5 );
        for( unsigned j = 0; j < bufSize; ++j )
            buf[j] = j;
    }

    return 0;
}
So basically a thread running on core #0 allocates 131000 100-byte buffers on NUMA node 5, initializes them with junk, and leaks them. Once every 10 iterations it prints out how much memory is free on each NUMA node.
At the beginning of the output I get:
0
Free on node 0: 16115879936
Free on node 1: 16667398144
Free on node 2: 16730402816
Free on node 3: 16529108992
Free on node 4: 16624508928
Free on node 5: 16361529344
Free on node 6: 16747118592
Free on node 7: 16631336960
...
And at the end I'm getting:
Free on node 0: 15826657280
Free on node 1: 16667123712
Free on node 2: 16731033600
Free on node 3: 16529358848
Free on node 4: 16624885760
Free on node 5: 16093630464
Free on node 6: 16747384832
Free on node 7: 16631332864
130970
Free on node 0: 15826657280
Free on node 1: 16667123712
Free on node 2: 16731033600
Free on node 3: 16529358848
Free on node 4: 16624885760
Free on node 5: 16093630464
Free on node 6: 16747384832
Free on node 7: 16631332864
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
mbind: Cannot allocate memory
130980
...
Things that are not clear to me:
1) Why are there those "mbind: Cannot allocate memory" messages? The fact that I'm far from using up all of the memory, and that the behaviour doesn't change if I change the buffer size to, say, 1000, leads me to think that I'm running out of some kind of kernel resource handle (a way to check this is sketched right after these questions).
2) Even though I asked for the memory to be allocated on node 5, the actual allocations seem to have been split between nodes 0 and 5.
Can anyone please give any insights into why this is happening?
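Regarding the guess in (1): as far as I can tell numa_alloc_onnode() creates a fresh memory mapping per call, so one way to check whether some per-process mapping limit is being hit (rather than memory actually running out) would be to watch how many distinct mappings the process holds as the loop runs. A minimal sketch, purely illustrative, that counts the lines of /proc/self/maps:

#include <fstream>
#include <string>
#include <cstddef>

// Count the entries in /proc/self/maps, i.e. the number of distinct
// memory mappings the current process currently holds.
std::size_t count_mappings()
{
    std::ifstream maps("/proc/self/maps");
    std::string line;
    std::size_t n = 0;
    while (std::getline(maps, line))
        ++n;
    return n;
}

Printing this next to the per-node free sizes would show whether the failures coincide with the mapping count reaching some fixed ceiling.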
UPDATE
I'd like to give more detail on point (2). The fact that some of the memory isn't allocated on node 5 seems to have something to do with the fact that we are initializing the buffer on core #0 (which belongs to NUMA node 0). If I change pin_to_core(0) to pin_to_core(8), then the allocated memory is split between nodes 1 and 5. If it is pin_to_core(40), then all the memory is allocated on node 5.
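For reference, the core-to-node mapping I'm relying on here (core 0 on node 0, core 8 on node 1, core 40 on node 5) can be checked with numa_node_of_cpu(); a small sketch, assuming libnuma 2.x:

#include <iostream>
#include <numa.h>

// Print which NUMA node each logical CPU belongs to.
int main()
{
    if( numa_available() < 0 )
        return 1;
    for( int cpu = 0; cpu < numa_num_configured_cpus(); ++cpu )
        std::cout << "cpu " << cpu << " -> node " << numa_node_of_cpu( cpu ) << std::endl;
    return 0;
}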
UPDATE2
I've looked at the source code of libnuma and tried replacing the call to numa_alloc_onnode() with the lower-level calls it is built on: mmap() and mbind(). I'm now also checking which NUMA node the memory resides on, using the move_pages() call. The results are as follows. Before initialization (the loop over j) the page is not mapped to any node (I get the ENOENT error code), and after initialization the page is assigned either to node 0 or to node 5. The pattern is regular: 5, 0, 5, 0, ... As before, when we get close to the 131000th iteration the calls to mbind() start returning error codes, and when this happens the page is always allocated on node 0. The error code returned by mbind() is ENOMEM; the documentation says this means running out of "kernel memory". I don't know what that is, but it can't be "physical" memory, because I have 16GB per node.
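For completeness, this is roughly the shape of the low-level replacement I'm describing. It's a sketch rather than my exact code, with the target node passed in explicitly and MPOL_BIND chosen as the policy (MPOL_PREFERRED is the softer alternative):

#include <sys/mman.h>
#include <numaif.h>
#include <cstddef>
#include <cstdio>
#include <cerrno>

// Allocate 'size' bytes of anonymous memory and ask the kernel to place
// the pages on 'node' via mbind(). Returns NULL if mmap() fails.
void* alloc_on_node(size_t size, int node)
{
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    unsigned long nodemask = 1UL << node;
    if (mbind(p, size, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
    {
        // Near the 131000th iteration this starts failing with ENOMEM for me;
        // the mapping still exists and gets placed by first touch.
        std::perror("mbind");
    }
    return p;
}

// Report which node a page currently resides on. Negative values are
// -errno, e.g. -ENOENT if the page has not been faulted in yet.
int node_of_page(void* page)
{
    int status = 0;
    if (move_pages(0, 1, &page, NULL, &status, 0) != 0)
        return -errno;
    return status;
}

As far as I can tell, mbind() only records the placement policy for the mapping; the physical pages are not allocated until they are first touched, which would explain why move_pages() reports ENOENT before the initialization loop.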
So here are my conclusions so far:
1) The restrictions on memory placement imposed by mbind() are upheld only 50% of the time when a core on another NUMA node touches the memory first. I wish this were documented somewhere, because quietly breaking a promise is not nice...
2) There is a limit on the number of calls to mbind(). So one should mbind() big memory chunks whenever possible (a sketch of what I mean follows below).
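To illustrate conclusion (2), this is the kind of thing I have in mind: one mmap()/mbind() pair per large chunk, with the small buffers carved out of it by a trivial bump allocator. A rough sketch, with no alignment handling or thread safety:

#include <sys/mman.h>
#include <numaif.h>
#include <cstddef>

// One big region bound to a single node with a single mbind() call;
// small buffers are handed out from it with a bump pointer.
struct NodeArena
{
    char*  base;
    size_t size;
    size_t used;

    bool init(size_t bytes, int node)
    {
        void* p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return false;
        unsigned long nodemask = 1UL << node;
        if (mbind(p, bytes, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        {
            munmap(p, bytes);
            return false;
        }
        base = (char*)p;
        size = bytes;
        used = 0;
        return true;
    }

    void* alloc(size_t bytes)
    {
        if (used + bytes > size)
            return NULL;           // arena exhausted
        void* p = base + used;
        used += bytes;
        return p;
    }
};

Since 131000 buffers of 100 bytes come to only about 13MB, the whole test above would then need a single mbind() instead of 131000 of them.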
The approach that I'm going to try is: do memory allocation tasks on threads that are pinned to cores of particular NUMA nodes. For extra peace of mind I will try calling mlock() (because of issues described here).
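I haven't tried this yet, but here is a rough sketch of what I have in mind. The core number is passed in by the caller (e.g. core 40, which on this box belongs to node 5), and the mlock() return value is ignored for brevity:

#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <numa.h>
#include <cstddef>

struct AllocRequest
{
    size_t core;    // a core that belongs to the target NUMA node
    int    node;    // target NUMA node
    size_t bytes;
    void*  result;
};

// Runs on a thread pinned to a core of the target node, so that both the
// allocation and the first touch happen locally.
static void* alloc_worker(void* arg)
{
    AllocRequest* req = (AllocRequest*)arg;

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(req->core, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);

    char* buf = (char*)numa_alloc_onnode(req->bytes, req->node);
    if (buf)
    {
        for (size_t i = 0; i < req->bytes; ++i)   // first touch from the right node
            buf[i] = 0;
        mlock(buf, req->bytes);                   // keep the pages resident
    }
    req->result = buf;
    return NULL;
}

void* alloc_on_node_pinned(size_t bytes, int node, size_t core)
{
    AllocRequest req = { core, node, bytes, NULL };
    pthread_t t;
    pthread_create(&t, NULL, alloc_worker, &req);
    pthread_join(t, NULL);
    return req.result;
}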