My program uses shared memory as data storage. This data must be available to every running application, and fetching it must be fast. But some applications can run on different NUMA nodes, and for them access to this data is really expensive. Is duplicating the data for every NUMA node the only way to do this?

Evgeny Lazin
  • It very much depends on how (in what order) your program accesses the data, and how it writes it – osgx Aug 05 '11 at 11:38
  • The memory access pattern is absolutely unpredictable – Evgeny Lazin Aug 05 '11 at 11:53
  • So it is also "absolutely unpredictable" how to make the program NUMA-ready. You can just run the program as on an SMP machine, and if it has a bad memory access pattern, it will run slowly. (NUMA allows access to another node's memory, just at a higher cost than local memory.) – osgx Aug 05 '11 at 11:56

1 Answer


There are two primary sources of slowdown that can be attributed to NUMA. The first is the increased latency of remote access, which varies with the platform. On the platforms that I work with, there is about a 30% latency hit.

The other source of performance loss can come from contention over the communication links and controllers between NUMA nodes.

The default allocation scheme on Linux is first-touch: data is placed on the node of the thread that first writes it. If the majority of the data in the application is initialized by a single thread, it all ends up on that thread's node, generating a lot of cross-NUMA-node traffic and contention for that one memory node.
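
Under first-touch you can avoid piling everything onto one node by initializing the data from the threads that will later use it. A minimal sketch with OpenMP (the loop and sizes are illustrative, not from the original answer; compile with -fopenmp and pin the threads, e.g. with OMP_PROC_BIND=true, so they stay on their nodes):

    #include <stdlib.h>

    int main(void)
    {
        size_t n = 100u * 1000 * 1000;
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;

        /* First touch places each page on the node of the touching
           thread, so initialize with the same thread partitioning the
           compute loops will use (static schedule keeps it stable). */
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < n; i++)
            a[i] = 0.0;

        /* ... compute loops with the same schedule(static) split ... */

        free(a);
        return 0;
    }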

If your data is read-only, then replication is a good solution.
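
As a sketch of what per-node replication could look like with libnuma (the helper names replicate_readonly and local_copy are made up for illustration; real code should check numa_available() first and release each copy with numa_free()):

    #define _GNU_SOURCE
    #include <numa.h>    /* numa_alloc_onnode, numa_max_node; link with -lnuma */
    #include <sched.h>   /* sched_getcpu */
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: keep one copy of a read-only blob on every node. */
    static void **replicate_readonly(const void *src, size_t len)
    {
        int nnodes = numa_max_node() + 1;
        void **copies = calloc(nnodes, sizeof *copies);
        if (!copies) return NULL;
        for (int n = 0; n < nnodes; n++) {
            copies[n] = numa_alloc_onnode(len, n); /* pages bound to node n */
            if (copies[n])
                memcpy(copies[n], src, len);
        }
        return copies;
    }

    /* Each reader picks the copy that lives on its own node. */
    static const void *local_copy(void **copies)
    {
        return copies[numa_node_of_cpu(sched_getcpu())];
    }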

Otherwise, interleaving the allocation across all your nodes will distribute the requests among them and help relieve congestion.

To interleave the data, you can use set_mempolicy() from <numaif.h> if you are using Linux.
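
A minimal sketch, assuming the machine has few enough nodes that the mask fits in a single unsigned long (error handling trimmed; the buffer size is arbitrary):

    #define _GNU_SOURCE
    #include <numaif.h>  /* set_mempolicy, MPOL_INTERLEAVE; link with -lnuma */
    #include <numa.h>    /* numa_available, numa_max_node */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this machine\n");
            return 1;
        }

        /* Build a mask containing every node on the machine. */
        unsigned long nodemask = 0;
        for (int n = 0; n <= numa_max_node(); n++)
            nodemask |= 1UL << n;

        /* From now on, pages this thread allocates are placed
           round-robin across the nodes in the mask. */
        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask, 8 * sizeof nodemask)) {
            perror("set_mempolicy");
            return 1;
        }

        char *buf = malloc(64UL << 20);       /* 64 MiB, arbitrary */
        if (!buf) return 1;
        memset(buf, 0, 64UL << 20);           /* first touch places the pages */

        set_mempolicy(MPOL_DEFAULT, NULL, 0); /* restore the default policy */
        free(buf);
        return 0;
    }

libnuma also offers a higher-level wrapper, numa_set_interleave_mask(numa_all_nodes_ptr), which does the same without building the mask by hand.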

osgx
Mark