
I'm developing a low-latency HFT trading application.

I'm currently using a single-CPU machine because it's much easier to configure and maintain (no need to tune NUMA). Also, assuming we have enough resources, it should be no slower than a dual-CPU setup, and likely a little faster, because there is no QPI/NUMA latency.

HFT requires a lot of resources, and I now realize I need many more cores. Colocating two 1U single-CPU machines is also much more expensive than colocating one 1U dual-CPU machine, so even if I could "split" my program across two machines, it would still make sense to use a 1U dual-CPU machine.

So how much should I fear QPI/NUMA latency? If I move my application from a single-CPU machine to a dual-CPU machine, how much slower can it get? The most I can afford is a delay of several microseconds, not more. Can QPI/NUMA introduce a significant delay if not tuned correctly, and how significant would that delay be?

Is it possible to write an application which runs much slower (more than several microseconds slower) on a dual-CPU setup than on a single-CPU setup, i.e. runs much slower on a faster computer? (Assuming, of course, the same processors, memory, network card and everything else.)

Oleg Vazhnev
  • I think this question is too vague. What kind of answer do you expect? _"QPI will add 12 microseconds of latency"_, or something like that? For an "arbitrary application" at that. With NUMA you have to be careful not to access memory that is assigned to different cores. If you do, both cores suffer delays. (I don't have numbers for that, sorry. But it should be quantifiable, so maybe that is your question.) But it's not rocket science to avoid this issue. – Daniel Darabos Oct 26 '15 at 21:52
  • I'll agree with the honourable @DanielDarabos, though I seem to remember that accessing the neighbour's RAM costs you around 100 cycles extra, or about 25 ns. – Surt Oct 26 '15 at 21:59
  • You need to be using [low-latency hardware](http://www.supermicro.com/products/system/2u/6027/sys-6027ax-72rf-hft1.cfm). Also, a lot of your reasoning is just incorrect. For example, more physical CPUs means more caches, which has a massive effect on latency. – David Schwartz Oct 26 '15 at 21:59
  • I agree that the question is vague. I expect someone to explain why and how QPI slows down an application, what the worst-case scenario is, and how bad it is. For example, is it possible that at some moment QPI starts moving several gigabytes of RAM from one NUMA node to another, introducing a huge delay, say 100 microseconds or something like that? – Oleg Vazhnev Oct 26 '15 at 22:36
  • @DavidSchwartz I'm familiar with this hardware; right now I'm using regular hardware. Do you think that a 2-socket E5-2687 v3 will be faster than a 1-socket E5-2697 v3? (2 × 10 cores vs 14 cores) – Oleg Vazhnev Oct 26 '15 at 22:43

1 Answer


This is not trivially answerable, since it depends on so many factors. Is the code written for NUMA?

Is the code doing mostly reads, mostly writes or about equal? How much data is shared between threads that run on separate CPUs? How often is such data written to, forcing cache-refresh?

How do tasks get scheduled, and how and when does the OS decide to move threads from one CPU socket to the other?

Do the code and data fit in cache?

Those are just a few factors that will change the results dramatically between a "works really well" and "gives really poor performance".

As with EVERYTHING performance-related, details can make a huge difference, and reading answers like this one on the internet will not give you a reliable answer that applies to YOUR situation. Benchmark your application, check performance counters and tweak based on that. [Given the price for a machine of the specs you describe in the comments above, I'd expect the supplier to allow some sort of test, demo, "try before you buy", etc.]

Assume a worst-case scenario: a memory access straddles two cache lines (an unaligned access of an 8-byte value, for example), the two halves sit on your worst-placed nodes, and the MMU needs reloading, with each of the page-table entries involved also sitting on the worst possible node. Since the two halves of the value live in different locations, you need a new TLB entry for each of the two 4-byte reads that make up your 64-bit load. (Each TLB entry is a separate location.)

That works out to roughly 2 x 4 x n remote accesses, where n is something like 50-100 ns each. So one memory access could, at least in theory, take on the order of 1600 ns, i.e. 1.6 microseconds. It's unlikely that you will get MUCH worse than this for a single operation, and the overhead is a lot less than, for example, swapping to disk, which can add milliseconds to your execution time.
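
If you want to put a number on the local-versus-remote difference on your own hardware, a small pointer-chasing probe is enough. The sketch below is just an illustration, not part of any real application: it assumes a Linux machine with libnuma installed and at least two NUMA nodes, and the buffer size and iteration count are arbitrary.

```cpp
// Compile with: g++ -O2 -std=c++11 numa_probe.cpp -lnuma
// Compares dependent-load latency to node-0 memory vs node-1 memory
// while the measuring thread stays on node 0.
#include <numa.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

static volatile size_t g_sink;  // keeps the chase loop from being optimised away

// Walk a random cycle through `slots` pointer-sized entries; returns ns per load.
static double chase(size_t* mem, size_t slots, size_t steps) {
    std::vector<size_t> order(slots);
    std::iota(order.begin(), order.end(), size_t(0));
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (size_t i = 0; i < slots; ++i)
        mem[order[i]] = order[(i + 1) % slots];   // build one big random cycle

    auto t0 = std::chrono::steady_clock::now();
    size_t idx = 0;
    for (size_t i = 0; i < steps; ++i)
        idx = mem[idx];                           // dependent loads, one at a time
    auto t1 = std::chrono::steady_clock::now();
    g_sink = idx;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    const size_t bytes = 256u << 20;              // 256 MB, well past the caches
    const size_t slots = bytes / sizeof(size_t);
    const size_t steps = 10000000;

    numa_run_on_node(0);                          // keep this thread on node 0
    for (int node = 0; node <= 1; ++node) {       // measure node-0 vs node-1 memory
        size_t* mem = static_cast<size_t*>(numa_alloc_onnode(bytes, node));
        if (!mem) { std::printf("allocation on node %d failed\n", node); continue; }
        double ns = chase(mem, slots, steps);
        std::printf("memory on node %d: %.1f ns per dependent load\n", node, ns);
        numa_free(mem, bytes);
    }
    return 0;
}
```

On a typical two-socket system you'd expect the node-1 figure to come out a few tens of nanoseconds higher per load than the node-0 figure; the exact gap depends on the hardware, which is why measuring it yourself is worthwhile.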

It is not very hard to write code that updates the same cache line from multiple CPUs and thus causes a dramatic reduction in performance. I remember, a long time back when I first had an Athlon SMP system, running a simple benchmark where the author did this in a Dhrystone benchmark:

int numberOfRuns[MAX_CPUS];   // one counter per CPU, but adjacent entries share a cache line

Now, numberOfRuns holds the outer loop counter for each CPU, and since the counters for both CPUs sit in the same cache line, updating the counter on each iteration, on either CPU, causes "false sharing": every time one CPU updated its counter, the other CPU had to flush and re-fetch that cache line.

Running this on a 2-core SMP system gave about 30% of the single-CPU performance, i.e. roughly 3 times SLOWER than one CPU, rather than faster as you'd expect. (This was some 12 or so years ago, so my memory may be a little off on the exact details, but the essence of the story is still true: a badly written application can run slower on multiple cores than on a single core.)

I'd expect at least that much of a performance hit on a modern system where you have false sharing of commonly used variables.
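
To make that failure mode concrete, here is a minimal sketch of the same mistake and the usual fix, which is to pad each per-thread counter out to its own cache line. The thread count, iteration count and names are made up for illustration; atomics with relaxed ordering are used so the compiler can't collapse the increments.

```cpp
// Minimal false-sharing demo: per-thread counters packed together vs padded
// to separate cache lines. Compile with: g++ -O2 -std=c++11 -pthread fs.cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

const int  kThreads = 4;
const long kIters   = 50000000;

// BAD: adjacent counters share a cache line, so every increment on one core
// invalidates that line in every other core's cache.
std::atomic<long> packed[kThreads];

// BETTER: pad each counter out to its own 64-byte cache line.
struct alignas(64) Padded { std::atomic<long> value; };
Padded padded[kThreads];

template <typename Fn>
static double run(Fn body) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < kThreads; ++t) pool.emplace_back(body, t);
    for (auto& th : pool) th.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    double shared_s = run([](int t) {
        for (long i = 0; i < kIters; ++i)
            packed[t].fetch_add(1, std::memory_order_relaxed);
    });
    double padded_s = run([](int t) {
        for (long i = 0; i < kIters; ++i)
            padded[t].value.fetch_add(1, std::memory_order_relaxed);
    });
    std::printf("packed: %.2f s   padded: %.2f s\n", shared_s, padded_s);
    return 0;
}
```

The two loops do identical work; the only difference is whether the counters can end up in the same cache line, and that alone is usually enough to show a large gap between the two timings, especially when the threads land on different sockets.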

In comparison, well-written code should run nearly N times faster if there is little or no sharing between CPU cores. I have a highly CPU-bound, multithreaded calculator for weird numbers which gives a near N-times performance gain on both my single-socket system at home and my two-socket system at work.

$ time ./weird -t 1 -e 100000

real    0m22.641s
user    0m22.660s
sys 0m0.003s

$ time ./weird -t 6 -e 100000

real    0m5.096s
user    0m25.333s
sys 0m0.005s

So about 11% overhead in total CPU time (25.3 s of user time vs 22.7 s single-threaded). That comes from sharing one variable [the current number], which is atomically updated between the threads (using C++ standard atomics). Unfortunately, I don't have a good example of "badly written code" to contrast this against.
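
The sharing pattern boils down to threads claiming the next number to test from one shared atomic counter. A stripped-down sketch of that idea looks like this; isWeird() and the bounds are invented stand-ins, not the actual calculator.

```cpp
// Sketch of the "one shared atomic counter" pattern: threads claim the next
// number to test with a single fetch_add.
// Compile with: g++ -O2 -std=c++11 -pthread calc.cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical per-number test; the real calculator would do real work here.
static bool isWeird(long n) { return n % 7 == 3; }

int main(int argc, char** argv) {
    const int  nThreads = (argc > 1) ? std::atoi(argv[1]) : 6;
    const long limit    = 100000;

    std::atomic<long> next(1);    // the one variable shared by all threads
    std::atomic<long> hits(0);

    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t) {
        pool.emplace_back([&] {
            for (;;) {
                // Claim the next candidate; this counter (and the hit counter)
                // is the only cross-core traffic in the loop.
                long n = next.fetch_add(1, std::memory_order_relaxed);
                if (n > limit) break;
                if (isWeird(n)) hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    std::printf("%ld hits up to %ld using %d threads\n", hits.load(), limit, nThreads);
    return 0;
}
```

Because each thread spends almost all of its time on its own work and only touches the shared counter once per item, the cross-socket traffic stays small, which is why the overhead stays in the ~10% range rather than dominating the run time.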

Mats Petersson
  • The code is not optimized for NUMA at all (but it is ready for a multicore machine, of course). Assuming the system is not tuned for NUMA either, what can I expect for the application: will it run faster or slower? NUMA adds latency, but having more cores means I have a better "threads-per-core" ratio. – Oleg Vazhnev Oct 26 '15 at 23:09
  • Clearly the only answer to that question is "maybe": sharing of data, the OS moving threads between sockets, and "fits in cache or not" will be the most critical factors. But without access to your code, it's impossible to even guess. – Mats Petersson Oct 26 '15 at 23:24