4

In my HFT trading application I have several places where I receive data from the network. In most cases this is just a thread that only receives and processes data. Below is part of such processing:

    public Receiver(IPAddress mcastGroup, int mcastPort, IPAddress ipSource)
    {
        thread = new Thread(ReceiveData);

        s = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
        s.ReceiveBufferSize = ReceiveBufferSize;

        var ipPort = new IPEndPoint(LISTEN_INTERFACE /* IPAddress.Any */, mcastPort);
        s.Bind(ipPort);

        // 12-byte source-specific multicast option: group, source, interface
        option = new byte[12];
        Buffer.BlockCopy(mcastGroup.GetAddressBytes(), 0, option, 0, 4);
        Buffer.BlockCopy(ipSource.GetAddressBytes(), 0, option, 4, 4);
        Buffer.BlockCopy(LISTEN_INTERFACE.GetAddressBytes() /* IPAddress.Any */, 0, option, 8, 4);

        // the original snippet built `option` but never applied it
        s.SetSocketOption(SocketOptionLevel.IP, SocketOptionName.AddSourceMembership, option);
    }

    public void ReceiveData()
    {
        byte[] byteIn = new byte[4096];
        while (needReceive)
        {
            if (IsConnected)
            {
                int count = 0;
                try
                {
                    count = s.Receive(byteIn); // blocks until a datagram arrives
                }
                catch (Exception ex)
                {
                    Console.WriteLine(ex.Message);
                    Log.Push(LogItemType.Error, ex.Message);
                    return; // any socket error ends the receive loop
                }
                if (count > 0)
                {
                    OnNewMessage(new NewMessageEventArgs(byteIn, count));
                }
            }
        }
    }

This thread runs forever once created. I just wonder whether I should configure it to run on a certain core. I need the lowest latency, so I want to avoid context switches. As I want to avoid context switches, I'd better run the same thread on the same processor core, right?

Taking into account that I need the lowest latency, is it correct that:

  • it would be better to set "thread affinity" for most of the "long-running" threads?
  • it would be better to set "thread affinity" for the thread in my example above?

I am rewriting the code above in C++ right now, to port it to Linux later, if that matters; however, I assume my question is more about hardware than about language or OS.
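For the C++/Linux port, the 12-byte option buffer above corresponds to `struct ip_mreq_source`, used with `IP_ADD_SOURCE_MEMBERSHIP`. One caveat: the struct's field order differs between Windows and glibc, so filling the named fields is safer than copying raw bytes. A minimal sketch (the helper name and the addresses in the comment are illustrative):

```cpp
#include <netinet/in.h>  // ip_mreq_source, IP_ADD_SOURCE_MEMBERSHIP
#include <arpa/inet.h>   // inet_addr
#include <sys/socket.h>  // setsockopt

// Build the source-specific multicast membership request that the
// 12-byte C# option buffer encodes: group, source, and interface.
// Using named fields avoids depending on platform struct layout.
ip_mreq_source make_source_membership(const char* group,
                                      const char* source,
                                      const char* iface)
{
    ip_mreq_source m{};
    m.imr_multiaddr.s_addr  = inet_addr(group);   // mcastGroup
    m.imr_sourceaddr.s_addr = inet_addr(source);  // ipSource
    m.imr_interface.s_addr  = inet_addr(iface);   // LISTEN_INTERFACE
    return m;
}

// Applied to a bound UDP socket fd roughly like this:
//   ip_mreq_source m = make_source_membership("239.1.1.1", "10.0.0.1", "0.0.0.0");
//   setsockopt(fd, IPPROTO_IP, IP_ADD_SOURCE_MEMBERSHIP, &m, sizeof(m));
```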

Oleg Vazhnev
    context switch refers to switching between threads. This is not affected by thread affinity. OS still has to schedule this and other threads. – Zdeslav Vojkovic Mar 08 '13 at 13:52
  • No matter what you use, latency will not be guaranteed: many threads execute in parallel on different cores, and as soon as you want some other thread scheduled there are chores to be performed on the OS side, and those are not fixed-time tasks. – Narendra Pathai Mar 08 '13 at 13:52
  • Programs should not be written so that they totally depend on fixed latency. Such programs break easily. – Narendra Pathai Mar 08 '13 at 13:53
  • I don't need guaranteed latency; I need to minimize latency. It's trading: the lower the latency, the more money I can earn. – Oleg Vazhnev Mar 08 '13 at 13:57
  • Put it this way: there have been several posts about affinity bodges and how to do them in an attempt to improve latency/performance. AFAIK, nobody has yet posted any follow-up saying that their app performance improved at all. – Martin James Mar 08 '13 at 14:00
  • No, don't do it. It **may** be better in a test but it'll break. Find reasons here: http://blogs.microsoft.co.il/blogs/sasha/archive/2008/04/20/parallelism-and-cpu-affinity.aspx – Adriano Repetti Mar 08 '13 at 14:00

3 Answers

2

I think the approach with as little latency as possible would be to pin each of your threads to one core and set them to real-time priority (or whatever the highest one is).

This will cause the OS to evict any other thread which happens to use that core.

Hopefully the CPU cache will still contain useful data when your thread gets scheduled there. For that reason I like the idea of pinning to a core.

You should probably set your entire process to a high priority class and minimize other activity on your box. Also turn off unused hardware because it might generate interrupts. Fix your NIC's interrupts to a different CPU core (some better NICs can do that).

usr
  • Running something on Realtime priority on Windows isn't necessarily a good idea, it will make it impossible to control the machine if the process consumes all CPU. Just saying. – Tony The Lion Mar 08 '13 at 14:23
  • @TonyTheLion: power cycle fodder. – Martin James Mar 08 '13 at 14:38
  • fighting for nanoseconds of latency seems strange, as GC thread can interrupt execution at any moment :) – Zdeslav Vojkovic Mar 08 '13 at 14:57
  • @ZdeslavVojkovic - also, 'evicting' a thread running on another core is messy/lengthy - means hardware-interrupt of the other core. – Martin James Mar 08 '13 at 15:06
  • I am surely not convinced that other design changes would not have more effect without big downside. Pooling buffers instead of new(), using IOCP to avoid data copying and excessive kernel transitions, probably much more stuff. That, and rewrite in C or C++ so no GC. – Martin James Mar 08 '13 at 15:10
  • @MartinJames I agree with all of your points. Still, optimizing scheduling is a good idea because the old saying "leave scheduling to the OS" is simply impractical in certain cases like this one. Scheduling does not come out even nearly optimal in practice. Pinning threads to cores is a well-proven technique for very low latency or high throughput special-purpose apps. – usr Mar 08 '13 at 15:40
1

As I want to avoid context switches, I'd better run the same thread on the same processor core, right?

No. A context switch will not necessarily be avoided by setting affinity to one CPU. You have no control over context switches, they are in the hands of the OS thread scheduler. They occur when a thread quantum (time slice) has elapsed or when a higher priority thread interrupts your thread.

The latency you talk about, which I assume is network or memory latency, is not avoided at all by setting thread affinity. Memory latency can be reduced by making your code cache-friendly (i.e. so it all fits in the L1/L2 caches, for example). Network latency is simply a property of any network, and not something I suspect you can do much about.

Tony The Lion
  • Yes, so +1. I presume s.Receive() blocks, so no point in any affinity bodges. – Martin James Mar 08 '13 at 13:57
  • why not set thread affinity for my code? why does the thread from my example need to travel between cores? isn't it better to run it on the same core? – Oleg Vazhnev Mar 08 '13 at 13:59
  • @javapowered why do you so desperately want to have it on one core? What difference does it make to you? Unless you have a strong argument for doing so, I see no need to enforce that on the system. – Tony The Lion Mar 08 '13 at 14:02
  • @javapowered context switches and affinity are different things. Anyway no, **it's not better** because you do **not have exclusive control of that core/cpu**. If you lock your thread to that core it may be executed much later because **OS decided to use it for something else** too... – Adriano Repetti Mar 08 '13 at 14:02
  • @javapowered because the core you bound to may well be in use by another thread when your network thread becomes ready. What do you suggest happens then? – Martin James Mar 08 '13 at 14:02
  • I have a lot of processor power, but I need minimal latency; microseconds are important. I want to avoid moving "latency-critical" threads from core to core or from processor to processor (because it's obviously an expensive operation; at the least the cache needs to be "reloaded"). For example, having 12 cores, I want to dedicate 6 cores to 6 "hot" threads and use the other 6 cores for everything else... – Oleg Vazhnev Mar 08 '13 at 14:09
  • @javapowered you can't force the OS not to use them to, let's say, run Notepad. So if you're at that level, you'd better consider using another OS for such "real-time" operations. Forcing affinity may work 80% of the time (hiding the true problem), but it's a bomb that may explode. – Adriano Repetti Mar 08 '13 at 14:18
  • @Adriano to the best of my knowledge it's possible to force Linux (regular, not real-time) to dedicate a core to just one thread; however, it's probably not trivial to configure, and I can't tell you exactly how to do it right now. – Oleg Vazhnev Mar 08 '13 at 14:20
  • @javapowered so far it's C#, and managed threads may not even map to physical threads. Moreover, not only software but hardware too (interrupts) may use _your_ core. I think we've read the same answer 1000, 1000000 times. Do you need real-time (or very strict scheduling)? Then do not use a _common_ OS. No "if" and no "maybe". It's always a compromise, and if you're developing a mission-critical system you can't accept that it may fail. – Adriano Repetti Mar 08 '13 at 14:21
  • @Adriano I see, I can configure the hardware to use another core. I'm rewriting in C++ right now, so it will not be C#. And even in C# it's possible to set thread affinity; there are third-party examples. Of course, in C# it's recommended not to do so... – Oleg Vazhnev Mar 08 '13 at 14:24
  • @Adriano I'm not working at NASA; it's trading. I need minimal **average** delay. It's OK to fail sometimes. – Oleg Vazhnev Mar 08 '13 at 14:25
  • @javapowered if what you require is just **good performance** are you sure it's a bottleneck? Did you measure actual average performance of your system? – Adriano Repetti Mar 08 '13 at 14:27
  • @Adriano I need low latency more than raw performance. Yes, I measure latency; however, I don't know exactly how to measure the cost of a thread being moved from one core to another. I don't even know how to detect such situations. So I decided to run the thread on a dedicated core and check whether overall performance would be better. – Oleg Vazhnev Mar 08 '13 at 14:31
  • @javapowered fair, but you should measure the real-world situation – Adriano Repetti Mar 08 '13 at 14:33
  • @javapowered - fine! Please do test it, and let us know. – Martin James Mar 08 '13 at 14:37
1

As Tony The Lion has already answered your question, I would like to address your comment:

"why not setting thread afinity to my code? why thread from my example need to travel between cores?"

Your thread doesn't travel anywhere.

A context switch happens when the OS thread scheduler decides to give your thread a slice of time to execute. The environment is then prepared for your thread, e.g. the CPU registers are set to the correct values, etc. This is called a context switch.

So regardless of thread affinity, the same CPU setup work has to be done, whether it is the same CPU/core that was used in the previous slice when your thread was running or a different one. And at that moment, your computer has more information to do it properly than you do at compile time.

You seem to believe that the thread somehow resides on the CPU, but that is not so. What you are using is a logical thread, and there can be hundreds or even thousands of them. Common CPUs, OTOH, usually have 1 or 2 hardware threads per core, and your logical thread gets mapped to one of these every time it is scheduled, even if the OS always picks the same HW thread.

EDIT: it seems that you have already picked the answer you want to hear, and I don't like long discussion threads on answers, so I will put it here.

  • you should try it and measure it. I believe that you will be disappointed
  • running some threads at high priority might easily mess up other processes
  • you are worried about context-switch latency, but you have no problem with the GC thread freezing your thread? BTW, on which core will your GC thread run? :)
  • what if your highest-priority thread blocks the GC thread? memory leaks? do you know the priority of that thread, so you are sure it would work?
  • really, why not C or hand-optimized assembly if microseconds are important?
  • as someone suggested, you should use an RTOS if you want to control this aspect of execution
  • it doesn't seem likely that your data travels through the data center only 4-5 times slower than it takes to set up a thread context on one machine, but who knows...
Zdeslav Vojkovic
  • one of my threads uses "busy waiting"; however, no CPU core was loaded to 100%. Only when I set "thread affinity" for this thread was that core loaded to 100%. So the thread was being moved between cores. Of course that's very bad for latency, because it defeats the idea of caching; each processor has its own cache, and each core also has some local cache, AFAIK. – Oleg Vazhnev Mar 08 '13 at 14:29
  • there is no transferring. When your thread is interrupted, the CPU doesn't know anything about it until the next time the thread is scheduled; your thread is not somehow left on the old CPU. What you see is that the OS schedules your thread 4 times to the same core instead of once each to 4 different cores, but every time it schedules it, it does the same amount of work. The increase in that CPU's load is more or less the same as the decrease in load on the other cores, and the time is more or less the same (leaving aside cache issues) – Zdeslav Vojkovic Mar 08 '13 at 14:33
  • so what is thread affinity all about? are you saying it is an absolutely useless option? – Oleg Vazhnev Mar 08 '13 at 14:35
  • disagree; network latency is constant, say 300 microseconds. So if one guy has an overall latency of 350 microseconds and another of 355 microseconds, the first guy will earn all the money and the second guy will earn nothing. – Oleg Vazhnev Mar 08 '13 at 14:37
  • network latency is pretty much constant in the co-location data centers of the exchange, and people pay a lot of money for this service. – Oleg Vazhnev Mar 08 '13 at 14:40
  • I have removed the comments to here and added them to my answer, to avoid long discussion – Zdeslav Vojkovic Mar 08 '13 at 14:55
  • Oh yeah - C#, GC, missed that:(( – Martin James Mar 08 '13 at 15:02
  • 'one my thread uses "busy waiting"' - missed that as well. Memory-bandwidth down the drain. – Martin James Mar 08 '13 at 15:12
  • the more one knows about threading, the less he is willing to mess with it. – Zdeslav Vojkovic Mar 08 '13 at 15:16