Low-latency communication between threads in the same process

Question

The console application has 3 threads: Main, T1, T2. The goal is to 'signal' both T1 and T2 (and let them do some work) from the Main thread with the lowest possible latency (μs).

NOTE:

please ignore Jitter, GC etc. (I can handle that)
ElapsedLogger.WriteLine call cost is below 50ns (nano sec)

Have a look at the code below:

sample 1

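using System;
using System.Threading;

// ElapsedLogger is the asker's own low-overhead logger (not shown).
class Program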
{
    private static string msg = string.Empty;
    private static readonly CountdownEvent Countdown = new CountdownEvent(1);

    static void Main(string[] args)
    {
        while (true)
        {
            Countdown.Reset(1);
            var t1 = new Thread(Dowork) { Priority = ThreadPriority.Highest };
            var t2 = new Thread(Dowork) { Priority = ThreadPriority.Highest };
            t1.Start();
            t2.Start();

            Console.WriteLine("Type message and press [enter] to start");
            msg = Console.ReadLine();

            ElapsedLogger.WriteLine("Kick off!");
            Countdown.Signal();

            Thread.Sleep(250);
            ElapsedLogger.FlushToConsole();
        }
    }
    private static void Dowork()
    {
        string t = Thread.CurrentThread.ManagedThreadId.ToString();
        ElapsedLogger.WriteLine("{0} - Waiting...", t);

        Countdown.Wait();

        ElapsedLogger.WriteLine("{0} - Message received: {1}", t, msg);
    }
}

Output:

Type message and press [enter] to start
test3
20141028 12:03:24.230647|5 - Waiting...
20141028 12:03:24.230851|6 - Waiting...
20141028 12:03:30.640351|Kick off!
20141028 12:03:30.640392|5 - Message received: test3
20141028 12:03:30.640394|6 - Message received: test3

Type message and press [enter] to start
test4
20141028 12:03:30.891853|7 - Waiting...
20141028 12:03:30.892072|8 - Waiting...
20141028 12:03:42.024499|Kick off!
20141028 12:03:42.024538|7 - Message received: test4
20141028 12:03:42.024551|8 - Message received: test4

In the above code the 'latency' is around 40-50 μs. The CountdownEvent signaling call itself is very cheap (less than 50 ns), but the T1 and T2 threads are suspended and it takes time to wake them up.

sample 2

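using System;
using System.Threading;

// ElapsedLogger is the asker's own low-overhead logger (not shown).
class Program
{
    private static string _msg = string.Empty;
    // volatile so the spin loop in Dowork re-reads the flag on every iteration
    private static volatile bool _signal = false;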

    static void Main(string[] args)
    {
        while (true)
        {
            _signal = false;
            var t1 = new Thread(Dowork) {Priority = ThreadPriority.Highest};
            var t2 = new Thread(Dowork) {Priority = ThreadPriority.Highest};
            t1.Start();
            t2.Start();

            Console.WriteLine("Type message and press [enter] to start");
            _msg = Console.ReadLine();

            ElapsedLogger.WriteLine("Kick off!");
            _signal = true;

            Thread.Sleep(250);
            ElapsedLogger.FlushToConsole();
        }
    }
    private static void Dowork()
    {
        string t = Thread.CurrentThread.ManagedThreadId.ToString();
        ElapsedLogger.WriteLine("{0} - Waiting...", t);

        while (!_signal) { Thread.SpinWait(10); }

        ElapsedLogger.WriteLine("{0} - Message received: {1}", t, _msg);
    }
}

Output:

Type message and press [enter] to start
testMsg
20141028 11:56:57.829870|5 - Waiting...
20141028 11:56:57.830121|6 - Waiting...
20141028 11:57:05.456075|Kick off!
20141028 11:57:05.456081|6 - Message received: testMsg
20141028 11:57:05.456081|5 - Message received: testMsg

Type message and press [enter] to start
testMsg2
20141028 11:57:05.707528|7 - Waiting...
20141028 11:57:05.707754|8 - Waiting...
20141028 11:57:57.535549|Kick off!
20141028 11:57:57.535576|7 - Message received: testMsg2
20141028 11:57:57.535576|8 - Message received: testMsg2

This time the 'latency' is around 6-7 μs (but with high CPU usage). This is because the T1 and T2 threads are forced to stay active (they do nothing but burn CPU time).

In the 'real' application I cannot spin the CPU like that (I have far too many active threads, and it would make things worse/slower or even kill the server).

Is there anything I can use instead to drop the latency to around 10-15 μs? I guess the Producer/Consumer pattern won't be any quicker than using CountdownEvent. Wait/Pulse is also more expensive than CountdownEvent.

Is what I got in sample 1 the best I can achieve?

Any suggestions?

I'll try raw sockets as well when I have time.

Have you investigated `ManualResetEventSlim` and `SemaphoreSlim`? What about `Monitor.Wait` and `Monitor.Pulse`? — Jim Mischel, Oct 30 '14 at 02:31
@JimMischel: I've tried all of them and result is pretty much the same. — Novitzky, Oct 30 '14 at 09:58

score 3 · Answer 1 · answered Jul 03 '18 at 15:12

You tried to oversimplify this, and then whichever way you turn something is going to bite you. Thread.SpinWait(int) was never meant to be used alone, as a blunt instrument. To use it you need to pre-calculate, essentially calibrate (based on the current system info, clock, and scheduler interrupt timer interval), the optimal number of iterations for the spin lock. After you exhaust that budget you need to voluntarily sleep/yield/wait. The whole arrangement is usually called a 2-level wait or 2-phase wait.
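A minimal sketch of such a 2-phase wait, assuming a fixed spin budget that you would calibrate yourself. `TwoPhaseSignal` and `spinBudget` are hypothetical names used only for illustration; the event is created with a spin count of 0 so that all spinning happens in the explicit first phase.

```csharp
using System;
using System.Threading;

// Hypothetical helper: spin for a bounded budget, then block.
public sealed class TwoPhaseSignal
{
    // spinCount 0: the event itself does not spin, we do that in phase one.
    private readonly ManualResetEventSlim _event = new ManualResetEventSlim(false, 0);
    private readonly int _spinBudget;
    private volatile bool _set;

    public TwoPhaseSignal(int spinBudget)
    {
        _spinBudget = spinBudget;
    }

    public void Set()
    {
        _set = true;   // seen by waiters still in the spin phase
        _event.Set();  // wakes waiters that already blocked
    }

    // Returns true if the signal was seen while spinning,
    // false if the caller had to fall through to the blocking phase.
    public bool Wait()
    {
        for (int i = 0; i < _spinBudget; i++)
        {
            if (_set) return true;
            Thread.SpinWait(20); // short busy-wait between checks
        }
        _event.Wait(); // phase two: give up the CPU until signaled
        return false;
    }
}
```

A worker would call `Wait()` where sample 2 has its `while (!_signal)` loop, and the Main thread would call `Set()` instead of assigning the flag. The right budget depends on the machine and load, so the value passed in is something to measure, not a recommendation.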

You need to be aware that once you cross that line, your minimal latency is the scheduler interrupt timer interval (as reported by ClockRes from Sysinternals; at least 1 ms on Win10. If any "measurement" gives you a lower value, either the measurement is broken or you didn't really go to sleep). On Server 2016 the minimum is 12 ms.

How you measure is very important. If you call some kernel functions to measure local/in-process time, that will give you seductively low numbers, but they are not real. If you use QueryPerformanceCounter (the Stopwatch class uses it), the measurement resolution is 1000 real ticks (1/3 μs on a 3 GHz CPU). If you use RDTSC, the nominal resolution is the CPU clock, but that's terribly jittery and gives you an illusion of precision that's not there. These 333 ns are the absolute smallest interval you can measure reliably without VTune or a hardware tracer.
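Rather than assuming a resolution, you can ask the Stopwatch class what it reports on the machine in question. The sketch below is only a sanity check, not a substitute for a hardware tracer; `TimerInfo` and its method names are made up for this example.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for inspecting the Stopwatch timer on this machine.
public static class TimerInfo
{
    // Nanoseconds represented by one Stopwatch tick.
    public static double NanosecondsPerTick()
    {
        return 1_000_000_000.0 / Stopwatch.Frequency;
    }

    // Converts a tick delta from Stopwatch.GetTimestamp() to microseconds.
    public static double TicksToMicroseconds(long ticks)
    {
        return ticks * 1_000_000.0 / Stopwatch.Frequency;
    }

    // Smallest non-zero step seen between back-to-back timestamp reads.
    // Returns long.MaxValue if no step was observed in 'samples' reads.
    public static long SmallestObservedStep(int samples)
    {
        long min = long.MaxValue;
        long prev = Stopwatch.GetTimestamp();
        for (int i = 0; i < samples; i++)
        {
            long now = Stopwatch.GetTimestamp();
            long delta = now - prev;
            if (delta > 0 && delta < min) min = delta;
            prev = now;
        }
        return min;
    }
}
```

Printing `TimerInfo.NanosecondsPerTick()` and `TimerInfo.TicksToMicroseconds(TimerInfo.SmallestObservedStep(1000000))` shows the granularity your timestamps can actually have, which is worth knowing before trusting differences of a few microseconds.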

On to Sleepers

Thread.Yield() is the lightest, but with a caveat. On an idle system it's a nop, so you are back to a too-tight spinner. On a busy system it costs at least the time until the next scheduler interval, which is almost the same as Sleep(0) but without the overhead. Also, it will switch only to a thread that's already scheduled to run on the same core, which means it has a higher chance of degenerating into a nop.

The SpinWait struct is the next lightest. It does its own 2-level wait, but with hard spin and yield, meaning that it still needs a real 2nd level. But it does the counting math for you and will tell you when it's going to yield, which you can take as a signal to go to sleep.
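One way to use that hint is sketched below: spin with `SpinWait.SpinOnce()` until `NextSpinWillYield` becomes true, then block on an event instead of yielding. `SpinThenBlock` and `WaitFor` are hypothetical names, and the caller is assumed to make the `ready` condition true before setting the fallback event.

```csharp
using System;
using System.Threading;

// Hypothetical helper: let SpinWait do the counting, then go to sleep.
public static class SpinThenBlock
{
    // Spins until 'ready' returns true or SpinWait reports that its next
    // step would yield; in the latter case blocks on 'fallback' instead.
    // Returns the number of spin steps that were performed.
    public static int WaitFor(Func<bool> ready, ManualResetEventSlim fallback)
    {
        var spinner = new SpinWait();
        while (!ready())
        {
            if (spinner.NextSpinWillYield)
            {
                fallback.Wait(); // real 2nd level: block until signaled
                break;
            }
            spinner.SpinOnce();
        }
        return spinner.Count;
    }
}
```

On a single-processor machine `NextSpinWillYield` is true from the start, so this degrades to a plain blocking wait there.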

ManualResetEventSlim is the next lightest, and on a busy system it might be faster than yield, since it can continue if the threads involved didn't go to sleep and their quantum budget is not exhausted.
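To see the difference between a waiter that is still spinning and one that has already blocked, a rough experiment along these lines may help. `SlimEventLatency`, `MeasureOnce`, and the `delayMs` parameter are invented for this sketch; it uses the `ManualResetEventSlim(bool, int)` constructor to set the spin count and Stopwatch timestamps for timing.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical micro-experiment, not a rigorous benchmark.
public static class SlimEventLatency
{
    // Returns the time in microseconds between calling Set() and the
    // worker returning from Wait(). A larger delayMs makes it more
    // likely that the worker has finished spinning and is blocked.
    public static double MeasureOnce(int spinCount, int delayMs)
    {
        using (var gate = new ManualResetEventSlim(false, spinCount))
        using (var started = new ManualResetEventSlim(false))
        {
            long wokeAt = 0;
            var worker = new Thread(() =>
            {
                started.Set();   // tell the caller we are about to wait
                gate.Wait();     // spins up to spinCount, then blocks
                wokeAt = Stopwatch.GetTimestamp();
            });
            worker.Start();
            started.Wait();
            if (delayMs > 0) Thread.Sleep(delayMs);

            long setAt = Stopwatch.GetTimestamp();
            gate.Set();
            worker.Join();       // also makes wokeAt safely visible here
            return (wokeAt - setAt) * 1000000.0 / Stopwatch.Frequency;
        }
    }
}
```

Comparing `MeasureOnce(1000, 0)` with `MeasureOnce(1000, 50)` over many runs gives a feel for how much of the latency comes from waking a blocked thread. Single runs are noisy, so treat individual numbers with suspicion.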

Thread.Sleep(int) is next. Sleep(0) is considered lighter since it doesn't have time evaluation and yields only to threads with the same or higher priority, but for your low-latency purposes that doesn't mean much. Sleep(1) unconditionally yields even to lower-priority threads and has a time evaluation code path, but the minimal timer slice is 1 ms anyway. Both end up sleeping longer, since on a busy system there are always plenty of threads with the same or higher priority to make sure that it won't have much chance of running in the next slice.

Raising thread priorities to real time level will help only temporarily. Kernel has a defense mechanism that will kick their priorities down after a short run - meaning that you'll need to keep re-raising them every time they run. Windows is not an RTOS.

Any time you go to sleep, via any method, you have to expect at least one time slice delay. Avoiding such delay is exactly the use case for spin locks. Condition Variables could be a potential "middle ground" in theory, but since C#/.NET don't have native support for that you'd have to import a DLL and call native functions, and there is no guarantee that they'll be ultra responsive. Immediate wake-up is never guaranteed, even in C++. To do something like that you'd have to hijack an interrupt: impossible in .NET, very hard in C++, and risky.

Using CPU time is actually not bad if your cores are memory bound and starved, which is routinely the case with CPU oversubscription (too many threads for the number of cores) and large in-memory crawlers (indexes, graphs, anything else you keep locked in memory on the GB scale). Then they don't have anything else to do anyway.

If however you are computation intensive (ALU and FPU bound) then spinning can be bad.

Hyperthreading is always bad. Under stress it will heat up cores a lot and lower perf since they are fake pseudo-processors with very little truly independent hardware. Thread.Yield() was more or less invented to lower the pressure from hyperthreading but if you are chasing low latency first rule is - turn hyperthreads off for good.

Also be aware that any measurement of these kinds of things without a hardware tracer or VTune, and without careful management of thread-core affinities, is pointless. You'll see all kinds of mirages and won't see what's really important: the effect of thrashed CPU caches, their latency, and memory latency. Plus, you really need a test box that is a replica of what's running live in production, since a huge number of factors depend on nuances of concrete usage patterns, and they are not reproducible on a substantially different configuration.

Reserving Cores

You'll need to reserve a number of cores for exclusive use by your latency-critical threads, 1 thread per core if it's very critical. If you go with 1-1 then plain spinning is perfectly fine. Otherwise yield is perfectly fine. This is the real use case for the SpinWait struct, and having that reserved and clean state is the first precondition. With a 1-1 setup relatively simple measurements become relevant again, and even RDTSC becomes smooth enough for regular use.
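As a starting point for the affinity side of this, the sketch below builds a bitmask for a set of core indices and applies it to the whole process through `Process.ProcessorAffinity`. This is only part of the picture: it is an assumption of this example that restricting the process is acceptable, it does not pin individual managed threads, and keeping other processes off those cores has to be arranged separately. `CoreReservation` and its members are hypothetical names.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

// Hypothetical helper for working with processor affinity masks.
public static class CoreReservation
{
    // Builds a bitmask with one bit set per requested core index (0-63).
    public static long BuildAffinityMask(params int[] cores)
    {
        long mask = 0;
        foreach (int core in cores)
        {
            if (core < 0 || core >= 64)
                throw new ArgumentOutOfRangeException(nameof(cores));
            mask |= 1L << core;
        }
        return mask;
    }

    // Restricts the current process to the given cores.
    // Process.ProcessorAffinity is only supported on Windows and Linux.
    public static void RestrictProcessTo(params int[] cores)
    {
        bool supported = RuntimeInformation.IsOSPlatform(OSPlatform.Windows)
                      || RuntimeInformation.IsOSPlatform(OSPlatform.Linux);
        if (!supported) throw new PlatformNotSupportedException();

        Process.GetCurrentProcess().ProcessorAffinity =
            (IntPtr)BuildAffinityMask(cores);
    }
}
```

For example, `CoreReservation.RestrictProcessTo(2, 3)` limits the process to cores 2 and 3 (mask 12). Which cores to pick, and how to keep everything else away from them, depends on the box.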

That realm of carefully guarded cores and super-threads can be your own little RTOS, but you need to be very careful and you have to manage everything. You can't go to sleep; if you do, you are back to the scheduler time slice delay.

If your state is very deterministic and you have calculated that N of them have time to run before the usual latency budget is spent, you can go for fibers, and then you control everything.

The number of these super-threads per core depends on what they are doing, whether they are memory bound, how much memory they need, and how many of them can coexist in the same cache without thrashing each other's lines. You need to do the math for all 3 caches and be conservative. This is also where VTune or a hardware tracer can help a lot: then you can just run and see.

Oh and the hardware doesn't have to be prohibitively expensive for these things anymore. Ryzen Threadripper with 16 cores can do it just fine.

score 1 · Answer 2 · answered Oct 29 '14 at 19:06

I agree that the SpinWait() approach is not realistic for production use. Your threads will have to go to sleep and be woken up.

I see you looked at Wait/Pulse. Have you benchmarked any of the other primitives available in .NET? Joe Albahari's "Threading in C#" has an exhaustive review of all of your options. http://www.albahari.com/threading/part4.aspx#_Signaling_with_Wait_and_Pulse

One point I want to touch on: How confident are you in the timestamps produced by ElapsedLogger?

sevzas

Thanks for the link - I know his 'Threading in C#' very well. I've tried benchmarking a lot of other things, always using the same settings (hardware/OS/optimization/GC/jitter/etc). WaitAndPulse signaling gives me the same latency as CountdownEvent. Monitor.PulseAll() call is more expensive than CountdownEvent.Signal() but both under 100-150ns. Regarding the ElapsedLogger, resolution is way below 1us and I am pretty much sure about that. – Novitzky Oct 30 '14 at 00:32

score 1 · Accepted Answer · answered Nov 03 '14 at 01:32

There's not a whole lot that can be done, since the other thread has to be scheduled by the OS.

Increasing the priority of the waiting thread is the only thing likely to make much difference, and you've already done that. You could go even higher.

If you really need the lowest possible latency for activation of another task, you should turn it into a function that can be called directly from the triggering thread.
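A rough sketch of that idea, assuming the other workers wait on a shared `ManualResetEventSlim`: the triggering thread signals them and then runs the most latency-critical piece of work itself, so that piece never waits for the scheduler. `InlineCriticalWork` and `Trigger` are hypothetical names.

```csharp
using System;
using System.Threading;

// Hypothetical helper: run the most urgent task on the triggering thread.
public static class InlineCriticalWork
{
    public static void Trigger(Action mostCritical, ManualResetEventSlim wakeOthers)
    {
        wakeOthers.Set();  // start waking the remaining workers
        mostCritical();    // runs right away on the caller's thread
    }
}
```

With N latency-critical tasks this removes the wake-up delay for one of them only; the other N-1 still pay it.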

Ben Voigt

Ben, thanks for the answer. Two questions: 1) How can I go even higher with thread priority? A Highest thread is scheduled before threads with any other priority. 2) If I execute a function directly from the calling thread there will be no multi-threading at all. I will then execute tasks one by one rather than in parallel. Are you suggesting this function kick off another thread instead? I've already tried this and I cannot see any improvement. – Novitzky Nov 03 '14 at 20:22
1) You used "Highest priority" which is not as high as THREAD_PRIORITY_TIME_CRITICAL. 2) You have three actions taking place right now -- waking two threads and continuing the current function. Whichever one of the three tasks those three threads are responsible for is the most latency critical should be performed on the current thread. – Ben Voigt Nov 03 '14 at 20:54
Thanks for THREAD_PRIORITY_TIME_CRITICAL. I'll play with it. Regarding point 2, in real app I have between 1 to 8 threads which are latency critical (all need to be triggered by other thread). – Novitzky Nov 03 '14 at 23:36
Based on having "between 1 and 8 threads", this sounds a bit like a thread pool/worker pool. Maybe you can set things up in such a way that any given worker thread is capable of using either the CountdownEvent or the spin wait. Perhaps the next worker to receive work can be using spin wait, but the rest can be using CountdownEvent. Once a worker receives work, the next worker in line starts to spin wait. This way you get the lower latency of the spin wait for practically all tasks, but you use just one core instead of one core per worker. – sevzas Nov 24 '14 at 13:35
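For anyone wanting to try the THREAD_PRIORITY_TIME_CRITICAL suggestion from the comments: the managed ThreadPriority enum has no such member, so one option is to call the Win32 `SetThreadPriority` function through P/Invoke from inside the worker thread. The sketch below is Windows-only and untested against the asker's setup; `TimeCriticalPriority` and `TryRaiseCurrentThread` are made-up names, while the two kernel32 functions and the constant value 15 come from the Win32 API.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Threading;

// Hypothetical helper; must be called from the thread to be raised.
public static class TimeCriticalPriority
{
    private const int THREAD_PRIORITY_TIME_CRITICAL = 15; // Win32 constant

    [DllImport("kernel32.dll")]
    private static extern IntPtr GetCurrentThread(); // pseudo-handle, not closed

    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool SetThreadPriority(IntPtr hThread, int nPriority);

    // Returns false on non-Windows platforms or if the call fails.
    public static bool TryRaiseCurrentThread()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
            return false;

        // Ask the runtime to keep this managed thread on its OS thread.
        Thread.BeginThreadAffinity();
        return SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    }
}
```

Calling `TimeCriticalPriority.TryRaiseCurrentThread()` as the first statement of `Dowork` would apply it to T1 and T2. Whether it changes the measured wake-up latency is something only a test on the target machine can tell.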
