Worse multithreaded performance on better system (possibly due to Deedle)

Question

We are dealing with a multithreaded C# service using Deedle. Tests on a quad-core current system versus an octa-core target system show that the service is about two times slower on the target system instead of two times faster. Even when restricting the number of threads to two, the target system is still almost 40% slower.

Analysis shows a lot of waiting in Deedle(/F#), making the target system basically run on two cores. Non-Deedle test programs show normal behaviour and superior memory bandwidth on the target system.

Any ideas on what could cause this or how to best approach this situation?

EDIT: It seems most of the waiting time is spent in calls to Invoke.

I don't think Deedle does any sophisticated thread synchronization that would cause the program to run slowly - though if you are accessing the same frame/series from multiple threads, it might have an effect on CPU caches - it sounds more like the overhead of parallelization is greater than the benefit from it. — Tomas Petricek, Jul 11 '16 at 12:40
@TomasPetricek As for our usage, the threads can be seen as pretty much independent. Of course, it's hard for me to say what happens in libraries (or even deeper in CLR). As far as I have been able to measure (e.g. with Intel's PCM), caching isn't an issue. The odd thing is that you would at least expect it to run similarly to the current system when limiting the number of threads (on both systems). — mweerden, Jul 11 '16 at 13:54
Do you have enough sleep statements in the code that make the thread go idle as soon as it has nothing to do anymore? It seems threads are competing, they may be just looping without taking breaks? — Martin Maat, Jul 15 '16 at 09:28
@MartinMaat There are no sleep statements, but I don't believe that is necessary in most modern contexts. Afaik, there is pretty fair scheduling going on. Also, see my answer below; the issue has been fixed. ;) — mweerden, Jul 15 '16 at 09:39
You don't want fair scheduling, you want the thread that is doing the work to get the juice, not the ones that are just looping (those should be sleeping most of the time). I don't know how your service is set up but considering your answer I think you may have a resource hog on your hands. — Martin Maat, Jul 15 '16 at 14:07

score 1 · Accepted Answer · answered Jul 15 '16 at 09:12

The problem turned out to be a combination of using Windows 7, .NET 4.5 (or actually the 4.0 runtime) and the heavy use of tail recursion in F#/Deedle.

Using Visual Studio's Concurrency Visualizer, I had already found that most of the time is spent waiting in Invoke calls. On closer inspection, these result in the following call trace:

ntdll.dll:RtlEnterCriticalSection
ntdll.dll:RtlpLookupDynamicFunctionEntry
ntdll.dll:RtlLookupFunctionEntry
clr.dll:JIT_TailCall
<some Deedle/F# thing>.Invoke

Searching for these functions gave multiple articles and forum threads indicating that using F# can result in a lot of calls to JIT_TailCall and that .NET 4.6 has a new JIT compiler that seems to deal with some issues relating to these calls. Although I didn't find anything mentioning problems relating to locking/synchronisation, this did give me the idea that updating to .NET 4.6 might be a solution.

However, on my own Windows 8.1 system that also uses .NET 4.5, the problem doesn't occur. After searching a bit for similar Invoke calls, I found that the call trace on this system looked as follows:

ntdll.dll:RtlAcquireSRWLockShared
ntdll.dll:RtlpLookupDynamicFunctionEntry
ntdll.dll:RtlLookupFunctionEntry
clr.dll:JIT_TailCall
<some Deedle/F# thing>.Invoke

Apparently, in Windows 8(.1) the function-entry lookup takes a shared (reader) lock instead of entering an exclusive critical section, so threads performing the lookup at the same time spend far less time waiting for each other.
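
As a rough analogy for the difference between the two traces, the sketch below contrasts an exclusive lock with a shared (reader) lock in C#. It does not reproduce what ntdll does internally: `LockDemo` and its methods are hypothetical names, and `Monitor` and `ReaderWriterLockSlim` merely stand in for a critical section and a slim reader/writer lock acquired in shared mode.

```csharp
using System;
using System.Threading;

public static class LockDemo
{
    // Runs `threads` workers that repeatedly enter a guarded region and
    // returns the highest number of workers seen inside it at the same time.
    public static int MaxConcurrency(int threads, Action enter, Action exit)
    {
        int inside = 0, max = 0;
        var workers = new Thread[threads];
        for (int i = 0; i < threads; i++)
        {
            workers[i] = new Thread(() =>
            {
                for (int n = 0; n < 2000; n++)
                {
                    enter();
                    try
                    {
                        int now = Interlocked.Increment(ref inside);
                        // Lock-free equivalent of "max = Math.Max(max, now)".
                        int seen;
                        while (now > (seen = Volatile.Read(ref max)))
                            Interlocked.CompareExchange(ref max, now, seen);
                        Thread.SpinWait(200); // stand-in for the lookup work (assumption)
                        Interlocked.Decrement(ref inside);
                    }
                    finally { exit(); }
                }
            });
            workers[i].Start();
        }
        foreach (var w in workers) w.Join();
        return max;
    }

    // Exclusive lock: only one thread can be inside the region at a time.
    public static int WithExclusiveLock(int threads)
    {
        var gate = new object();
        return MaxConcurrency(threads, () => Monitor.Enter(gate), () => Monitor.Exit(gate));
    }

    // Shared lock: any number of readers may be inside the region together.
    public static int WithSharedLock(int threads)
    {
        using (var rw = new ReaderWriterLockSlim())
            return MaxConcurrency(threads, rw.EnterReadLock, rw.ExitReadLock);
    }
}
```

`LockDemo.WithExclusiveLock(8)` always returns 1, because every thread has to wait its turn. `LockDemo.WithSharedLock(8)` can return anything up to 8, since readers do not block each other. With many threads making tail calls at once, that is the difference between serialized and concurrent lookups.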

So F#'s heavy use of tail recursion only caused problems with the target system's combination of Windows 7's strict locking and .NET 4.5's less efficient JIT compiler. After updating to .NET 4.6, the problem disappeared and our service is running as expected.
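
To check whether a machine already has the 4.6 runtime, one option is to read the `Release` value that the .NET Framework installer writes to the registry. This is a minimal sketch, assuming a Windows machine and the documented minimum of 393295 for .NET Framework 4.6. `RuntimeCheck` and its members are hypothetical names, not part of Deedle or our service.

```csharp
using System;
using Microsoft.Win32;

public static class RuntimeCheck
{
    // Lowest "Release" value documented for .NET Framework 4.6.
    public const int Net46MinRelease = 393295;

    public static bool IsNet46OrLater(int release)
    {
        return release >= Net46MinRelease;
    }

    // Reads the installed 4.x release number, or null if no 4.5+ framework is found.
    public static int? ReadInstalledRelease()
    {
        const string subkey = @"SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full";
        using (var baseKey = RegistryKey.OpenBaseKey(RegistryHive.LocalMachine, RegistryView.Registry32))
        using (var key = baseKey.OpenSubKey(subkey))
        {
            object value = key == null ? null : key.GetValue("Release");
            return value == null ? (int?)null : Convert.ToInt32(value);
        }
    }
}
```

Calling `RuntimeCheck.ReadInstalledRelease()` and passing a non-null result to `IsNet46OrLater` tells you whether the update is still needed. Since 4.6 replaces 4.5 in place, the value reflects the runtime that applications built against 4.5 will actually run on.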
