Parallel.For() slows down with repeated execution. What should I look at?

Question

I wrote a naive Parallel.For() loop in C#, shown below. I also did the same work using a regular for() loop to compare single-thread vs. multi-thread. The single thread version took about five seconds every time I ran it. The parallel version took about three seconds at first, but if I ran it about four times, it would slow down dramatically. Most often it took about thirty seconds. One time it took eighty seconds. If I restarted the program, the parallel version would start out fast again, but slow down after three or four parallel runs. Sometimes the parallel runs would speed up again to the original three seconds then slow down.

I wrote another Parallel.For() loop for computing Mandelbrot set members (discarding the results) because I figured that the problem might be related to memory issues allocating and manipulating a large array. The Parallel.For() implementation of this second problem does indeed execute faster than the single-thread version every time, and the times are consistent too.

What data should I be looking at to understand why my first naive program slows down after a number of runs? Is there something in Perfmon I should be looking at? I still suspect it is memory related, but I allocate the array outside the timer. I also tried a GC.Collect() at the end of each run, but that didn't seem to help, not consistently anyway. Might it be an alignment issue with cache somewhere on the processor? How would I figure that out? Is there anything else that might be the cause?

JR

    const int _meg = 1024 * 1024;
    const int _len = 1024 * _meg;

    private void ParallelArray() {
        int[] stuff = new int[_meg];
        System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
        lblStart.Content = DateTime.Now.ToString();
        s.Start();

        Parallel.For(0,
            _len,
            i => {
                stuff[i % _meg] = i;
            }
            );
        s.Stop();

        lblResult.Content = DateTime.Now.ToString();

        lblDiff.Content = s.ElapsedMilliseconds.ToString();

    }

I believe that is what I posted. The "ParallelArray()" routine was called off a button click, and the lbl* elements were labels in a .NET 4.5 WPF application. Are you having trouble executing it? — jrv, Sep 11 '14 at 17:53
It could be problem with this machine and not inherent to the problem/the Data Parallel Library/manipulating large arrays, but I would still be interested in trying to find out what is causing it. — jrv, Sep 11 '14 at 18:07
Is your code doing anything else while `Parallel.For` is running? Have you tried isolating just this method as I have? — Zer0, Sep 11 '14 at 18:13
It's a demo program. There's nothing else going on. I was trying to create a simple demonstration of the Task Parallel Library. — jrv, Sep 11 '14 at 18:20
FWIW I didn't use WPF. I ran my test in a console app. Worked just fine on my end. — Zer0, Sep 11 '14 at 18:57
I created a console app that ran the ParallelArray() routine ten times. I had six calls at about 2.5 seconds, and four calls at roughly 25 seconds, which is consistent in flavor with what I saw in WPF. — jrv, Sep 11 '14 at 19:25
At this point I'd run it under a profiler. See what's actually happening when it slows down. I'd gladly help if I could reproduce. — Zer0, Sep 11 '14 at 19:40
You are running so little work per work item that you shouldn't expect any speedup anyway. The Parallel.* methods have considerable overhead. Even the delegate call alone could be slower than the modulo operator and the store operation. (This of course does not explain the slowdown. Just a general remark that you won't be successful with this approach.) — usr, Sep 11 '14 at 22:35
The speedup was about 2:1 when everything worked as expected. As you say, it probably wasn't the best choice to demonstrate, but it was the first thing that came to mind. — jrv, Sep 12 '14 at 19:36
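One way to act on usr's remark about per-item overhead is to hand each delegate call a whole range of indices instead of a single index. The sketch below does that with Partitioner.Create from System.Collections.Concurrent. The class and method names (ChunkedFill, Fill) are made up for illustration, and this is only a sketch of the idea, not something anyone in the thread measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ChunkedFill
{
    // Hypothetical helper: performs the same stores as the loop in the
    // question, but one delegate call covers a whole index range.
    public static void Fill(int[] stuff, int len)
    {
        int size = stuff.Length;

        // Partitioner.Create(from, to) yields Tuple<int, int> ranges;
        // Item1 is inclusive and Item2 is exclusive.
        Parallel.ForEach(Partitioner.Create(0, len), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                stuff[i % size] = i;
            }
        });
    }
}
```

Calling ChunkedFill.Fill(new int[_meg], _len) would stand in for the Parallel.For call in ParallelArray(). Whether it avoids the slowdown described here is untested.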

Alois Kraus · Accepted Answer · 2014-09-11T22:28:54.407

I have profiled your code and it indeed looks strange. There should be no deviations. It is not an allocation issue (GC is fine and you are allocating only one array per run).

The problem can be reproduced on my Haswell CPU where the parallel version suddenly takes much longer to execute. I have CLR version 4.0.30319.34209 FX452RTMGDR.

On x64 it works fine and has no issues. Only x86 builds seem to suffer from it. I have profiled it with the Windows Performance Toolkit and have found that it looks like a CLR issue where the TPL tries to find the next workitem. Sometimes it happens that the call

System.Threading.Tasks.RangeWorker.FindNewWork(Int64 ByRef, Int64 ByRef)
System.Threading.Tasks.Parallel+<>c__DisplayClassf`1[[System.__Canon, mscorlib]].<ForWorker>b__c()
System.Threading.Tasks.Task.InnerInvoke()
System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)
System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)
System.Threading.Tasks.Task.InnerInvoke()

seems to "hang" in the CLR itself, at clr!COMInterlocked::ExchangeAdd64+0x4d

When I compare the sampled stacks of a slow and a fast run I find:

ntdll.dll!__RtlUserThreadStart  -52%
kernel32.dll!BaseThreadInitThunk  -52%
ntdll.dll!_RtlUserThreadStart  -52% 
clr.dll!Thread::intermediateThreadProc  -48%
clr.dll!ThreadpoolMgr::ExecuteWorkRequest  -48%
clr.dll!ManagedPerAppDomainTPCount::DispatchWorkItem  -48%
clr.dll!ManagedThreadBase_FullTransitionWithAD  -48%
clr.dll!ManagedThreadBase_DispatchOuter  -48%
clr.dll!ManagedThreadBase_DispatchMiddle  -48%
clr.dll!ManagedThreadBase_DispatchInner  -48%
clr.dll!QueueUserWorkItemManagedCallback  -48% 
clr.dll!MethodDescCallSite::CallTargetWorker  -48%
clr.dll!CallDescrWorkerWithHandler  -48%
mscorlib.ni.dll!System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteEntry(Boolean)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.TaskByRef)  -48%
mscorlib.ni.dll!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext System.Threading.ContextCallback System.Object Boolean)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.ExecutionContextCallback(System.Object)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.Execute()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke()  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task+<>c__DisplayClass11.<ExecuteSelfReplicating>b__10(System.Object)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)  -48%
mscorlib.ni.dll!System.Threading.Tasks.Task.InnerInvoke()  -48%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0  -24%
ParllelForSlowDown.exe!ParllelForSlowDown.Program+<>c__DisplayClass1::<ParallelArray>b__0<itself>  -24%
...
clr.dll!COMInterlocked::ExchangeAdd64  +50%

In the dysfunctional case most of the time (50%) is spent in clr.dll!COMInterlocked::ExchangeAdd64. This method was compiled with FPO, since the stacks were broken in the middle, to get more performance. I had thought that such code is not allowed in the Windows code base because it makes profiling harder. Looks like the optimizations have gone too far. When I single-step with the debugger to the actual exchange operation

eax=01c761bf ebx=01c761cf ecx=00000000 edx=00000000 esi=00000000 edi=0274047c
eip=747ca4bd esp=050bf6fc ebp=01c761bf iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
clr!COMInterlocked::ExchangeAdd64+0x49:
747ca4bd f00fc70f        lock cmpxchg8b qword ptr [edi] ds:002b:0274047c=0000000001c761bf

cmpxchg8b compares EDX:EAX=1c761bf with the memory location and, if the values are equal, copies the new value of ECX:EBX=1c761cf to the memory location. When you look at the registers you find that at index 0x1c761bf = 29,843,903 all values are not equal. Looks like there is a race condition (or excessive contention) when incrementing the global loop counter, which surfaces only when your method body does so little work that it pops out.

Congrats, you have found a real bug in the .NET Framework! You should report it on the Connect website to make them aware of this issue.
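To make the compare-and-swap step concrete, here is a minimal C# sketch of a 64-bit add written as a retry loop around Interlocked.CompareExchange. It is not the CLR source for COMInterlocked::ExchangeAdd64. The names CasCounter, AddWithCas and Hammer are hypothetical, and the step of 16 is taken only from the register dump above (0x1c761cf - 0x1c761bf = 0x10).

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class CasCounter
{
    // Hypothetical illustration: adds 'value' to 'location' with a
    // compare-and-swap retry loop and returns the new value.
    public static long AddWithCas(ref long location, long value)
    {
        while (true)
        {
            long seen = Interlocked.Read(ref location);   // atomic 64-bit read
            long wanted = seen + value;

            // The store happens only if 'location' still holds 'seen'.
            // If another thread got in first, loop and try again.
            if (Interlocked.CompareExchange(ref location, wanted, seen) == seen)
            {
                return wanted;
            }
        }
    }

    // Lets several workers bump one shared counter, the way loop workers
    // would claim their next index range.
    public static long Hammer(int workers, int addsPerWorker)
    {
        long counter = 0;
        Parallel.For(0, workers, w =>
        {
            for (int n = 0; n < addsPerWorker; n++)
            {
                AddWithCas(ref counter, 16);
            }
        });
        return counter;
    }
}
```

The more threads hit the same location at once, the more often the comparison fails and the loop has to retry. That is the kind of contention suspected above.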

To be absolutely sure that it is not another issue you can try the parallel loop with an empty delegate:

    System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
    s.Start();
    Parallel.For(0,_len, i => {});
    s.Stop();
    System.Console.WriteLine(s.ElapsedMilliseconds.ToString());

This also reproduces the issue. It is therefore definitely a CLR issue. Normally we at SO tell people not to try to write lock-free code, since it is very hard to get right. But even the smartest guys at MS seem to get it wrong sometimes ....
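Since the slowdown only shows up after a few runs and only in x86 builds, it helps to run the empty loop several times and log the process bitness next to each timing. A minimal sketch follows; RepeatTimer and TimeEmptyLoop are hypothetical names, and the run count and iteration count are parameters you choose.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class RepeatTimer
{
    // Hypothetical helper: times 'runs' executions of an empty Parallel.For
    // and returns the elapsed milliseconds of each run.
    public static long[] TimeEmptyLoop(int runs, int iterations)
    {
        long[] elapsed = new long[runs];
        int bits = Environment.Is64BitProcess ? 64 : 32;

        for (int r = 0; r < runs; r++)
        {
            Stopwatch s = Stopwatch.StartNew();
            Parallel.For(0, iterations, i => { });
            s.Stop();

            elapsed[r] = s.ElapsedMilliseconds;
            Console.WriteLine("{0}-bit process, run {1}: {2} ms", bits, r + 1, elapsed[r]);
        }
        return elapsed;
    }
}
```

For example, RepeatTimer.TimeEmptyLoop(10, 1024 * 1024 * 1024) matches the loop length used in the question. Build it once as x86 and once as x64 to compare.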

Update: I have opened a bug report here: https://connect.microsoft.com/VisualStudio/feedbackdetail/view/969699/parallel-for-causes-random-slowdowns-in-x86-processes

+1 Great find. Just want to confirm code ran fine with x64 but once I switched to x86 I could reproduce. — Zer0, Sep 11 '14 at 22:16
I have to mark this as correct, even though I don't have the expertise to be certain. I was able to duplicate the issue using the empty delegate too. The first three times everything ran ok; after that, it went really bad. — jrv, Sep 12 '14 at 19:43
Windows Performance Toolkit: http://geekswithblogs.net/akraus1/archive/2013/08/03/153594.aspx and a little more infos here: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx and — Alois Kraus, Sep 12 '14 at 20:20

Matt · Answer 2 · 2014-09-11T20:40:19.357

Based on your program, I wrote a program to reproduce the problem. I think it is related to the .NET large object heap and how Parallel.For is implemented.

    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            // Swap the commented call to compare the two versions.
            for (int i = 0; i < 10; i++)
                //ParallelArray();
                SingleFor();
        }

        const int _meg = 1024 * 1024;
        const int _len = 1024 * _meg;

        // Parallel version: every iteration writes into the shared array.
        static void ParallelArray()
        {
            int[] stuff = new int[_meg];
            System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();
            s.Start();
            Parallel.For(0,
                _len,
                i =>
                {
                    stuff[i % _meg] = i;
                }
                );
            s.Stop();

            System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
        }

        // Single-threaded baseline doing the same stores.
        static void SingleFor()
        {
            int[] stuff = new int[_meg];
            System.Diagnostics.Stopwatch s = new System.Diagnostics.Stopwatch();

            s.Start();

            for (int i = 0; i < _len; i++)
            {
                stuff[i % _meg] = i;
            }

            s.Stop();

            System.Console.WriteLine(s.ElapsedMilliseconds.ToString());
        }
    }

I compiled with VS2013 in release mode and ran it without a debugger. If the function ParallelArray() is called in the main loop, the result I got is:

[output not preserved]

If the function SingleFor() is called, the result is:

[output not preserved]

I went through some documentation on MSDN about Parallel.For, and this caught my attention: "Writing to shared variables. If the body of a loop writes to a shared variable, there is a loop body dependency. This is a common case that occurs when you are aggregating values." In the Parallel.For loop, we're using a shared variable, stuff.

The article "Parallel Aggregation" explains how .NET deals with this case: "The Parallel Aggregation pattern uses unshared, local variables that are merged at the end of the computation to give the final result. Using unshared, local variables for partial, locally calculated results is how the steps of a loop can become independent of each other. Parallel aggregation demonstrates the principle that it's usually better to make changes to your algorithm than to add synchronization primitives to an existing algorithm." This means it creates local copies of data instead of using locks to guard the shared variable, and at the end these 10 partitions need to be combined together; this brings performance penalties.

When I ran the test program with Parallel.For, I used Process Explorer to count the threads. It has 11 threads, so Parallel.For creates 10 partitions for the loop, which means it creates 10 local copies with size 100K; these objects will be placed on the Large Object Heap.
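For reference, the aggregation pattern quoted above corresponds to the Parallel.For overload that takes localInit and localFinally delegates. A minimal sketch, with the hypothetical names AggregationSketch and SumRange: each worker accumulates into its own local value, and the partial sums are merged once per worker at the end. The loop in the question calls the simpler overload, which has no such local state.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class AggregationSketch
{
    // Hypothetical example: sums 0 .. n-1 using per-worker subtotals.
    public static long SumRange(int n)
    {
        long total = 0;

        Parallel.For<long>(0, n,
            () => 0L,                              // localInit: each worker starts at 0
            (i, state, local) => local + i,        // body: touches only the local subtotal
            local => Interlocked.Add(ref total, local)); // localFinally: merge once per worker

        return total;
    }
}
```

The only synchronized operation is the final Interlocked.Add, which runs once per worker rather than once per iteration.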

There are two different types of heaps in .NET: the Small Object Heap (SOH) and the Large Object Heap (LOH). If the object size is larger than 85,000 bytes, it goes on the LOH. When doing GC, .NET treats the two heaps differently.

As it is explained in this blog post, "No More Memory Fragmentation on the .NET Large Object Heap": "One of the key differences between the heaps is that the SOH compacts memory and hence reduces the chance of memory fragmentation dramatically while the LOH does not employ compaction. As a result, excessive usage of the LOH may result in memory fragmentation that can become severe enough to cause problems in applications."

As you're continuously allocating big arrays with size > 85,000 bytes, once the LOH becomes fragmented, performance goes down.

If you're using .NET 4.5.1, you can set GCSettings.LargeObjectHeapCompactionMode to CompactOnce to make the LOH compact on the next GC.Collect().
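A quick way to see which heap an array landed on is GC.GetGeneration: objects on the LOH are reported as belonging to the highest generation as soon as they are allocated. The sketch below uses the hypothetical names HeapProbe and GenerationOfNewArray.

```csharp
using System;

public static class HeapProbe
{
    // Hypothetical helper: allocates a byte array of the given size and
    // returns the generation the GC reports for it right away.
    public static int GenerationOfNewArray(int byteCount)
    {
        byte[] buffer = new byte[byteCount];
        return GC.GetGeneration(buffer);
    }
}
```

Comparing HeapProbe.GenerationOfNewArray(4 * 1024 * 1024) with GC.MaxGeneration shows whether an array the size of the question's int[_meg] (about 4 MB) is treated as a large object.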
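A minimal sketch of that setting, assuming .NET 4.5.1 or later. LohCompaction and CompactOnce are hypothetical names; the GCSettings property and the enum live in System.Runtime.

```csharp
using System;
using System.Runtime;

public static class LohCompaction
{
    // Hypothetical helper: requests a one-time LOH compaction, forces a
    // full blocking collection, and returns the mode that is set afterwards.
    public static GCLargeObjectHeapCompactionMode CompactOnce()
    {
        GCSettings.LargeObjectHeapCompactionMode =
            GCLargeObjectHeapCompactionMode.CompactOnce;

        // The compaction happens during the next blocking gen 2 collection.
        GC.Collect();

        return GCSettings.LargeObjectHeapCompactionMode;
    }
}
```

Per the documentation the setting is one-shot and falls back to Default after the compacting collection, so it would have to be set again before each run if you wanted to test its effect here.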

Another good article to understand this problem is: Large Object Heap Uncovered

Further investigation is needed, but I don't have time now.

That's a thought I had, but I don't see how it would apply. Each run performs a single allocation at the start and outside the timers and the Parallel.For(). Either the allocation succeeds (possibly after a GC) or it fails. If it fails, bad things happen (i.e. an out-of-memory exception or similar). If it succeeds, the timer and Parallel.For() start and performance should be the same. — jrv, Sep 11 '14 at 18:04
I also think if that's the problem it would affect the version with a simple `for` loop too, not only the `Parallel.For` version. — Wagner DosAnjos, Sep 11 '14 at 18:09
@jrv, then I suggest using the CLR Profiler to check whether the LOH is fragmented: http://clrprofiler.codeplex.com/ — Matt, Sep 11 '14 at 18:25
