
I'm having trouble understanding why my "concurrent" implementation of the function from http://www.codeproject.com/Tips/447938/High-performance-Csharp-byte-array-to-hex-string-t shows only a ~20% performance gain.

For convenience here is the code from that site:

static readonly int[] toHexTable = new int[] {
    3145776, 3211312, 3276848, 3342384, 3407920, 3473456, 3538992, 3604528, 3670064, 3735600,
    4259888, 4325424, 4390960, 4456496, 4522032, 4587568, 3145777, 3211313, 3276849, 3342385,
    3407921, 3473457, 3538993, 3604529, 3670065, 3735601, 4259889, 4325425, 4390961, 4456497,
    4522033, 4587569, 3145778, 3211314, 3276850, 3342386, 3407922, 3473458, 3538994, 3604530,
    3670066, 3735602, 4259890, 4325426, 4390962, 4456498, 4522034, 4587570, 3145779, 3211315,
    3276851, 3342387, 3407923, 3473459, 3538995, 3604531, 3670067, 3735603, 4259891, 4325427,
    4390963, 4456499, 4522035, 4587571, 3145780, 3211316, 3276852, 3342388, 3407924, 3473460,
    3538996, 3604532, 3670068, 3735604, 4259892, 4325428, 4390964, 4456500, 4522036, 4587572,
    3145781, 3211317, 3276853, 3342389, 3407925, 3473461, 3538997, 3604533, 3670069, 3735605,
    4259893, 4325429, 4390965, 4456501, 4522037, 4587573, 3145782, 3211318, 3276854, 3342390,
    3407926, 3473462, 3538998, 3604534, 3670070, 3735606, 4259894, 4325430, 4390966, 4456502,
    4522038, 4587574, 3145783, 3211319, 3276855, 3342391, 3407927, 3473463, 3538999, 3604535,
    3670071, 3735607, 4259895, 4325431, 4390967, 4456503, 4522039, 4587575, 3145784, 3211320,
    3276856, 3342392, 3407928, 3473464, 3539000, 3604536, 3670072, 3735608, 4259896, 4325432,
    4390968, 4456504, 4522040, 4587576, 3145785, 3211321, 3276857, 3342393, 3407929, 3473465,
    3539001, 3604537, 3670073, 3735609, 4259897, 4325433, 4390969, 4456505, 4522041, 4587577,
    3145793, 3211329, 3276865, 3342401, 3407937, 3473473, 3539009, 3604545, 3670081, 3735617,
    4259905, 4325441, 4390977, 4456513, 4522049, 4587585, 3145794, 3211330, 3276866, 3342402,
    3407938, 3473474, 3539010, 3604546, 3670082, 3735618, 4259906, 4325442, 4390978, 4456514,
    4522050, 4587586, 3145795, 3211331, 3276867, 3342403, 3407939, 3473475, 3539011, 3604547,
    3670083, 3735619, 4259907, 4325443, 4390979, 4456515, 4522051, 4587587, 3145796, 3211332,
    3276868, 3342404, 3407940, 3473476, 3539012, 3604548, 3670084, 3735620, 4259908, 4325444,
    4390980, 4456516, 4522052, 4587588, 3145797, 3211333, 3276869, 3342405, 3407941, 3473477,
    3539013, 3604549, 3670085, 3735621, 4259909, 4325445, 4390981, 4456517, 4522053, 4587589,
    3145798, 3211334, 3276870, 3342406, 3407942, 3473478, 3539014, 3604550, 3670086, 3735622,
    4259910, 4325446, 4390982, 4456518, 4522054, 4587590
};
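(Aside: the magic numbers in `toHexTable` are just the two uppercase hex digits of each byte value packed into one int as little-endian UTF-16 code units. A sketch of how such a table could be generated, assuming that layout; `BuildHexTable` is a hypothetical helper, not part of the original article:)

```csharp
// Builds a 256-entry table equivalent to toHexTable: the high nibble's
// character sits in the low 16 bits, the low nibble's character in the
// high 16 bits, so writing one int emits two chars in order.
static int[] BuildHexTable()
{
    const string digits = "0123456789ABCDEF";
    var table = new int[256];
    for (int b = 0; b < 256; b++)
        table[b] = digits[b >> 4] | (digits[b & 0xF] << 16);
    return table;
}
```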

public static unsafe string ToHex1(byte[] source)
{
    fixed (int* hexRef = toHexTable)
    fixed (byte* sourceRef = source)
    {
        byte* s = sourceRef;
        int resultLen = (source.Length << 1);

        var result = new string(' ', resultLen);
        fixed (char* resultRef = result)
        {
            int* pair = (int*)resultRef;

            while (*pair != 0)
                *pair++ = hexRef[*s++];
            return result;
        }
    }
}
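(For sanity-checking the pointer-based version, a plain managed equivalent is handy; `ToHexSimple` is a hypothetical helper, not from the article, producing the same uppercase output:)

```csharp
// Straightforward managed reference implementation: two chars per byte,
// no unsafe code. Useful as an oracle when testing the optimized version.
static string ToHexSimple(byte[] source)
{
    const string digits = "0123456789ABCDEF";
    var chars = new char[source.Length * 2];
    for (int i = 0; i < source.Length; i++)
    {
        chars[2 * i] = digits[source[i] >> 4];
        chars[2 * i + 1] = digits[source[i] & 0xF];
    }
    return new string(chars);
}
```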

Here are my "improvements":

public static unsafe string ToHex1p(byte[] source)
{
    var chunks = Environment.ProcessorCount;
    var n = (int)Math.Ceiling(source.Length / (double)chunks);

    int resultLen = (source.Length << 1);

    var result = new string(' ', resultLen);

    Parallel.For(0, chunks, k =>
    {
        var l = Math.Min(source.Length, (k + 1) * n);
        fixed (char* resultRef = result) fixed (byte* sourceRef = source)
        {
            int from = n * k;
            int* pair = (int*)resultRef + from;
            // Compare pointers directly: casting a pointer to int truncates
            // the address on 64-bit and can make this loop run off the end.
            int* to = (int*)resultRef + l;
            byte* s = sourceRef + from;
            while (pair < to)
                *pair++ = toHexTable[*s++];
        }
    });

    return result;
}
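(An alternative to computing chunk boundaries by hand is to let a range partitioner pick them. A sketch of that approach; `ToHexPartitioned` is a hypothetical variant that takes the lookup table as a parameter rather than reading the static field:)

```csharp
// Uses Partitioner.Create to split the input into ranges, so each
// Parallel.ForEach body converts one contiguous chunk of bytes.
public static unsafe string ToHexPartitioned(byte[] source, int[] hexTable)
{
    var result = new string(' ', source.Length << 1);
    Parallel.ForEach(Partitioner.Create(0, source.Length), range =>
    {
        fixed (char* resultRef = result)
        fixed (byte* sourceRef = source)
        {
            int* pair = (int*)resultRef + range.Item1;   // one int = two chars = one byte
            byte* s = sourceRef + range.Item1;
            for (int i = range.Item1; i < range.Item2; i++)
                *pair++ = hexTable[*s++];
        }
    });
    return result;
}
```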


**Edit 1:** This is how I time the functions:

var n = 0xff;
var s = new System.Diagnostics.Stopwatch();
var d = Enumerable.Repeat<byte>(0xce, (int)Math.Pow(2, 23)).ToArray();

s.Start();
for (var i = 0; i < n; ++i)
{
    Binary.ToHex1(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);

s.Restart();
for (var i = 0; i < n; ++i)
{
    Binary.ToHex1p(d);
}
Console.WriteLine(s.ElapsedMilliseconds / (double)n);
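(As the accepted answer points out, a harness like the one above lets JIT warm-up and GC work from one test bleed into the next. A sketch of a fairer harness; `Measure` is a hypothetical helper, not part of the question:)

```csharp
// Runs the function once to warm up the JIT, forces a full collection so
// garbage from earlier tests doesn't distort this one, then averages.
static double Measure(Func<byte[], string> f, byte[] data, int iterations)
{
    f(data); // warm-up: JIT-compile before the timed loop
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect();
    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < iterations; i++)
        f(data);
    return sw.ElapsedMilliseconds / (double)iterations;
}
```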
ristiisa
  • Why is this inside and not outside the For loop ? fixed (char* resultRef = result) fixed (byte* sourceRef = source) - see http://stackoverflow.com/questions/8497018/what-is-the-overhead-of-c-sharp-fixed-statement-on-a-managed-unsafe-struct-conta – tolanj Nov 25 '14 at 23:54
  • @tolanj: I can't answer for the OP, but I suspect it's because it's a lot easier to put the `fixed` inside the anonymous method than outside, due to the rules about capturing pointers. For grins, I went ahead and tested it with the `fixed` on the outside, and found it didn't really matter anyway. Note that in this case, the `fixed` statements are executed only once per thread; the time-consuming loop _is_ inside the `fixed` statements. – Peter Duniho Nov 26 '14 at 02:27
  • @PeterDuniho: Agree on the evaluation order. On the debug-vs-release, your experience is *much* different than mine. That's especially true when you compare running with the debugger attached and not attached. I've seen wildly different relative timings. That is, algorithm A is significantly faster than B with debug, and significantly slower than B under release. – Jim Mischel Nov 26 '14 at 03:21
  • @JimMischel: I've never seen that kind of disparity (significant performance difference in both tests and debug vs release is exactly switched) when the _only_ difference between the algorithms is the degree of parallelism. I agree it's quite common for other algorithmic differences. I would be interested to see an example of the former. – Peter Duniho Nov 26 '14 at 03:29
  • You're doing too much work per iteration in your `Parallel.For`; you're likely using at most 1-2 threads as the TPL "warms up". Go read the free eBook from Microsoft, "[Patterns of parallel programming](http://www.microsoft.com/en-us/download/details.aspx?id=19222)"; specifically, you fell into the anti-pattern "TOO FINE-GRAINED, TOO COARSE GRAINED" described on page 28 of the C# version. Your loop is only 4-8 in size, way too coarse-grained to get any benefits from TPL. – Scott Chamberlain Nov 26 '14 at 06:24
  • @ScottChamberlain I would count 20% as a significant benefit – ristiisa Nov 26 '14 at 07:35
  • @JimMischel It all depends on the code itself, but based on my experience it is usually better to run without the debugger attached. For instance, if I run my test (release x86 build, debugger attached) on my 3930k instead of my 3740QM, the sequential version is twice as slow as the parallel one. – ristiisa Nov 26 '14 at 07:49
  • I'm surprised no one has asked...What version of .net are you using? 4.0 has a terrible implementation of TPL that has about 100% overhead in most of the benchmarks I've seen. – Aron Nov 26 '14 at 08:44
  • @Aron I'm using .net 4.5 – ristiisa Nov 26 '14 at 09:07
  • Multiple issues with the implementation: All tasks modify the same global variable (`result`) thus requiring synchronization, trying to guess the number of Tasks in order to modify the global variable. Instead of a global variable, use the `ForEach` [override](http://msdn.microsoft.com/en-us/library/dd991486(v=vs.110).aspx) that accepts final actions to collect the results of each Task into a single variable. Or use PLINQ to generate the partial results with Select, then combine them with Aggregate – Panagiotis Kanavos Nov 26 '14 at 11:07
  • @PanagiotisKanavos the memory is accessed directly, there is no "variable" to share – ristiisa Nov 26 '14 at 11:17
  • @tolanj Unfortunately I do not know exactly how expensive fixing is, but as Peter Duniho already mentioned, the placement of the keyword had no noticeable impact. So I left it in the thread because the code looked cleaner :) – ristiisa Nov 26 '14 at 11:21
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackoverflow.com/rooms/65681/discussion-on-question-by-ristiisa-why-parallel-for-gives-only-so-little-gain-fo). – Taryn Nov 26 '14 at 11:23
  • @ristiisa a variable is a pointer to a memory location. You *are* trying to access the same memory pages, thus requiring synchronization at the CPU level and invalidating each CPU's cache. Multiple processors can't modify the same memory blocks at the same time without some sort of synchronization. That's why each task shouldn't share state or data with other tasks. – Panagiotis Kanavos Nov 26 '14 at 11:25
  • 1
    @PanagiotisKanavos: Whereas it's true that memory synchronization *could be* an issue, it's unlikely in this case. The `result` variable is essentially an array that's 8 megabytes in size. So with four threads, each thread is working on its own 2 megabyte chunk. Cache line size is typically 64 bytes, so it's highly unlikely that there is any kind of memory contention going on here between CPUs. They're simply not modifying the same cache lines concurrently. In the worst case, the only contention would be at the edges (the 64 bytes surrounding the 2 MB borders). – Jim Mischel Nov 26 '14 at 13:00

2 Answers


After playing around with your example some, I conclude that the bulk of the disparity in timing you're seeing is due to GC overhead, with initialization overhead in both scenarios being high enough to make performance differences relatively unimportant even once the GC overhead is removed from the test.

When I switch the order of the tests, the parallel one winds up faster than the non-parallel one. This is the first sign that the test isn't fair. :)

When I changed the tests so that I called GC.Collect() after each test, to ensure the GC was quiet during the subsequent one, the parallel version consistently came out ahead. But only barely so; the start-up time for each thread was in all cases over half the total duration of each parallel test.

As part of my testing, I modified the code so that it tracked the actual time spent in each thread of the For() version. Here, I found that the time spent in this code was about what I'd expect based on the non-parallel version (i.e. reasonably close to the time divided by the thread count).

(Of course, one could obtain this information using a profiler too).
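(The per-thread tracking described above could look something like the following sketch; the variable names are hypothetical and the conversion work itself is elided:)

```csharp
// Record, per thread: when it actually started relative to the overall
// test, and how long it spent inside the conversion loop.
int threads = 2;
var overall = System.Diagnostics.Stopwatch.StartNew();
var starts = new TimeSpan[threads];
var durations = new TimeSpan[threads];
Parallel.For(0, threads, k =>
{
    starts[k] = overall.Elapsed;   // start-up latency shows up here
    var sw = System.Diagnostics.Stopwatch.StartNew();
    // ... this thread's share of the conversion loop goes here ...
    durations[k] = sw.Elapsed;     // time actually spent converting
});
for (int k = 0; k < threads; k++)
    Console.WriteLine("  Thread #{0}: start: {1}, duration: {2}", k, starts[k], durations[k]);
```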

Here's the output for the tests I ran with GC.Collect(). For the parallel test, this is also showing the start (relative to the overall test start time) and duration times for each thread.

Running the non-parallel version first, parallel version second:

Single-thread version: 00:00:00.6726813
Parallel version: 00:00:00.6270247
  Thread #0: start: 00:00:00.3343985, duration: 00:00:00.2925963
  Thread #1: start: 00:00:00.3345640, duration: 00:00:00.2805527

Single-thread version: 00:00:00.7027335
Parallel version: 00:00:00.5610246
  Thread #0: start: 00:00:00.3305695, duration: 00:00:00.2304486
  Thread #1: start: 00:00:00.3305857, duration: 00:00:00.2300950

Single-thread version: 00:00:00.6609645
Parallel version: 00:00:00.6143675
  Thread #0: start: 00:00:00.3391491, duration: 00:00:00.2750529
  Thread #1: start: 00:00:00.3391560, duration: 00:00:00.2705631

Single-thread version: 00:00:00.6655265
Parallel version: 00:00:00.6246624
  Thread #0: start: 00:00:00.3227595, duration: 00:00:00.2924611
  Thread #1: start: 00:00:00.3227831, duration: 00:00:00.3018066

Single-thread version: 00:00:00.6815009
Parallel version: 00:00:00.5707794
  Thread #0: start: 00:00:00.3227074, duration: 00:00:00.2480668
  Thread #1: start: 00:00:00.3227330, duration: 00:00:00.2478351

Running the parallel version first, non-parallel second:

Parallel version: 00:00:00.5807343
  Thread #0: start: 00:00:00.3397320, duration: 00:00:00.2409767
  Thread #1: start: 00:00:00.3398103, duration: 00:00:00.2408334
Single-thread version: 00:00:00.6974992

Parallel version: 00:00:00.5801044
  Thread #0: start: 00:00:00.3305571, duration: 00:00:00.2495409
  Thread #1: start: 00:00:00.3305746, duration: 00:00:00.2492993
Single-thread version: 00:00:00.7442493

Parallel version: 00:00:00.5845514
  Thread #0: start: 00:00:00.3454512, duration: 00:00:00.2352147
  Thread #1: start: 00:00:00.3454756, duration: 00:00:00.2389522
Single-thread version: 00:00:00.6542540

Parallel version: 00:00:00.5909125
  Thread #0: start: 00:00:00.3356177, duration: 00:00:00.2550365
  Thread #1: start: 00:00:00.3356250, duration: 00:00:00.2552392
Single-thread version: 00:00:00.7609139

Parallel version: 00:00:00.5777678
  Thread #0: start: 00:00:00.3440084, duration: 00:00:00.2337504
  Thread #1: start: 00:00:00.3440323, duration: 00:00:00.2329294
Single-thread version: 00:00:00.6596119

Lessons learned:

  • Performance testing is tricky, especially in a managed environment. Things like garbage collection and just-in-time compilation make it hard to compare apples to apples.
  • The actual computational cost of converting bytes to characters is completely inconsequential compared to anything else a program might spend its time doing (such as preparing and invoking threads). This particular algorithm does not appear worth parallelizing; even though you do get a consistent improvement in speed, it's pretty marginal due to all the overhead around the actual computation.

One final note: another source of error in these kinds of tests is Intel's Hyperthreading. Or rather, if you test while using the Hyperthread-enabled CPU count, you'll get misleading results. For example, I tested this on my Intel i5-based laptop, which reports having 4 cores. But running four threads won't come close to a 4x speed-up over a non-parallel implementation, while running two threads will be close to a 2x speed-up (for the right problem). That's why even though my computer reports 4 CPUs, I tested with only 2 threads.

Here, there's so much other misleading overhead in this test that I don't think Hyperthreading makes a big difference. But it's something to watch out for.
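(Capping the thread count, as described above, can be done with `ParallelOptions` rather than by changing the loop bounds. A minimal sketch, with the per-chunk work replaced by a stand-in:)

```csharp
// Limit the TPL to 2 worker threads (e.g. the physical core count)
// regardless of what Environment.ProcessorCount reports.
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
var hits = new int[4];
Parallel.For(0, 4, options, k =>
{
    hits[k] = 1; // stand-in for the per-chunk conversion work
});
```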

Peter Duniho
  • Good job. It's curious that a `GC.Collect` didn't even things out. It'd be interesting to see if your results are different with `gcServer enabled=true` in the app.config. – Jim Mischel Nov 26 '14 at 03:38
  • What do you mean by "even things out"? Using `GC.Collect()` did actually allow the parallel version to consistently win. It just doesn't win by much due to the non-parallel overhead. – Peter Duniho Nov 26 '14 at 03:39
  • Then I don't understand your results. In the first case (parallel runs second), it looks like the parallel version averages 25 ms. In the second case (parallel runs first), it looks like the parallel version runs in about 13 ms. What am I misunderstanding? – Jim Mischel Nov 26 '14 at 04:05
  • I think I did not adequately explain the format of the output. The overall test times are the lines that are times only (i.e. no other text). The lines with the words "Start" and "Duration" are the per-thread times _for the parallel version only_. Sorry if I made things more confusing than they needed to be. – Peter Duniho Nov 26 '14 at 04:08
  • On my 3930k I got the same results each time 13ms and 18ms for 2^23 data size, parallel and sequential respectively. Did not matter which one I ran first or if I called GC.Collect between them or if I ran them individually. Actually with 3930k I don't even see the problem... – ristiisa Nov 26 '14 at 08:08
  • @JimMischel with `gcServer enabled=true` it was roughly 8..10% slower for both parallel and sequential – ristiisa Nov 26 '14 at 08:13
  • So, essentially to summarize, this is a perfect example of where there just isn't a point to running in parallel because the activity doesn't necessitate it. – theMayer Nov 27 '14 at 15:44
  • @theMayer: yes, exactly. This string conversion routine is already extremely fast, to the point of 250MB of encoding taking around the same amount of time as the overhead of starting up parallelism. It's hard to imagine having to encode even 250MB as hexadecimal, and you'd have to have a lot more data than that to encode before it was worthwhile to run the processing in parallel. Heck, the single-threaded encoding routine is so fast that it's arguable that even optimizing _it_ was overkill. – Peter Duniho Nov 27 '14 at 18:11

I read in the commentary of the first answer (which, by the way, is more informative than the question and comments altogether) that the execution time for these tests is somewhere on the order of 25ms at a maximum.

There are a lot of things to say about this, but the first is "What a waste of good programmer time!"

This is a clear case of over-optimizing. On my first glance through the code, I thought "why in the world would anyone want to parallelize this?" You're doing bitwise operations and table lookups, which are extremely fast. There is not enough of a performance penalty in the first place to justify parallelizing.

Now, onto your test discrepancies. The TPL is a complex library, and parallelism is an order of magnitude more complex than single-threading. As a result, there are a number of factors that come into play when you run a test, several of which (but by no means all) are mentioned in various comments. Every factor affects the results in an unpredictable way, and you can think of each factor as adding a little bit of variation. In the end, your test results don't have enough power to distinguish between the two, which is yet another indication of over-optimization.

As programmers, we need to remember as a general rule that our time is extremely expensive, especially when compared to computer time. Opportunities to achieve true, value-added performance gain are rare when programming, and typically center around very expensive processing or cases where you need to request from remote servers, etc. There are definitely times when it is needed, but this is not one of them.

theMayer