
While writing a solution for a coding problem I discovered an interesting behavior of my LINQ statements. I had two scenarios:

First:

arr.Select(x => x + 5).OrderBy(x => x)

Second:

arr.OrderBy(x => x).Select(x => x + 5)

After a little bit of testing with System.Diagnostics.Stopwatch I got the following results for an integer array of length 100_000.

For the first approach:

00:00:00.0000152

For the second:

00:00:00.0073650

Now I'm interested in why it takes more time if I do the ordering first. I couldn't find anything on Google, so I thought about it myself.

I ended up with two ideas:
1. The second scenario has to convert to IOrderedEnumerable and then back to IEnumerable while the first scenario only has to convert to IOrderedEnumerable and not back.
2. You end up having 2 loops. The first for sorting and the second for the selecting while approach 1 does everything in 1 loop.

So my question is: why does it take so much more time to do the ordering before the select?

  • Is this linq2entities or linq2objects or what? The optimizations and the performance highly depend on the underlying provider. – MakePeaceGreatAgain Jun 06 '19 at 08:24
  • I don't know what you mean by that. I just run the query on a normal integer array. I don't have any access to a database or something like that. – niklasstoffers Jun 06 '19 at 08:27
  • By the way, both of your assumptions are wrong. 1: an `IOrderedEnumerable` **is** an `IEnumerable`, so the conversion does not cost anything. 2: the query does not determine how many loops are actually **executed**. – MakePeaceGreatAgain Jun 06 '19 at 08:27
  • ok that's right haven't thought about that – niklasstoffers Jun 06 '19 at 08:29
  • Just a sidenote: when you benchmark two sets of code you should try working with larger data sets, because the margin of error is very small and there could be other factors in play. Try making them last at least a few seconds. – Mark Cilia Vincenti Jun 06 '19 at 08:29
  • Proper benchmarking is not an easy thing. You should be able to trust your measurements. Use this: https://benchmarkdotnet.org/articles/overview.html – mtkachenko Jun 06 '19 at 08:29
  • note: `arr.Select(x => x + 5).OrderBy(x => x)` by itself **doesn't actually do anything** (except build the map of what to do later) - nor does `arr.OrderBy(x => x).Select(x => x + 5)`. LINQ in this case is *deferred execution* - unless you show how / if you're actually enumerating that... – Marc Gravell Jun 06 '19 at 08:32
  • Try a larger data set and enumerate as @MarcGravell said. Basically change it to `var output = arr.Select(x => x + 5).OrderBy(x => x).ToList();` – Mark Cilia Vincenti Jun 06 '19 at 08:34
  • @MarcGravell You're right. For the benchmarking I had a `.ToArray()` at the end of the query – niklasstoffers Jun 06 '19 at 08:35
  • @niklasstoffers well, if you *actually care about the perf here*, you probably wouldn't use LINQ *anyway*: `var x = new int[arr.Length]; for (int i = 0 ; i < arr.Length; i++) x[i] = arr[i] + 5; Array.Sort(x);` - I would expect that to be noticeably faster. Perhaps also thinking about `ArrayPool` etc if it is a temp array. – Marc Gravell Jun 06 '19 at 08:38
  • @MarcGravell I'm actually not 100% sure about that. As far as I know the performance of OrderBy is better when dealing with large arrays. – niklasstoffers Jun 06 '19 at 08:47
  • @niklasstoffers compared to an in-place sort? as they say: [citation needed] - and don't forget the sheer lambda overhead is very significant here – Marc Gravell Jun 06 '19 at 08:53
  • @niklasstoffers using a modified version of Dmitry's test rig: OrderBySelect: 223, 226, 227, 228, 228, 230 (average 227); SelectOrderBy: 217, 219, 219, 220, 222, 223 (average 220); InPlaceSort: 57, 57, 57, 57, 57, 57 (average 57) - so, about 4 times faster – Marc Gravell Jun 06 '19 at 09:01
  • @MarcGravell Ok you're right. Haven't thought of the in-place thing. – niklasstoffers Jun 06 '19 at 09:16
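
To illustrate the deferred-execution point raised in the comments above, here is a minimal sketch. The class name `DeferredExecutionDemo` and the `Calls` counter are made up for this example. It counts how often the `Select` lambda runs before and after the query is enumerated with `ToArray()`.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the original question.
public static class DeferredExecutionDemo
{
    // Number of times the Select lambda has been invoked.
    public static int Calls;

    public static (int beforeEnumeration, int afterEnumeration) Run()
    {
        Calls = 0;
        int[] arr = { 3, 1, 2 };

        // Building the query only records what to do later; no lambda runs yet.
        var query = arr.Select(x => { Calls++; return x + 5; }).OrderBy(x => x);
        int before = Calls;

        // ToArray() enumerates the query; this is where the work happens.
        int[] result = query.ToArray();
        int after = Calls;

        GC.KeepAlive(result);
        return (before, after);
    }
}
```

A stopwatch wrapped around only the line that builds `query` therefore measures almost nothing; the enumeration has to be inside the timed region.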

2 Answers


Let's have a look at the sequences:

private static void UnderTestOrderBySelect(int[] arr) {
  var query = arr.OrderBy(x => x).Select(x => x + 5); 

  foreach (var item in query)
    ;
}

private static void UnderTestSelectOrderBy(int[] arr) {
  var query = arr.Select(x => x + 5).OrderBy(x => x);  

  foreach (var item in query)
    ;
}

// See Marc Gravell's comment; let's compare LINQ with an in-place Array.Sort
private static void UnderTestInPlaceSort(int[] arr) {
  var tmp = arr;
  var x = new int[tmp.Length];

  for (int i = 0; i < tmp.Length; i++)
    x[i] = tmp[i] + 5;

  Array.Sort(x);
}

To benchmark, let's run each method 10 times and average the 6 middle results:

private static string Benchmark(Action<int[]> methodUnderTest) {
  List<long> results = new List<long>();

  int n = 10;

  for (int i = 0; i < n; ++i) {
    Random random = new Random(1);

    int[] arr = Enumerable
      .Range(0, 10000000)
      .Select(x => random.Next(1000000000))
      .ToArray();

    Stopwatch sw = new Stopwatch();

    sw.Start();

    methodUnderTest(arr);

    sw.Stop();

    results.Add(sw.ElapsedMilliseconds);
  }

  var valid = results
    .OrderBy(x => x)
    .Skip(2)                  // drop the 2 fastest runs
    .Take(results.Count - 4)  // keep the middle runs, dropping the 2 slowest
    .ToArray();

  return $"{string.Join(", ", valid)} average : {(long) (valid.Average() + 0.5)}";
}

Time to run and have a look at the results:

  string report = string.Join(Environment.NewLine,
    $"OrderBy + Select: {Benchmark(UnderTestOrderBySelect)}",
    $"Select + OrderBy: {Benchmark(UnderTestSelectOrderBy)}",
    $"Inplace Sort:     {Benchmark(UnderTestInPlaceSort)}");

  Console.WriteLine(report);

Outcome (Core i7 3.8 GHz, .NET 4.8, x64), times in milliseconds:

OrderBy + Select: 4869, 4870, 4872, 4874, 4878, 4895 average : 4876
Select + OrderBy: 4763, 4763, 4793, 4802, 4827, 4849 average : 4800
Inplace Sort:     888, 889, 890, 893, 896, 904 average : 893

I don't see any significant difference: Select + OrderBy seems to be slightly more efficient (about a 2% gain) than OrderBy + Select. The in-place sort, however, performs far better (about 5 times faster) than either LINQ query.

Dmitry Bychenko
  • Should the array initialization be during the test time? – Amy B Jun 06 '19 at 08:55
  • @Amy B: Nice catch! Thank you! I've changed the experiment – Dmitry Bychenko Jun 06 '19 at 09:00
  • can I propose an addition for context: `private static void InPlaceSort() { var tmp = arr; var x = new int[tmp.Length]; for (int i = 0; i < tmp.Length; i++) x[i] = tmp[i] + 5; Array.Sort(x); }` - i.e. no LINQ – Marc Gravell Jun 06 '19 at 09:05
  • @Marc Gravell: As expected, the in-place sort (`Array.Sort`) has far better performance (`5` times faster) – Dmitry Bychenko Jun 06 '19 at 09:16
  • @DmitryBychenko thanks - that matches my results; I'm always slightly depressed when people insist on using LINQ on everything : fine for readability, but it isn't a performance API – Marc Gravell Jun 06 '19 at 09:20

Depending on which LINQ provider you use, the query may be optimized. E.g. if you were using some kind of database, chances are high that your provider would create the exact same query for both statements, similar to this one:

select myColumn from myTable order by myColumn;

Thus performance should be identical, no matter whether you order first or select first in LINQ.

As this does not seem to happen here, you are probably using LINQ to Objects, which does not rewrite or reorder the query for you. So the order of your statements may have an effect, in particular if you have some kind of filter using Where that removes many objects, so that later statements don't operate on the entire collection.
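
As a rough illustration of the `Where` remark, the sketch below counts how many keys `OrderBy` has to compute when the filter comes before or after the sort. The class `FilterPlacementDemo`, the reversed 1000-element input and the `x % 100 == 0` filter are all assumptions chosen for the example.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class FilterPlacementDemo
{
    // Returns how many key-selector calls OrderBy makes in each variant.
    public static (int filterFirst, int sortFirst) Run()
    {
        int[] arr = Enumerable.Range(0, 1000).Reverse().ToArray();

        // Filter first: OrderBy only sees the elements that pass the filter.
        int keysFilterFirst = 0;
        int[] a = arr.Where(x => x % 100 == 0)
                     .OrderBy(x => { keysFilterFirst++; return x; })
                     .ToArray();

        // Sort first: OrderBy has to process the whole array before filtering.
        int keysSortFirst = 0;
        int[] b = arr.OrderBy(x => { keysSortFirst++; return x; })
                     .Where(x => x % 100 == 0)
                     .ToArray();

        // Both variants produce the same sequence; only the amount of work differs.
        if (!a.SequenceEqual(b)) throw new InvalidOperationException("results differ");

        return (keysFilterFirst, keysSortFirst);
    }
}
```

In this sketch the results are the same either way, but the filter-first variant sorts far fewer elements. A `Select` does not change the element count, so no comparable saving is available in the question's two queries.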

To cut a long story short: the difference most probably comes from some internal initialization logic. A dataset of 100000 numbers is not really big, at least not big enough, so even a fast initialization has a big impact on the measurement.
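
One way to check for such one-time costs is to time the same query several times in a row and compare the first run with the later ones. This is only a sketch: `WarmUpDemo` and `Measure` are hypothetical names, and the first run may include JIT compilation and other initialization that later runs do not pay for.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class WarmUpDemo
{
    // Times the same query `runs` times on the same input and returns each elapsed time.
    public static TimeSpan[] Measure(int[] arr, int runs)
    {
        var timings = new TimeSpan[runs];

        for (int i = 0; i < runs; i++)
        {
            var sw = Stopwatch.StartNew();

            // ToArray() forces enumeration so the work falls inside the timed region.
            int[] result = arr.OrderBy(x => x).Select(x => x + 5).ToArray();

            sw.Stop();
            GC.KeepAlive(result);
            timings[i] = sw.Elapsed;
        }

        return timings;
    }
}
```

Calling, for example, `WarmUpDemo.Measure(arr, 5)` and discarding the first entry gives numbers that are less affected by start-up costs. For serious measurements a tool such as BenchmarkDotNet handles warm-up for you.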

MakePeaceGreatAgain