8

Consider the following simple code with LINQ OrderBy and ThenBy:

static void Main()
{
  var arr1 = new[] { "Alpha", "Bravo", "Charlie", };

  var coStr = Comparer<string>.Create((x, y) =>
  {
    Console.WriteLine($"Strings: {x} versus {y}");
    return string.CompareOrdinal(x, y);
  });

  arr1.OrderBy(x => x, coStr).ToList();

  Console.WriteLine("--");

  var arr2 = new[]
  {
    new { P = "Alpha", Q = 7, },
    new { P = "Bravo", Q = 9, },
    new { P = "Charlie", Q = 13, },
  };

  var coInt = Comparer<int>.Create((x, y) =>
  {
    Console.WriteLine($"Ints: {x} versus {y}");
    return x.CompareTo(y);
  });

  arr2.OrderBy(x => x.P, coStr).ThenBy(x => x.Q, coInt).ToList();
}

This simply uses some comparers that write out to the console what they compare.

On my hardware and version of the Framework (.NET 4.6.2), this is the output:

Strings: Bravo versus Alpha
Strings: Bravo versus Bravo
Strings: Bravo versus Charlie
Strings: Bravo versus Bravo
--
Strings: Bravo versus Alpha
Strings: Bravo versus Bravo
Ints: 9 versus 9
Strings: Bravo versus Charlie
Strings: Bravo versus Bravo
Ints: 9 versus 9

My question is: Why would they compare an item from the query to itself?

In the first case, before the -- separator, they do four comparisons. Two of them compare an entry to itself ("Strings: Bravo versus Bravo"). Why?

In the second case, there should not ever be a need for resorting to comparing the Q properties (integers); for there are no duplicates (wrt. ordinal comparison) in the P values, so no tie-breaking from ThenBy should be needed ever. Still we see "Ints: 9 versus 9" twice. Why use the ThenBy comparer with identical arguments?

Note: Any comparer has to return 0 upon comparing something to itself. So unless the algorithm just wants to check if we implemented a comparer correctly (which it will never be able to do fully anyway), what is going on?

Be aware: There are no duplicates in the elements yielded by the queries in my examples.

I saw the same issue with another example with more entries yielded from the query. Above I just give a small example. This happens with an even number of elements yielded, as well.

Jeppe Stig Nielsen
  • 60,409
  • 11
  • 110
  • 181
  • They use a simple iteration without a check before invoking the compare? fail ;-) – Jeroen van Langen Aug 29 '17 at 07:44
  • This may be a matter of unnecessarily contrived evaluation. It'd be relatively easy to compare indexes in the beginning, but if the same code also keeps checking this when the output array has already shifted _some_ of its elements, that means that the `OrderBy` logic needs to start tracking which element it put where to know that it's the same. This might create more overhead for a comparatively negligible performance boost, as `OrderBy` will generally be used on large datasets (therefore minimizing how often equal elements are checked). – Flater Aug 29 '17 at 07:51
  • @Flater Maybe it would not want to keep track of "original indices". But would it ever duplicate one entry and put it in _two_ places at a time? Then how can it remember to remove those extra duplicates in the end? Because the count of the query is not allowed to grow when sorting. – Jeppe Stig Nielsen Aug 29 '17 at 07:55
  • 4
    [reference source](http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,28b56d00f5cd66de,references) You may take your own version of this `EnumerableSorter` and debug what happens. I guess that the "boundaries" (`i` and `j`) will eventually point to the same map entry as `x`. And to avoid that would be more overhead than the rare comparisons of elements with themselvs. – René Vogt Aug 29 '17 at 07:59
  • Once you know its quick sort, you can find plenty of questions (here and elsewhere) of people asking why sometimes elements are compared with themselves. – Damien_The_Unbeliever Aug 29 '17 at 08:02
  • @JeppeStigNielsen: Possibly because it first checks for the correct position, before then removing and placing the item. As to why the check isn't skipped for this element; I think there is little to no performance gain here. Instead of a single unneeded compare of equal items, you would otherwise **always** have to pre-check if `currentItem` isn't equal to `thisItemInTheList` before you compare them, which for large datasets might create more work than it actually solves. It would also create an added distinction in logic for value types vs reference types, which complicates things. – Flater Aug 29 '17 at 08:07
  • I.e. https://stackoverflow.com/questions/7373652/in-the-listt-sort-method-is-an-item-ever-compared-to-itself could be a duplicate... If all elements are the same there is no way to avoid this comparison. – Alexei Levenkov Aug 29 '17 at 08:08
  • @Damien_The_Unbeliever It is supposed to be a _stable_ sort ([documented](https://msdn.microsoft.com/en-us/library/bb549422.aspx)), so if it is quick sort, it is not the usual variant of quick sort. But I agree with René Vogt that if you read the source carefully enough, you will eventually understand what is going on. – Jeppe Stig Nielsen Aug 29 '17 at 08:13
  • @AlexeiLevenkov The other thread is related, although the algorithm for `List<>.Sort` and `Array.Sort<>` is different from that of `OrderBy`. And the former has been changed and improved since the other thread was active. – Jeppe Stig Nielsen Aug 29 '17 at 08:16
  • 2
    Without being able to ask the original authors of the algorithm, any answers are likely to be primarily opinion-based. My opinion-based answer is that the authors got the (fairly complicated) code working without making additional optimisations to prevent comparison of duplicates, and decided it wasn't worth spending the time (and risking breaking the code) to add what would likely be an insignificant performance benefit, – Matthew Watson Aug 29 '17 at 08:41
  • My oppinion on this: the fact that "a" comparing to "a" means is equal, might in some edge cases not be valid... maybe "a" is actually not equal to "a" on some scenarios (you may want to return -1 or +1 for some obscure reason). You can't judge ALL possible scenarios when creating a generic library – Jcl Aug 29 '17 at 08:52

4 Answers4

2

In the reference source of the QuickSort method used by OrderBy you can see these two lines:

while (i < map.Length && CompareKeys(x, map[i]) > 0) i++;
while (j >= 0 && CompareKeys(x, map[j]) < 0) j--;

These while loops run until they find an element that is no longer "greater" (resp. "less") than the one x points to. So they will break when the identical element is compared.

I can't prove it mathematical, but I guess to avoid comparing identical elements would make the algorithm more complicated and introduce overhead that would impact performance more than this single comparison.
(Note that your comparer should be implemented clever enough to quickly return 0 for identical elements)

René Vogt
  • 43,056
  • 14
  • 77
  • 99
  • You: _"your comparer should be implemented clever enough to quickly return 0 for identical elements"_ That is not possible in all cases. If `TKey` (the type we project onto with `OrderBy`) is a value type (copy by value) and holds a number of instance fields, there is no short-cut possible. We have to check each and every field to determine that the two "compound" values are equal. – Jeppe Stig Nielsen Aug 29 '17 at 09:18
  • You made me read some of the source code, and I found a __bug__ instead! If `comparer` returns `int.MinValue` and you use `OrderByDescending`, it will try to negate `int.MinValue`, but it stays negative. And the sort goes wrong! From [line 2537](http://referencesource.microsoft.com/#System.Core/System/Linq/Enumerable.cs,2537). I have a short repro that I can post elsewhere. – Jeppe Stig Nielsen Aug 29 '17 at 10:15
  • 1
    @JeppeStigNielsen - is it [this one](https://connect.microsoft.com/VisualStudio/feedback/details/634949)? – Damien_The_Unbeliever Aug 29 '17 at 11:47
  • 1
    @Damien_The_Unbeliever yeah, that's the one. It's fixed in CoreFX and marked as netfx-port-consider, so we may see the fix brought into NetFX at some point: https://github.com/dotnet/corefx/pull/2240 – Jon Hanna Aug 29 '17 at 11:48
  • @Damien_The_Unbeliever Absolutely, that was the one! – Jeppe Stig Nielsen Aug 29 '17 at 20:18
  • 1
    I think the case here is slightly different. `CompareKeys` is declared as `internal abstract int CompareKeys(int index1, int index2);`, i.e. is intended to compare 2 **indices** of the element array. Hence it's safe to assume that if `index1 == index2`, the result is `0` (equal). – Ivan Stoev Aug 30 '17 at 07:38
  • 1
    @IvanStoev What a brilliant observation. In the implementation in the same file, it says `internal override int CompareKeys(int index1, int index2) { int c = comparer.Compare(keys[index1], keys[index2]); [...] }` Here, before the `Compare` call, we could just use the literal `0` if `index1` and `index2` are the same! – Jeppe Stig Nielsen Aug 30 '17 at 13:40
  • 1
    In fact, if we just do `internal override int CompareKeys(int index1, int index2) { if (index1 == index2) return 0; int c = comparer.Compare(keys[index1], keys[index2]); if (c == 0) { if (next == null) return index1 - index2; return next.CompareKeys(index1, index2); } return (descending != (c > 0)) ? 1 : -1; }` we solve it all. We do not ask the `comparer` about something that may take time (may need to check every field on the `TKey`) but must always give `0`. And we do not go on to ask `next` (`ThenBy`) either. That has got to be an improvement. – Jeppe Stig Nielsen Aug 30 '17 at 13:57
  • @JeppeStigNielsen Indeed. I see no reason of not making such optimization. Moreover, it would have obvious benefit in multi key comparison like in your second scenario. But for some reason they didn't make it :) – Ivan Stoev Aug 30 '17 at 15:09
0

In the first case, before the -- separator, they do four comparisons. Two of them compare an entry to itself ("Strings: Bravo versus Bravo"). Why?

Efficiency. Sure it would be possible to check the object isn't itself first, but that means doing an extra operation on every comparison to avoid a case that comes up relatively rarely and which in most cases is pretty cheap (most comparers are). That would be a nett loss.

(Incidentally, I did experiment with just such a change to the algorithm, and when measured it really was an efficiency loss with common comparisons such as the default int comparer).

In the second case, there should not ever be a need for resorting to comparing the Q properties (integers); for there are no duplicates (wrt. ordinal comparison) in the P values, so no tie-breaking from ThenBy should be needed ever. Still we see "Ints: 9 versus 9" twice. Why use the ThenBy comparer with identical arguments?

Who is to say there are no duplicates? The internal comparison was given two things (not necessarily reference types, so short-circuiting on reference identity isn't always an option) and has two rules to follow. The first rule needed a tie-break so the tie-break was done.

The code is designed to work for cases where there can be equivalent values for the first comparison.

If it's known that there won't be equivalent values for the OrderBy then it's for the person who knows that to not use an unnecessary ThenBy, as they are the person who can potentially know that.

Jon Hanna
  • 110,372
  • 10
  • 146
  • 251
  • Maybe this is correct. I tried ordering a sequence of 1000 pairwise distinct entries, in random initial order, with `OrderBy`. On average, that requires about 13391.9 comparisons, out of which about 1310.7 are comparisons of an element to itself. So that is about 10% of the comparisons that are of this type, with a shuffled sequence of length 1000. In comparison, `Array.Sort<>` or equivalently `List<>.Sort`, uses fewer comparisons and _very_ rarely makes a comparison of an entry to itself. I know `OrderBy` has stricter requirements (e.g. sort must be stable) than `Array.Sort<>` method. – Jeppe Stig Nielsen Aug 30 '17 at 13:24
  • However, note the recent comments by Ivan Stoev and myself to the answer here by René Vogt. It appears to be really easy to not test the `comparer` when `index1 == index2`. – Jeppe Stig Nielsen Aug 30 '17 at 13:42
  • @JeppeStigNielsen easy, but not free, but by all means give it a shot. It's been a while now since I looked at that so I can neither rule out that I did something silly in my experiment, or that the changes made since won't have shifted things, but any testing should be probably be done with the default int comparer as that would seem to be the most commonly used. – Jon Hanna Aug 30 '17 at 13:45
-1

Ok, let's see the possibilities here:

  1. T is a value type

    In order to check if it's comparing an item against itself, it first needs to check if both items are the same one. How would you do that?

    You could call Equals first and then CompareTo if the items are not the same. Do you really want to do that? The cost of Equals is going to be roughly the same as comparing so you'd actually be doubling the cost of the ordering for exactly what? OrderBy simply compares all items, period.

  2. T is a reference type

    c# doesn't let you overload only with generic constraints so you'd need to check in runtime if T is a reference type or not and then call a specific implementation that would change the behavior described above. Do you want to incurr in that cost in every case? Of course not.

    If the comparison is expensive, then implement in your comparison logic a reference optimization to avoid incurring in stupid costs when comparing an item to itself, but that choice must be yours. I'm pretty sure string.CompareTo does precisely that.

I hope this makes my answer clearer, sorry for the previous short answer, my reasoning wasnt that obvious.

InBetween
  • 32,319
  • 3
  • 50
  • 90
  • I do not understand. The strings `"Bravo"` and `"Bravo"` are equal already as references. The samme goes with the anonymous type `new { P = "Bravo", Q = 9, }`. If it checked reference equality first (which would be useless when it knows the answer in advance), there would be even _less_ reason to also ask my custom comparer. – Jeppe Stig Nielsen Aug 29 '17 at 07:49
  • @JeppeStigNielsen How would you check to know if you are comparing the same object? You'd do a reference check. The default comparer is probly optimized to return 0 if the reference is the same and only perform a value comparison otherwise (if appropriate). If you don't implement that optimization then you'll incur in additional cost but that's due to your implementation. – InBetween Aug 29 '17 at 07:58
  • @InBetween: But reference equality checks wouldn't be displayed. And I can just repeat Jeppe: compile time string literals are interned strings and *are always equal by reference*. – György Kőszeg Aug 29 '17 at 08:17
  • @JeppeStigNielsen see updated answer, I was really too succulent and hard to understand. – InBetween Aug 29 '17 at 08:35
  • @taffer interned strings have nothing to do with the issue. See updated answer. – InBetween Aug 29 '17 at 08:36
  • I still do not see the relevance of your answer. __I am not saying__ they should check for equality before they call my comparer. As you say, there is no good way they could do that (if `T` was a value type, for example). What I __am__ saying is that they should not duplicate one element from the query and compare the copy to the original. It makes little sense. Consider the simplest of all cases: `new[] { "Alpha", }` queried by `.OrderBy(x => x, coStr)`. Why do they compare `"Alpha"` to `"Alpha"` in that case?! There is only one `"Alpha"` in the array. – Jeppe Stig Nielsen Aug 29 '17 at 09:12
  • @JeppeStigNielsen Ok, take an array of n items and write a general implementation that takes elements two at a time and compares all items with all items without comparing any two items that are the same. Do this avoiding costs greater than simply performing n additional needless checks of an item against itself. – InBetween Aug 29 '17 at 09:16
  • @JeppeStigNielsen My whole point is that the cheapest option is doing the comparison. If the comparisons are expensive then it *your* responsibility to optimize it with a reference check if possible. – InBetween Aug 29 '17 at 09:22
-1

In simple terms in case 1

var coStr = Comparer<string>.Create((x, y) =>
{
    Console.WriteLine($"Strings: {x} versus {y}");
    return string.CompareOrdinal(x, y);
});

We are just comparing the elements there is no condition to ignore if the result is 0. so Console.WriteLine condition is irrespective to the output of comparison. If you change your code like below

var coStr = Comparer<string>.Create((x, y) =>
{
   if (x != y)
      Console.WriteLine($"Strings: {x} versus {y}");
   return string.CompareOrdinal(x, y);
});

Your output will be like

Strings: Bravo versus Alpha
Strings: Bravo versus Charlie

Same thing for the second statement here we are checking the output of both so for string comparison will return 0 then it will go for the in comparison so it will take that one and output the required. Hope it clears your doubt :)

René Vogt
  • 43,056
  • 14
  • 77
  • 99
Vivek Shah
  • 55
  • 8
  • 2
    The OP was asking *why* these lambdas were being called when x==y, not simply how to prevent the `Console.WriteLine` calls which were only inserted to *demonstrate* that they were being called when x==y. – Damien_The_Unbeliever Aug 29 '17 at 09:01
  • That is what I mentioned lambda been called for comparison its entire work to compare two values and return -1,0,1. So in the code, we are just comparing 2 values and with this output. Comparer task is ended. I mentioned x==y condition to explain that compared task is just to compare and send the result. Later we need to use those results to do the other operations. In case 2 it will take the in as first string comparison one will return 0 and Int comariosn will be used. – Vivek Shah Aug 29 '17 at 09:15
  • @Damien_The_Unbeliever does it make sense? – Vivek Shah Aug 29 '17 at 09:24