
I have two multisets, both IEnumerables, and I want to compare them.

string[] names1 = { "tom", "dick", "harry" };
string[] names2 = { "tom", "dick", "harry", "harry" };
string[] names3 = { "tom", "dick", "harry", "sally" };
string[] names4 = { "dick", "harry", "tom" };

I want names1 == names4 to return true (and self == self to return true, obviously), but all other combinations to return false.

What is the most efficient way? These can be large sets of complex objects.

I looked at doing:
var a = names1.OrderBy<MyCustomType, string>(v => v.Name);
var b = names4.OrderBy<MyCustomType, string>(v => v.Name);

return a == b;

dFlat
  • Similar question for Java: http://stackoverflow.com/questions/1565214/is-there-a-way-to-check-if-two-collections-contain-the-same-elements-independent – finnw Feb 04 '11 at 12:04
  • possible duplicate of [Comparing two collections for equality irrespective of the order of items in them](http://stackoverflow.com/questions/50098/comparing-two-collections-for-equality-irrespective-of-the-order-of-items-in-the) – nawfal Nov 09 '13 at 00:06

4 Answers


First sort as you have already done, and then use Enumerable.SequenceEqual. You can use the first overload if your type implements IEquatable<MyCustomType> or overrides Equals; otherwise you will have to use the second form and provide your own IEqualityComparer<MyCustomType>.

So if your type does implement equality, just do:

return a.SequenceEqual(b);
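
If your type does not implement equality, a minimal sketch of the comparer-based form could look like this (MyCustomType and its Name property come from the question; treating Name as the whole identity, and ignoring nulls, are simplifying assumptions for illustration):

class MyCustomTypeNameComparer : IEqualityComparer<MyCustomType>
{
    // Assumes Name alone defines equality, matching the OrderBy key below;
    // null handling is omitted for brevity.
    public bool Equals(MyCustomType x, MyCustomType y)
    {
        return string.Equals(x.Name, y.Name);
    }

    public int GetHashCode(MyCustomType obj)
    {
        return obj.Name.GetHashCode();
    }
}

// Sort both sequences on the same key, then compare element by element
// (names1 and names4 are assumed to be sequences of MyCustomType here).
var a = names1.OrderBy(v => v.Name);
var b = names4.OrderBy(v => v.Name);
return a.SequenceEqual(b, new MyCustomTypeNameComparer());

Using the same key for the comparer and for OrderBy avoids the pitfall (noted in the comments) of equal sort keys appearing in arbitrary order.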

Here's another option that is faster and safer, and requires no sorting:

public static bool UnsortedSequencesEqual<T>(
    this IEnumerable<T> first,
    IEnumerable<T> second)
{
    return UnsortedSequencesEqual(first, second, null);
}

public static bool UnsortedSequencesEqual<T>(
    this IEnumerable<T> first,
    IEnumerable<T> second,
    IEqualityComparer<T> comparer)
{
    if (first == null)
        throw new ArgumentNullException("first");

    if (second == null)
        throw new ArgumentNullException("second");

    // Count the occurrences of each element in the first sequence.
    // (Dictionary<TKey, TValue> treats a null comparer as the default comparer.)
    var counts = new Dictionary<T, int>(comparer);

    foreach (var i in first) {
        int c;
        if (counts.TryGetValue(i, out c))
            counts[i] = c + 1;
        else
            counts[i] = 1;
    }

    // Walk the second sequence, decrementing the counts and failing fast
    // on any element the first sequence did not contain.
    foreach (var i in second) {
        int c;
        if (!counts.TryGetValue(i, out c))
            return false;

        if (c == 1)
            counts.Remove(i);
        else
            counts[i] = c - 1;
    }

    // Every count has reached zero exactly when the sequences are equal multisets.
    return counts.Count == 0;
}
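
For example, with the arrays from the question (a small usage sketch; it assumes the two methods above are declared in a static class that is in scope):

string[] names1 = { "tom", "dick", "harry" };
string[] names2 = { "tom", "dick", "harry", "harry" };
string[] names4 = { "dick", "harry", "tom" };

Console.WriteLine(names1.UnsortedSequencesEqual(names4)); // True
Console.WriteLine(names1.UnsortedSequencesEqual(names2)); // False: "harry" occurs once vs. twice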
cdhowie
  • Be careful when you have a complex object: If items in the sorted sequence have equal keys, they may appear in any order in the sorted sequence, and if the equality comparer doesn't consider them equal, it may give you incorrect results. You should make sure that the equality comparer used works exactly the way the OrderBy comparer does. – Mehrdad Afshari Jan 02 '11 at 02:07
  • @Mehrdad: A good point. The second approach I provide should not be affected by this possibility. – cdhowie Jan 02 '11 at 02:10
  • What makes you think the `List` method will be faster? Do you have benchmarks? I would expect it to be slower because it's O(n^2) while sorting and doing `SequenceEqual` is O(n lg n). – Gabe Jan 02 '11 at 02:13
  • @Gabe: In the limited examples he has provided, sorting will be more expensive overall than a linear search. Of course, if performance is a huge problem then either sorting or hashtables could be used to do the comparison. – cdhowie Jan 02 '11 at 02:15
  • @Gabe: I've tossed that idea anyway in favor of an approach that should perform quite well in most scenarios. See my updated example. It will make only one pass through each array. – cdhowie Jan 02 '11 at 02:20
  • Sorting would then be more *scalable* even if the performance is lower with smaller lists. – Rei Miyasaka Jan 02 '11 at 02:28
  • Your new algorithm is much better, but http://stackoverflow.com/questions/4576723/c-and-linq-want-1-1-2-3-1-2-3-1-returns-true-but-1-1-2-3-1-2-3-re/4576854#4576854 is much more elegant. I'm curious what the performance penalty is for the elegance. – Gabe Jan 02 '11 at 07:08
  • @Gabe: I'm not familiar with how lookups are implemented, but in the second option of that answer the memory consumed will be double that of my answer since it builds up two dictionaries and compares the values on each instead of building up one dictionary and then tearing it down. The grouping might take longer too, again depending on how it's implemented. – cdhowie Jan 02 '11 at 12:16
  • Yep, I expect your answer to be faster, but not by much... I'll try it. I'm guessing this answer is probably the fastest solution possible without loss of generality. – Eamon Nerbonne Jan 05 '11 at 12:40
  • +1 it's quite a bit faster: for 1000000 ints it takes 0.4 vs. 1.9 seconds; for short strings it's 1s vs. 3s, and for a bit more complex objects it's 2.1s vs. 4.7s. For small lists (10000 elements) the difference is much smaller; it's probably a memory allocation difference indeed. – Eamon Nerbonne Jan 05 '11 at 12:55
  • This answer is great! Is there an existing package that includes this or a solution like it? I'm thinking of MoreLINQ which I don't believe has this method. Would you be willing to contribute your solution to such a project? – meustrus Oct 04 '16 at 16:43

The most efficient way would depend on the data types. A reasonably efficient O(N) solution that's very short is the following:

var list1Groups = list1.ToLookup(i => i);
var list2Groups = list2.ToLookup(i => i);
return list1Groups.Count == list2Groups.Count 
    && list1Groups.All(g => g.Count() == list2Groups[g.Key].Count());

The items are required to have a valid Equals and GetHashCode implementation.
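
For instance, plugging in the question's arrays (a quick usage sketch of the same snippet):

string[] names1 = { "tom", "dick", "harry" };
string[] names2 = { "tom", "dick", "harry", "harry" };

var list1Groups = names1.ToLookup(i => i);
var list2Groups = names2.ToLookup(i => i);

// false: both lookups have three keys, but the "harry" group holds
// one element on the left and two on the right.
bool equal = list1Groups.Count == list2Groups.Count
    && list1Groups.All(g => g.Count() == list2Groups[g.Key].Count());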

If you want a faster solution, cdhowie's solution above is comparably fast at 10,000 elements, and pulls ahead by a factor of 5 for large collections of simple objects - probably due to better memory efficiency.

Finally, if you're really interested in performance, I'd definitely try the Sort-then-SequenceEqual approach. Although it has worse complexity, that's just a log N factor, and those can definitely be drowned out by differences in the constant for all practical data set sizes - and you might be able to sort in-place, use arrays or even incrementally sort (which can be linear). Even at 4 billion elements, the log-base-2 is just 32; that's a relevant performance difference, but the difference in constant factor could conceivably be larger. For example, if you're dealing with arrays of ints and don't mind modifying the collection order, the following is faster than either option even for 10000000 items (twice that and I get an OutOfMemory on 32-bit):

Array.Sort(list1);
Array.Sort(list2);
return list1.SequenceEqual(list2);

YMMV depending on machine, data-type, lunar cycle, and the other usual factors influencing microbenchmarks.

Eamon Nerbonne
  • Clean and elegant. I like the lookup version too. If benchmarks show a difference I will update here. Separately, are you all pulling the big-O notations seen here from experience/feel, or is this documented somewhere? – dFlat Jan 02 '11 at 07:38
  • Interestingly, it's not well specified: http://msdn.microsoft.com/en-us/library/bb353368.aspx. However, there's an obvious implementation of ToLookup using a hashtable which would be `O(N)` - presumably that's used. In fact, ToLookup uses an `IEqualityComparer` that only provides `Equals` and `GetHashCode`; it's pretty much forced to use a hashtable under the covers. Then it's linear iteration over `list1` and `list2` and some more linear iteration over the (smaller) lookups and their values (equal to those in the lists). By experience it's fast, and by common sense it almost has to be O(N). – Eamon Nerbonne Jan 05 '11 at 12:23
  • A performance enhancement, especially if you are going to run the comparison over several consecutive cycles and the collections are large, is to compare the counts of the original lists before creating the lookups. If they are indeed different, you can fail the comparison early, before wasting cycles on the lookups. – Mauricio Quintana Jun 20 '16 at 21:09

You could use a binary search tree to ensure that the data is sorted; each insertion is an O(log N) operation. Then you can run through each tree one item at a time and break as soon as you find an element that isn't equal. This would also give you the added benefit of being able to first compare the sizes of the two trees, since duplicates would be filtered out. I'm assuming these are treated as sets, whereby {"harry", "harry"} == {"harry"}.

If you are counting duplicates, then do a quicksort or a mergesort first; that makes the comparison itself an O(N) operation. You could of course compare the sizes first, as two enumerations cannot be equal if their sizes differ. Since the data is sorted, the first non-equal element you encounter renders the entire operation "not equal". A sketch of this idea follows.
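
A sketch of that sort-then-scan idea (the array parameters, the defensive copies, and the IComparable<T> constraint are illustrative assumptions, not part of the answer; null elements are not handled):

public static bool SortedCompare<T>(T[] first, T[] second)
    where T : IComparable<T>
{
    // Two multisets of different sizes can never be equal.
    if (first.Length != second.Length)
        return false;

    // Sort copies so the callers' arrays keep their original order.
    var a = (T[])first.Clone();
    var b = (T[])second.Clone();
    Array.Sort(a);
    Array.Sort(b);

    // One O(N) pass; the first mismatch ends the whole operation.
    for (int i = 0; i < a.Length; i++)
        if (a[i].CompareTo(b[i]) != 0)
            return false;

    return true;
}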

jamesmortensen
  • OP indicates that `names1` and `names2` are not to be considered equal. So this is not simple set equality (duplicate elements matter). – cdhowie Jan 02 '11 at 02:03
  • @cdhowie: Maybe you could attach a count to each element in the tree, and check that? – wj32 Jan 02 '11 at 02:04

@cdhowie's answer is great, but here's a nice trick that makes it even better for types that expose a Count property: compare the counts before treating the parameters as plain IEnumerable<T>. Just add this overload to your code alongside his solution:

public static bool UnsortedSequencesEqual<T>(this IReadOnlyList<T> first, IReadOnlyList<T> second, IEqualityComparer<T> comparer = null)
{
    // Multisets of different sizes can never be equal, so bail out
    // before building any dictionary of counts.
    if (first.Count != second.Count)
    {
        return false;
    }

    return UnsortedSequencesEqual((IEnumerable<T>)first, (IEnumerable<T>)second, comparer);
}
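
With this in place, arrays and List<T> (both implement IReadOnlyList<T>) should get the cheap length check first, since overload resolution prefers the more specific parameter types. A small usage sketch:

var xs = new[] { 1, 2, 3 };
var ys = new List<int> { 1, 2, 3, 3 };

// Resolves to the IReadOnlyList<T> overload, so this returns false
// on the Count check (3 != 4) without enumerating either sequence.
bool equal = xs.UnsortedSequencesEqual(ys);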
Mahmoud Al-Qudsi