15

Is there a fast built-in way to check if an IEnumerable<string> contains only distinct strings?

In the beginning I started with:

var enumAsArray = enum.ToArray();
if (enumAsArray.Length != enumAsArray.Distinct().Count())
    throw ...

However, this looks like it is O(2n) - is it? ToArray() might be O(1)?

This looks faster:

var set = new HashSet<string>();
foreach (var str in enum)
{
    if (!set.Add(str))
        throw ...
}

This should be O(n), however, is there a built-in way too?

Edit: Maybe Distinct() uses this internally?


Solution: After considering all the comments and the answer, I wrote an extension method for my second solution, as this seems to be the fastest version and the most readable too:

public static bool ContainsDuplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            return true;
    }
    return false;
}
Marc
  • 3,905
  • 4
  • 21
  • 37
D.R.
  • 20,268
  • 21
  • 102
  • 205
  • 1
    What you have isn't O(n), as `set.Add` isn't O(1) (as mentioned in the docs, it may require reallocations). However, `new HashSet(enum)` is O(n), and you can directly read `Count` after that. –  Sep 14 '13 at 16:44
  • However, on average, I should be faster using the second method, as it breaks on the first duplicate? – D.R. Sep 14 '13 at 16:46
  • Why is there no new HashSet(enum.Count()) constructor? – D.R. Sep 14 '13 at 16:47
  • Sure, your `foreach` may be faster for some inputs. That'll be the case if the extra allocations take less time than the skipped comparisons. But I don't know if that'll be true on average. Your suggested `HashSet(int)` constructor would be useful, but I don't think there is such a thing. –  Sep 14 '13 at 16:50
  • Why are you converting it to an array? Could you not just do `myEnumerable.Distinct().Count() != myEnumerable.Count()`? – keyboardP Sep 14 '13 at 16:57
  • @keyboardP: possible multiple enumeration. – D.R. Sep 14 '13 at 16:58
  • @D.R. Ah, fair enough. – keyboardP Sep 14 '13 at 16:58
  • @hvd: instead of using HashSet, do you think it is a good idea to use Dictionary and just use the keys? Dictionary provides a ctor with initial size. – D.R. Sep 14 '13 at 16:58
  • 2
    @D.R. That's a pretty darn small optimization in exchange for using a semantically inappropriate data structure. Doesn't really seem worthwhile in general. – Servy Sep 14 '13 at 17:03
  • 1
    I don't know, if you care about performance you're probably dealing with large sets of data, and I don't see how Dictionary could use less memory than HashSet. Performance questions are difficult to answer usefully without a lot of knowledge of the type of data you'll end up using. –  Sep 14 '13 at 17:03
  • I have edited your title. Please see, "[Should questions include “tags” in their titles?](http://meta.stackexchange.com/questions/19190/)", where the consensus is "no, they should not". – John Saunders Sep 14 '13 at 17:24
  • `HashSet.Add` is amortized O(1) as long as the hash is big and good enough. – CodesInChaos Sep 14 '13 at 17:27
  • O(n)=O(2n). And for an arbitrary enumerable it's not possible to be faster than that (up to a constant factor) since you need to look at each element at least once if there are no duplicates. – CodesInChaos Sep 14 '13 at 17:28
  • @CodesInChaos Although that's true, I still want to keep the constant factor to a minimum. – D.R. Sep 14 '13 at 17:31
  • then replace the `.Any` by a `foreach` loop or at least use `!All(set.Add)`. I prefer the `foreach` loop since I don't like LINQ with side-effects. – CodesInChaos Sep 14 '13 at 17:35
  • Yup, I need to replace it with a foreach loop due to hvd's comment on the accepted answer. `!All` kills the short-circuit once more. – D.R. Sep 14 '13 at 17:41
  • @D.R. `All` does *not* prevent short circuiting. It can stop iterating as soon as it hits a `false` value, in just the same way that `Any` can stop the second it hits a true one. – Servy Sep 16 '13 at 13:43
  • @D.R. rather than editing a solution into the question (where it doesn't belong), you should write that solution as a separate answer. – psubsee2003 Dec 22 '13 at 10:03

3 Answers3

9

Your second code sample is short, simple, clearly effective, and if not the completely perfect ideal solution, is clearly rather close to it. It seems like a perfectly acceptable solution to your particular problems.

Unless your use of that particular solution is shown to cause performance problems after you've noticed issues and done performance testing, I'd leave it as is. Given how little room I can see for improvement in general, that doesn't seem likely. It's not a sufficiently lengthy or complex solution that trying to find something "shorter" or more concise is going to be worth your time and effort.

In short, there are almost certainly better places in your code to spend your time; what you have already is fine.

To answer your specific questions:

  1. However, this looks like it is O(2n) - is it?

    Yes, it is.

  2. ToArray() might be O(1)?

    No, it's not.

  3. Maybe Distinct() uses this internally?

    It does use a HashSet, and it looks pretty similar, but it simply ignores duplicate items; it doesn't provide any indication to the caller that it has just passed a duplicate item. As a result, you need to iterate the whole sequence twice to see if it removed anything, rather than stopping when the first duplicate is encountered. This is the difference between something that always iterates the full sequence twice and something that might iterate the full sequence once, but can short circuit and stop as soon as it has ensured an answer.

  4. is there a built-in way too?

    Well, you showed one, it's just not as efficient. I can think of no entire LINQ based solution as efficient as what you showed. The best I can think of would be: data.Except(data).Any(). This is a bit better than your distinct compared to the regular count in that the second iteration can short circuit (but not the first) but it also iterates the sequence twice, and still is worse than your non-LINQ solution, so it's still not worth using.

Servy
  • 202,030
  • 26
  • 332
  • 449
  • 1
    @D.R. If you want to share the slightly improved method that you have just edit the code in your question; it doesn't change the question itself in any real way. As it is, it's not an appropriate thing to edit into another person's answer. – Servy Sep 14 '13 at 17:19
  • Isn't it confusing for future visitors to find the solution in the question instead of the answer? – D.R. Sep 14 '13 at 17:20
  • @D.R. But it's *not* a solution. It's basically the exact same as your original code. – Servy Sep 14 '13 at 17:20
  • 1
    @D.R. BTW, the `ContainsDuplicates` method that you added to this answer is not guaranteed to work. It relies on a specific implementation of `Any`, that does not evaluate the predicate more than once for the same item. I cannot imagine why an implementation might evaluate it twice, but it's allowed, and there's no need to make unwarranted assumptions since you've already got a `foreach` loop that *is* guaranteed to work. –  Sep 14 '13 at 17:20
  • 1
    @D.R. The library implementation of `Any` wouldn't do that, and an implementation of the method that did that would be a very poor implementation, it's just that you're relying on something that the public API doesn't technically guarantee, even though it's technically true. – Servy Sep 14 '13 at 17:23
2

Here is a possible refinement to the OP's answer:

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> e)
{
    var set = new HashSet<T>();
    // ReSharper disable LoopCanBeConvertedToQuery
    foreach (var item in e)
    // ReSharper restore LoopCanBeConvertedToQuery
    {
        if (!set.Add(item))
            yield return item;
    }
}

You now have a potentially useful method to get the actual duplicate items and you can answer your original question with:

collection.Duplicates().Any()
OldFart
  • 2,411
  • 15
  • 20
2

Just a complement to the existing solution:

public static bool ContainsDuplicates<T>(this IEnumerable<T> items)
{
    return ContainsDuplicates(items, EqualityComparer<T>.Default);
}

public static bool ContainsDuplicates<T>(this IEnumerable<T> items, IEqualityComparer<T> equalityComparer)
{
    var set = new HashSet<T>(equalityComparer);

    foreach (var item in items)
    {
        if (!set.Add(item))
            return true;
    }

    return false;
}

This version lets you pick an equality comparer, this may prove useful if you want to compare items based on non-default rules.

For instance, to compare a a set of strings case insensitively, just pass it StringComparer.OrdinalIgnoreCase.

Lucas Trzesniewski
  • 50,214
  • 11
  • 107
  • 158