The code below filters one IEnumerable<T> against another that serves as a blacklist. The filtered collection iterates over content fetched remotely (lazy loading, YouTube API).

IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();
IEnumerable<string> blackListContent = this.FetchBlackListContent();
return contentThatCanBeHuge.Where(x => !blackListContent.Contains(x.Id));

The Enumerable.Contains method is O(n) in time complexity, so the Enumerable.Where call could take a while.

On the other hand, HashSet<T>.Contains is O(1). Instantiating a HashSet<T> from an IEnumerable<T> seems to be O(n).

If the blacklist is about to be used multiple times, and without taking space complexity into account, is it a good approach to turn it into a HashSet<T> before using it or is this just premature optimization?

Amessihel

1 Answer

Let the size of blackListContent be m and the size of contentThatCanBeHuge be n.

If we don't use a HashSet, the time complexity is O(n * O(m)) = O(n * m) and the space complexity is O(1): for each item in contentThatCanBeHuge we have to scan the entire blackListContent.

If we use a HashSet, the time complexity is O(m) + O(n * O(1)) = O(n + m) and the space complexity is O(m):

  1. We create the HashSet - O(m) time complexity, O(m) space complexity.
  2. For each item in contentThatCanBeHuge we check whether its Id is in the HashSet - O(n * O(1)) time complexity.

So the HashSet makes the code faster, but it consumes more memory.
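As a rough illustration of the HashSet variant, here is a minimal sketch. The `CustomType` class and the `BlacklistFilter.Filter` helper are hypothetical stand-ins (the question only shows that `CustomType` has an `Id` compared against strings), and the O(1) lookup is the average case for a hash-based set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's CustomType; only a string Id is assumed.
public sealed class CustomType
{
    public CustomType(string id) { Id = id; }
    public string Id { get; }
}

public static class BlacklistFilter
{
    // Hypothetical helper, not part of the original code.
    public static IEnumerable<CustomType> Filter(
        IEnumerable<CustomType> contentThatCanBeHuge,
        IEnumerable<string> blackListContent)
    {
        // Enumerates the blacklist exactly once: O(m) time, O(m) extra space.
        var blackList = new HashSet<string>(blackListContent);

        // The content is still streamed lazily; each item costs one
        // average-case O(1) lookup instead of an O(m) scan.
        return contentThatCanBeHuge.Where(x => !blackList.Contains(x.Id));
    }
}
```

Building the set up front also means the blacklist source is enumerated only once, however many items are filtered, which matters if that source is itself fetched remotely.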

Dmitry Bychenko
  • That's what I thought, thanks. The choice will depend on which is more costly: storing the data in memory or iterating it each time. And since the data of the blacklist is retrieved remotely, it _might be better_ to store it in memory. – Amessihel Aug 03 '23 at 18:00
  • *Might be better* is an understatement. The circumstances would be rather exceptional if they forced you to choose an `O(nm)` solution over an `O(n+m)` one, *especially* if the `IEnumerable` is fetched via lazy loading. Depending on the implementation, it may not even support multiple enumeration. – Mo B. Aug 03 '23 at 19:02