Is it a good practice to instantiate a HashSet from an IEnumerable before using Contains()?

Question

The piece of code below filters an IEnumerable<T> with another, used as a blacklist. The filtered collection iterates over content fetched remotely (lazy loading, YouTube API).

IEnumerable<CustomType> contentThatCanBeHuge = this.FetchContentThatCanBeHuge();
IEnumerable<string> blackListContent = this.FetchBlackListContent();
return contentThatCanBeHuge.Where(x => !blackListContent.Contains(x.Id));

The Enumerable.Contains method is O(n) in time complexity, so the Enumerable.Where call could take a while.

On the other hand, HashSet<T>.Contains is O(1). Instantiating a HashSet<T> from an IEnumerable<T> seems to be O(n).

If the blacklist is about to be used multiple times, and without taking space complexity into account, is it a good approach to turn it into a HashSet<T> before using it or is this just premature optimization?

@TheodorZoulias, very variable, `contentThatCanBeHuge` can enumerate all videos from a YouTube channel. — Amessihel, Aug 03 '23 at 17:16
What is the type of the `Id` property? Does this type have an efficient `GetHashCode` implementation? — Theodor Zoulias, Aug 03 '23 at 17:19
The answer is no. Even if you do find that this is on the critical performance path, hashing may not necessarily be the best option to make it faster. — 500 - Internal Server Error, Aug 03 '23 at 17:25
@500-InternalServerError, thanks. Considering Dmitry's answer, could you elaborate? — Amessihel, Aug 03 '23 at 18:01
The answer is yes. We are talking about different complexity classes here, so definitely *not* premature optimization, and definitely *not* something that we need to measure. Assuming that the blacklist is not so huge that memory would be an issue, I would consider it a bug if I were to see a colleague write such code as above. — Mo B., Aug 03 '23 at 18:57
The `x.Id` does not make any sense if `contentThatCanBeHuge` contains strings. — Mo B., Aug 03 '23 at 19:03

score 4 · Accepted Answer · answered Aug 03 '23 at 17:51


Let the size of blackListContent be m and the size of contentThatCanBeHuge be n.

If we don't use a HashSet, time complexity is O(n * O(m)) = O(n * m) and space complexity is O(1): for each item in contentThatCanBeHuge we have to scan the entire blackListContent.

If we use a HashSet, time complexity is O(m) + O(n * O(1)) = O(n + m) and space complexity is O(m):

We create the HashSet - O(m) time complexity, O(m) space complexity.
For each item in contentThatCanBeHuge we check it against the HashSet - O(n * O(1)) time complexity.

So the HashSet makes the code faster, but we consume more memory.
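
As a rough illustration of the O(n + m) variant, here is a minimal sketch. The CustomType class below is a hypothetical stand-in for the type in the question (only its Id property matters), BlacklistFilter and Filter are names invented for this example, and the use of StringComparer.Ordinal assumes that IDs should be compared case-sensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's CustomType; only Id is used.
public sealed class CustomType
{
    public string Id { get; }
    public CustomType(string id) => Id = id;
}

public static class BlacklistFilter
{
    // Enumerates the blacklist exactly once to build the set (O(m) time and
    // space), then filters the content lazily with hash lookups.
    public static IEnumerable<CustomType> Filter(
        IEnumerable<CustomType> content,
        IEnumerable<string> blackList)
    {
        // Assumption: IDs are case-sensitive, hence the ordinal comparer.
        var blocked = new HashSet<string>(blackList, StringComparer.Ordinal);

        // Where stays deferred, so the (possibly huge) content is still
        // streamed item by item rather than loaded into memory.
        return content.Where(x => !blocked.Contains(x.Id));
    }
}
```

Only the blacklist is materialized here; the content sequence keeps its lazy behavior, so the extra memory is proportional to m, not n.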


Dmitry Bychenko


That's what I thought, thanks. The choice will depend on which is more costly: storing the data in memory or iterating it each time. And since the data of the blacklist is retrieved remotely, it _might be better_ to store it in memory. – Amessihel Aug 03 '23 at 18:00

*Might be better* is an understatement. The circumstances would be rather exceptional if they forced you to choose an `O(nm)` solution over an `O(n+m)` one, *especially* if the `IEnumerable` is fetched via lazy loading. Depending on the implementation, it may not even support multiple enumeration. – Mo B. Aug 03 '23 at 19:02
