I am seeking to understand why the worst-case time complexity of looking up an item in a set (`item in set`), as well as getting an item from a dictionary (`d.get(item)`), is O(n), while in the average case it only takes O(1), given that the underlying data structure of both `set` and `dict` is a hash table.
Existing posts on Stack Overflow related to this topic either don't tackle this question or are too advanced for a beginner like me. An answer on one post said that the time complexity of `item in set` is O(n) when "your hash table's load factor is too high, then you face collisions". I thought that a set didn't allow duplicate values to be added, and that it would also ignore any duplicates supplied when it was created. How does a collision happen in a set, and how exactly does it make the time complexity go from O(1) to O(n)? How about in a dictionary?
What is the conceptual explanation? What would some simple code examples be?
To explain it simply for people who, like me before reading the answers, were confused about collisions versus duplicates: values don't need to be duplicates to cause a hash collision.
For example, "apple" and "banana" are not duplicates, so both can be added to the same set; but depending on the hash function, the hash values computed for "apple" and for "banana" may be the same (or, more commonly, may map to the same bucket in the table), which is a hash collision.
The answers and comments below explain why a hash collision leads to longer search times, and to O(n) in the worst case, when every value in a set collides.