
I am trying to understand why the worst-case time complexity of looking up an item in a set (item in set), or of getting an item from a dictionary (d.get(item)), is O(n), while the average case takes only O(1), given that the underlying data structure of both set and dictionary is a hash table.

Existing posts on Stack Overflow related to this topic either don't tackle this question or are too advanced for a beginner like me. An answer on one post said that the time complexity of item in set is O(n) when "your hash table's load factor is too high, then you face collisions". I thought a set didn't allow duplicate values to be added, and that it would also ignore any duplicates present when it was created. How does a collision happen in a set, and how exactly does it make the time complexity go from O(1) to O(n)? How about in a dictionary?

What is the conceptual explanation? What would some simple code examples be?


To explain it simply for people who, like me before asking, were confused about collisions and duplicates: values don't need to be duplicates to produce a hash collision.

For example, "apple" and "banana" are not duplicates, so they can be added to the same set, but depending on the hash function, the hash values that are computed for "apple" and for "banana" separately may be the same, which leads to a hash collision.

The answers and comments below explain why a hash collision leads to longer searching time, and to O(N) in the worst case when every value in a set collides.
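
As a concrete (and deliberately artificial) illustration, here is a small class whose __hash__ always returns the same constant, so distinct, non-duplicate objects are forced to collide. The class name and the constant are made up for this demo:

class AlwaysCollides:
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 7  # every instance gets the same hash value on purpose
    def __eq__(self, other):
        return isinstance(other, AlwaysCollides) and self.name == other.name

s = {AlwaysCollides("apple"), AlwaysCollides("banana")}
print(len(s))  # 2 -- not duplicates, so both are stored, but they collide
print(hash(AlwaysCollides("apple")) == hash(AlwaysCollides("banana")))  # True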

  • As a start, the collision referenced relates to a collision occurring in the hash table, rather than in the set itself. – S3DEV Jul 09 '23 at 19:24
  • Collision means hash collision. Assume all n items in your set have the same hash value (coincidence or a bad hash function, whatever). They will all end up in the same hash bucket. Depending on the implementation, that might just be a list that has to be linearly searched. It might also mean that there is logic to fall back to a different location, but that too would involve falling back n times for a lookup. – user2390182 Jul 09 '23 at 19:28
  • Right. A hash function takes arbitrarily long things and converts them to something that can be used as a table index. By necessity, when you're converting larger things to a smaller thing, some of the input things will produce the same hash. The same thing occurs with algorithms like MD5. Several inputs can produce the same output. – Tim Roberts Jul 09 '23 at 21:06
  • The link near "A solution" just points to Stack Overflow. What was the intent? – Peter Mortensen Jul 23 '23 at 10:07
  • @PeterMortensen I fixed the link! Sorry about that and thank you for pointing out! – Autumn Nguyen Jul 25 '23 at 17:11

2 Answers

2

Collisions come from the hash function Python uses to place values in a set or dictionary. A hash function maps each item to a hash value, which determines the item's bucket in the hash table (used internally by sets and dictionaries). Ideally, each item would end up in its own bucket, but because a hash function maps an effectively unlimited range of possible values onto a finite number of hash values and buckets, collisions can occur.

For example, let's say your set/dict holds two items, "apple" and "banana", and their hash values (or the bucket indices derived from them) turn out to be the same. In that case, the hash table needs to detect the collision and resolve it.
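
As a rough sketch of the idea (the 8-slot table below is invented for illustration; CPython's real table sizes and probing strategy are implementation details), even values whose hash values differ can land in the same bucket once the hash is reduced to a bucket index:

table_size = 8  # hypothetical tiny hash table
for n in (1, 9, 17):
    print(n, hash(n), hash(n) % table_size)
# 1, 9 and 17 have three different hash values (small ints hash to
# themselves in CPython), yet all three map to bucket 1 in an 8-slot table.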

Hash tables resolve collisions with techniques such as separate chaining, where each bucket holds a linked list of the items that collided, or open addressing, where a colliding item is stored in another slot found by probing. As the comments below point out, CPython's sets and dictionaries actually use open addressing, but the consequence for lookups is the same with either technique.

Now, when you perform a lookup on a set or dictionary, Python uses the hash value of the item you're searching for to locate the corresponding bucket. If colliding items are present, Python has to keep comparing entries, walking the bucket's chain or probing further slots, until it finds an exact match or runs out of candidates.

Therefore, in the worst-case scenario every item in your set/dict has the same hash value, and a lookup has to be compared against all of the stored items, which is why the complexity is O(n). Again, this is the worst-case scenario, and it is very rare in practice.
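
To see the worst case in action, here is a contrived sketch (the class name, the constant hash, and the sizes are made up for the demo): every instance hashes to the same value, and a counter records how many equality comparisons a failed membership test needs.

class Collider:
    comparisons = 0
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42  # deliberately terrible hash: every instance collides
    def __eq__(self, other):
        Collider.comparisons += 1  # count equality checks made while probing
        return isinstance(other, Collider) and self.n == other.n

for size in (100, 200, 400):
    s = {Collider(i) for i in range(size)}
    Collider.comparisons = 0
    Collider(-1) in s  # a miss has to be checked against every colliding entry
    print(size, Collider.comparisons)  # grows roughly linearly with the set size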

– TYZ (edited by Peter Mortensen)
  • Good writeup. "rare" is an understatement. You would need maliciously chosen data, or very badly implemented `__hash__` on the stored objects – user2390182 Jul 09 '23 at 19:35
  • "Python's implementation of sets and dictionaries uses a technique called separate chaining to handle hash collisions" where are you getting that information from? As far as I know, open addressing is used, [at least according to the comments in the source code](https://github.com/python/cpython/blob/ee46cb6aa959d891b0a480fea29f1eb991e0fad8/Objects/setobject.c#L2) – juanpa.arrivillaga Jul 09 '23 at 19:40
-3

To achieve the desired result with dictionaries, you can use nested loops to iterate over the keys and values of the original dictionary and compute the product of values. However, you need to keep track of the combinations that have already been computed to avoid redundant entries.

Here's an example of how you can accomplish this:

d = {'a': 0, 'b': 1, 'c': 2}
result = {}

for key1, value1 in d.items():
    for key2, value2 in d.items():
        if key1 <= key2:  # To avoid redundant computations
            new_key = key1 + '#' + key2
            result[new_key] = value1 * value2

print(result)

Output:

{'a#a': 0, 'a#b': 0, 'a#c': 0, 'b#b': 1, 'b#c': 2, 'c#c': 4}

In this code, the outer loop iterates over the keys and values of the dictionary d. The inner loop also visits every item, but the key1 <= key2 check skips the redundant half of the pairs (e.g., b#a is skipped because a#b has already been computed). The product of the two values is calculated and stored in the result dictionary under the combined key.

By checking key1 <= key2, we ensure that each combination is computed only once. This condition guarantees that the ordering of the keys doesn't affect the result, preventing duplicates such as a#b and b#a.
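
As a side note, the same "each unordered pair exactly once" idea can also be expressed with itertools.combinations_with_replacement from the standard library, which avoids the explicit key comparison:

from itertools import combinations_with_replacement

d = {'a': 0, 'b': 1, 'c': 2}
result = {k1 + '#' + k2: v1 * v2
          for (k1, v1), (k2, v2) in combinations_with_replacement(d.items(), 2)}
print(result)  # same output as above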

I hope this helps! Let me know if you have any further questions.

  • This answer looks like ChatGPT – DavidW Jul 10 '23 at 14:28
  • This answer looks like it was generated by an AI (like ChatGPT), not by an actual human being. You should be aware that [posting AI-generated output is officially **BANNED** on Stack Overflow](https://meta.stackoverflow.com/q/421831). If this answer was indeed generated by an AI, then I strongly suggest you delete it before you get yourself into even bigger trouble: **WE TAKE PLAGIARISM SERIOUSLY HERE.** Please read: [Why posting GPT and ChatGPT generated answers is not currently allowed](https://stackoverflow.com/help/gpt-policy). – tchrist Jul 11 '23 at 00:12