
Say we add a group of long strings to a hash set, and then test whether some string already exists in that hash set. Is the time complexity going to be constant for the add and lookup operations, or does it depend on the length of the strings?

For example, say we have three strings:

s1 = 'abcdefghijklmn'
s2 = 'dalkfdboijaskjd'
s3 = 'abcdefghijklmn'

Then we do:

pool = set()
pool.add(s1)
pool.add(s2)
print(s3 in pool)  # => True
print('zzzzzzzzzz' in pool)  # => False

Would the time complexity of the above operations depend on the length of the strings?

Another question: what if we are hashing a tuple, something like (1, 2, 3, 4, 5, 6, 7, 8, 9)?

I appreciate your help!

==================================

I understand that there are resources around, like this one, that talk about why hashing is constant time and about collision issues. However, they usually assume that the length of the key can be neglected. This question asks whether hashing still takes constant time when the key has a length that cannot be neglected. For example, if we check N times whether a key of length K is in the set, is the time complexity O(N) or O(N*K)?

Zelun Wang
  • Short answer: no. Long answer: it depends on how hashing strings works in Python. After some research, I see that individual strings are immutable, and they store their hash value once it's been computed. That cuts down **drastically** on lookup times... and the algorithm Python uses is pretty cheap too... – Mark Storer Nov 06 '19 at 18:29

4 Answers

0

Strictly speaking it depends on the implementation of the hash set and the way you're using it (there may be cleverness that will optimize some of the time away in specialized circumstances), but in general, yes, you should expect that it will take O(n) time to hash a key for an insert or lookup, where n is the size of the key. Hash sets are usually assumed to be O(1), but there's an implicit assumption there that the keys are of fixed size and that hashing them is an O(1) operation (in other words, that the key size is negligible compared to the number of items in the set).

Optimizing the storage and retrieval of really big chunks of data is why databases are a thing. :)
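As a rough sketch of that point (assuming CPython 3; the key and loop sizes here are arbitrary, and the helper name is mine): membership tests against freshly built strings of length K take time roughly proportional to K, because each probe string has to be hashed (and, on a match, compared) from scratch.

import time

def time_lookups(key_length, n_lookups=10_000):
    # One key of length K lives in the set; each probe is a freshly built
    # string, so its hash has not been cached yet and must be recomputed.
    pool = {"x" * key_length}
    start = time.perf_counter()
    for _ in range(n_lookups):
        probe = "x" * key_length   # building the probe is O(K), and so is hashing it
        _ = probe in pool
    return time.perf_counter() - start

for k in (1_000, 10_000, 100_000):
    print(k, time_lookups(k))      # total time grows roughly linearly with K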

Samwise
0

Average case is O(1). However, the worst case is O(n), with n being the number of elements in the set. This case is caused by hashing collisions. You can read more about it here: https://www.geeksforgeeks.org/internal-working-of-set-in-python/
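To make the collision-driven worst case concrete, here is a contrived sketch (a toy class of my own, not something from the linked article): if every element hashes to the same value, lookups degrade from O(1) towards a linear scan over the colliding entries.

class AlwaysCollides:
    """Toy key whose instances all land on the same hash value."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                      # constant hash => every insert collides
    def __eq__(self, other):
        return isinstance(other, AlwaysCollides) and self.value == other.value

pool = {AlwaysCollides(i) for i in range(2_000)}
# Each membership test now has to compare against many colliding entries,
# so it behaves like O(n) rather than O(1).
print(AlwaysCollides(1_999) in pool)   # True, but found slowly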

ofek T
0

One of the best ways to answer something like this is to dig into the implementation :)

Notwithstanding some of the optimization magic described in the header of setobject.c, adding an object to a set reuses a string's hash if hash() has already been called on it once (recall, strings are immutable), or otherwise calls the type's hash implementation.

For Unicode/bytes objects, we end up (via here) at _Py_HashBytes, which seems to have an optimization for small strings; otherwise it uses the compile-time configured hash function, all of which are naturally somewhat O(n)-ish. But again, this seems to happen only once per string object.

For tuples, the hash implementation can be found here – apparently a simplified, non-cached xxHash.

However, once the hash has been computed, the time complexity for sets should be around O(1).
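As a small sanity check of the "calls the type's hash implementation" part (a toy class of my own, so take it as a sketch rather than a description of setobject.c): you can count how often the set machinery actually asks an object for its hash.

class CountingKey:
    """Counts how many times the set asks for this object's hash."""
    hash_calls = 0
    def __init__(self, payload):
        self.payload = payload
    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.payload)      # delegates to str's (cached) hash
    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.payload == other.payload

key = CountingKey("abcdefghijklmn")
pool = set()
pool.add(key)                          # one __hash__ call on insert
_ = key in pool                        # one __hash__ call per lookup
print(CountingKey.hash_calls)          # => 2

The expensive part is computing the hash of the probe itself; the table lookup that follows is the O(1) part.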

EDIT: A quick, not very scientific benchmark:

import time


def make_string(c, n):
    return c * n


def make_tuple(el, n):
    return (el,) * n


def hashtest(gen, n):
    # First compute how long generation alone takes
    gen_time = time.perf_counter()
    for x in range(n):
        gen()
    gen_time = time.perf_counter() - gen_time

    # Then compute how long hashing and generation takes
    hash_and_gen_time = time.perf_counter()
    for x in range(n):
        hash(gen())
    hash_and_gen_time = time.perf_counter() - hash_and_gen_time

    # Return the two
    return (hash_and_gen_time, gen_time)


for gen in (make_string, make_tuple):
    for obj_length in (10000, 20000, 40000):
        t = f"{gen.__name__} x {obj_length}"
        # Using `b'hello'.decode()` here to avoid any cached hash shenanigans
        hash_and_gen_time, gen_time = hashtest(
            lambda: gen(b"hello".decode(), obj_length), 10000
        )
        hash_time = hash_and_gen_time - gen_time
        print(t, hash_time, obj_length / hash_time)

outputs

make_string x 10000 0.23490356100000004 42570.66158311665
make_string x 20000 0.47143921999999994 42423.284172241765
make_string x 40000 0.942087403 42458.905482254915
make_tuple x 10000 0.45578034300000025 21940.393335480014
make_tuple x 20000 0.9328520900000008 21439.62608263008
make_tuple x 40000 1.8562772150000004 21548.505620158674

which basically says hashing sequences, be they strings or tuples, is linear time, yet hashing strings is a lot faster than hashing tuples.

EDIT 2: this proves strings and bytestrings cache their hashes:

import time
s = ('x' * 500_000_000)
t0 = time.perf_counter()
a = hash(s)
t1 = time.perf_counter()
print(t1 - t0)
t0 = time.perf_counter()
b = hash(s)
t2 = time.perf_counter()
assert a == b
print(t2 - t0)

outputs

0.26157095399999997
1.201999999977943e-06
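For contrast, the same kind of (equally unscientific) check suggests tuples do not cache their hash: hashing the same large tuple twice takes roughly the same time both times.

import time

t = (1,) * 10_000_000                  # one large tuple, hashed twice
for _ in range(2):
    t0 = time.perf_counter()
    hash(t)
    print(time.perf_counter() - t0)    # both runs take comparable time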
AKX
  • Thanks for the detailed answer! So my understanding now is: computing the hash for a string takes O(n) time, but this only happens once, at compile time (somewhat similar to the Java string pool). Thus it is safe to claim that we can ignore the length of the strings during set operations, so they are O(1). Tuples are similar, and that's one of the reasons they need to be immutable. – Zelun Wang Nov 06 '19 at 16:44
  • Almost. There are interned strings, which are mostly those encountered while parsing code, etc.; then there are regular immutable strings, for which `hash` is computed exactly once (but strings are not pooled). Tuples are immutable, but according to a comment near the hashing function, caching their hashes has been found to be unnecessary. – AKX Nov 06 '19 at 16:49
  • Very clear! Thanks! – Zelun Wang Nov 06 '19 at 17:57
-1

Wiki is your friend

https://wiki.python.org/moin/TimeComplexity

For the operations above, it seems that they are all O(1) for a set.

kosnik