One of the best ways to answer something like this is to dig into the implementation :)
Notwithstanding some of that optimization magic described in the header of setobject.c
, adding an object into a set reuses hashes from strings where hash()
has already been once called (recall, strings are immutable), or calls the type's hash implementation.
For Unicode/bytes objects, we end up via here to _Py_HashBytes
, which seems to have an optimization for small strings, otherwise it uses the compile-time configured hash function, all of which naturally are somewhat O(n)-ish. But again, this seems to only happen once per string object.
For tuples, the hash implementation can be found here – apparently a simplified, non-cached xxHash.
However, once the hash has been computed, the time complexity for sets should be around O(1).
EDIT: A quick, not very scientific benchmark:
import time
def make_string(c, n):
return c * n
def make_tuple(el, n):
return (el,) * n
def hashtest(gen, n):
# First compute how long generation alone takes
gen_time = time.perf_counter()
for x in range(n):
gen()
gen_time = time.perf_counter() - gen_time
# Then compute how long hashing and generation takes
hash_and_gen_time = time.perf_counter()
for x in range(n):
hash(gen())
hash_and_gen_time = time.perf_counter() - hash_and_gen_time
# Return the two
return (hash_and_gen_time, gen_time)
for gen in (make_string, make_tuple):
for obj_length in (10000, 20000, 40000):
t = f"{gen.__name__} x {obj_length}"
# Using `b'hello'.decode()` here to avoid any cached hash shenanigans
hash_and_gen_time, gen_time = hashtest(
lambda: gen(b"hello".decode(), obj_length), 10000
)
hash_time = hash_and_gen_time - gen_time
print(t, hash_time, obj_length / hash_time)
outputs
make_string x 10000 0.23490356100000004 42570.66158311665
make_string x 20000 0.47143921999999994 42423.284172241765
make_string x 40000 0.942087403 42458.905482254915
make_tuple x 10000 0.45578034300000025 21940.393335480014
make_tuple x 20000 0.9328520900000008 21439.62608263008
make_tuple x 40000 1.8562772150000004 21548.505620158674
which basically says hashing sequences, be they strings or tuples, is linear time, yet hashing strings is a lot faster than hashing tuples.
EDIT 2: this proves strings and bytestrings cache their hashes:
import time
s = ('x' * 500_000_000)
t0 = time.perf_counter()
a = hash(s)
t1 = time.perf_counter()
print(t1 - t0)
t0 = time.perf_counter()
b = hash(s)
t2 = time.perf_counter()
assert a == b
print(t2 - t0)
outputs
0.26157095399999997
1.201999999977943e-06