
In the simplest implementation of a hash table, the index that a key is associated with is generally retrieved in the following way:

size++;                   // grow the table by one slot on every insertion
int hash = hashcode(key); // compute the key's hash code
int index = hash % size;  // map the hash onto the current number of slots, i.e. [0, size - 1]

For an arbitrary key, the index will be an integer in the range [0, size - 1], with each outcome equally likely. The table below lists these probabilities for the first 5 indices as N elements are added.

Index            |   0              1               2              3                4
--------------------------------------------------------------------------------------------
Probabilities    |  1
                 |  1/2             1/2
                 |  1/3             1/3             1/3
                 |  1/4             1/4             1/4             1/4
                 |  1/5             1/5             1/5             1/5             1/5    
                 |                                    ...
                 |  1/N             1/N             1/N             1/N             1/N
____________________________________________________________________________________________
Total            |  H(N)         H(N) - 1        H(N) - 1.5      H(N) - 1.83     H(N) - 2.08

H(N) is the expected number of elements that collect in the chain at index 0. Every chain after it should contain statistically fewer elements.
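To sanity-check this, here is a small Monte Carlo sketch (the class name and parameters are purely illustrative) that assumes hash % size is uniform over [0, size - 1] on each insertion and compares the average chain length at index 0 against H(N):

import java.util.Random;

public class ChainAtIndexZero {
    public static void main(String[] args) {
        int n = 10_000;      // number of insertions
        int trials = 1_000;  // Monte Carlo repetitions
        Random rng = new Random(42);

        double totalAtZero = 0;
        for (int t = 0; t < trials; t++) {
            int count = 0;
            for (int size = 1; size <= n; size++) {
                // k-th insertion: the index is uniform over [0, size - 1],
                // mimicking hash % size with a well-behaved hash function
                if (rng.nextInt(size) == 0) {
                    count++;
                }
            }
            totalAtZero += count;
        }

        // Exact harmonic number H(n) for comparison
        double harmonic = 0;
        for (int k = 1; k <= n; k++) {
            harmonic += 1.0 / k;
        }

        System.out.printf("average chain length at index 0: %.4f%n", totalAtZero / trials);
        System.out.printf("H(%d):                           %.4f%n", n, harmonic);
    }
}

Since the expected contribution of the k-th insertion to index 0 is exactly 1/k, the simulated average should track H(N).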

H(N) is also the Nth partial sum of the harmonic series. Although the harmonic numbers have no closed form, H(N) can be approximated very accurately with the following formula,

H(N) ≈ ln(N) + 0.5772156649 + 1 / (2N) - 1 / (12N^2)

Reference: https://math.stackexchange.com/questions/496116/is-there-a-partial-sum-formula-for-the-harmonic-series
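To see how tight this approximation is, here is a short, purely illustrative check comparing the exact partial sum against the formula above:

public class HarmonicApprox {
    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1_000, 1_000_000}) {
            // Exact partial sum of the harmonic series
            double exact = 0;
            for (int k = 1; k <= n; k++) {
                exact += 1.0 / k;
            }
            // ln(N) + Euler-Mascheroni constant + 1/(2N) - 1/(12N^2)
            double approx = Math.log(n) + 0.5772156649
                          + 1.0 / (2.0 * n) - 1.0 / (12.0 * (double) n * n);
            System.out.printf("N = %-9d  H(N) = %.10f  approx = %.10f%n", n, exact, approx);
        }
    }
}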

The "approximation" part can be attributed to the terms after ln(N) + 0.5772156649. ln(N) is the largest function and thus the amortized time complexity should be O(log n).

Is there something I am missing? I would greatly appreciate clarification here.

Bill Baits
  • Surely there needs to be at least two numbers involved in this calculation: the number of elements you are storing, and the number of buckets in the hashtable. The larger the number of buckets, the shorter the expected length of the chains. Where is the number of buckets in your formula? – khelwood Nov 11 '20 at 00:19
  • Does this answer your question? [Is a Java hashmap search really O(1)?](https://stackoverflow.com/questions/1055243/is-a-java-hashmap-search-really-o1) – Aziz Sonawalla Nov 11 '20 at 00:22
  • You're also missing the fact that hashtables can be resized at any time to ensure that only a small percentage of indices are taken (and subsequently reduce chaining at each index) – Aziz Sonawalla Nov 11 '20 at 00:23
    Saying "algorithm X has complexity Y" is ambiguous. There are (at least) three complexities: worst case, expected case and best case. A typical hashtable's worst case get (i.e. all keys have the same hash) is O(n). Best case is typically O(1) (i.e. all keys have different hash). Expected case depends on expected hash distribution and number of buckets relative to n. Infinite buckets will give O(1) while 1 bucket will give O(n). Your assumption that number of buckets is n+1 is not typical for the algorithm. – sprinter Nov 11 '20 at 00:29
  • @khelwood The number of buckets is `size` in this very simplistic implementation. – Bill Baits Nov 11 '20 at 01:29
  • What I mean is, whatever the time complexity may be in terms of just n is not informative, because `size` would be increased to adjust for it. – khelwood Nov 11 '20 at 01:33
  • @sprinter Yes, I should have better specified the type of complexity I am referring to which is **amortized** or expected time for the average case. – Bill Baits Nov 11 '20 at 01:45
  • @AzizSonawalla Good find since the top answer to that question is very relevant here. I find it problematic that he just caps the constant in `O(k)` to a large enough value of k after which the probability of a collision is negligible and concludes that the Big O reduces to `O(1)`. I find this to be a sort of cheat around recognizing it as `O(log n)` - surely log n and constant time performance are *identical* for a large enough value of `n`, they only differ in behavior for small values of `n`. I simply defined that probabilistic growth in a closed form which happened to reduce to `O(log n)`. – Bill Baits Nov 11 '20 at 02:14
  • Just to make sure I’m following your question correctly - the approach you’ve outlined for implementing a hash table at the very top isn’t how most hash tables are implemented. Is your question “given this particular implementation, what is the amortized cost per insertion?,” or is it “I’m pretty sure this is how hash tables work, and under this assumption I’m getting a different answer for the analysis than what I’m normally hearing it should be?” – templatetypedef Nov 11 '20 at 17:35
  • @templatetypedef Yes, my question is the latter. More specifically, if that is *not* a standard implementation of hash tables I would like to see an implementation that would have an `O(1)` amortized lookup complexity. Some community members above have pointed out that the hash tables resize accordingly to lower collision likelihood but I would like to see a specific implementation that you assert is constant time amortized. You don't have to prove that it's constant time, although that would be nice, I can leave that as an exercise to myself and edit my question Q/A style. – Bill Baits Nov 11 '20 at 19:14

1 Answer


Expanding my comment into an answer - let's start here:

In the simplest implementation of a hash table, the index that a key is associated with is generally retrieved in the following way:

size++;                   // grow the table by one slot on every insertion
int hash = hashcode(key); // compute the key's hash code
int index = hash % size;  // map the hash onto the current number of slots, i.e. [0, size - 1]

This actually isn't how most hash tables are implemented. Rather, most hash tables use a strategy like the following:

  1. Pick some fixed starting number of slots (say, 8, or 16, etc.)
  2. When an element is added, place that element in slot hashcode(key) % tablesize.
  3. Once the ratio of the number of items in the table to the number of slots exceeds a threshold called the load factor, perform a rehash: double the size of the table and redistribute the existing elements by recomputing hashcode(key) % tablesize with the new table size. (This last step ensures that items can still be found given that the table has been resized, and ensures that the items are distributed across the whole table, not just the first few slots.)
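
Here is a minimal sketch of that strategy with chained buckets (the class name, the starting capacity of 16, and the load factor of 0.75 are illustrative choices, not a reference implementation):

import java.util.LinkedList;

public class SimpleChainedHashMap<K, V> {
    private static final double LOAD_FACTOR = 0.75; // rehash threshold (items / slots)

    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private LinkedList<Entry<K, V>>[] buckets;
    private int count = 0;

    @SuppressWarnings("unchecked")
    public SimpleChainedHashMap() {
        buckets = (LinkedList<Entry<K, V>>[]) new LinkedList[16]; // step 1: fixed starting size
    }

    private int indexFor(Object key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity; // step 2: hashcode(key) % tablesize
    }

    public void put(K key, V value) {
        if ((double) (count + 1) / buckets.length > LOAD_FACTOR) {
            rehash(); // step 3: double and redistribute once the load factor is exceeded
        }
        int index = indexFor(key, buckets.length);
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }
        for (Entry<K, V> e : buckets[index]) {
            if (e.key.equals(key)) { e.value = value; return; } // overwrite existing key
        }
        buckets[index].add(new Entry<>(key, value));
        count++;
    }

    public V get(K key) {
        int index = indexFor(key, buckets.length);
        if (buckets[index] == null) return null;
        for (Entry<K, V> e : buckets[index]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private void rehash() {
        LinkedList<Entry<K, V>>[] old = buckets;
        buckets = (LinkedList<Entry<K, V>>[]) new LinkedList[old.length * 2];
        for (LinkedList<Entry<K, V>> chain : old) {
            if (chain == null) continue;
            for (Entry<K, V> e : chain) {
                int index = indexFor(e.key, buckets.length); // recompute with the new size
                if (buckets[index] == null) buckets[index] = new LinkedList<>();
                buckets[index].add(e);
            }
        }
    }
}

Note that the divisor in indexFor is the (mostly stable) table capacity, not a counter that grows on every insertion as in the code from the question.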

The exact analysis of how fast this is will depend on how you implement the hash table. If you use chained hashing (each item is dropped into a slot and then stored in a separate array or linked list containing all the items in that slot) and your hash function is "more or less" uniformly random, then (intuitively) the items will probably be distributed more or less uniformly across the table slots. You can do a formal analysis of this by assuming you indeed have a random hash function, in which case the expected number of items in any one slot is at most the table's load factor (the ratio of the number of items to the number of slots). The load factor is typically chosen to be a constant, which means that the expected number of items per slot is upper-bounded by that constant, hence the claim about O(1) expected lookup times. (You can also get similar O(1) bounds on the expected costs of lookups for linear probing hash tables, but the math is much more involved.)
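Written out, the expectation argument for chained hashing with a uniformly random hash function, n stored keys, and m slots is just

\mathbb{E}[\text{chain length at } h(x)]
  = \sum_{i=1}^{n} \Pr[h(k_i) = h(x)]
  = \sum_{i=1}^{n} \frac{1}{m}
  = \frac{n}{m}
  = \alpha

so if the load factor α is capped at a constant, a lookup examines O(1 + α) = O(1) entries in expectation.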

The "amortized" part comes in because of the rehashing step. The idea is that, most of the time, insertions don't push the load factor above the threshold needed for a rehash, so they're pretty fast. But every now and then you do have to rebuild the table. Assuming you double the table's size, you can show that each rehash is proceeded by a linear number of insertions that didn't trigger a rehash, so you can backcharge the work required to do the rehash to the previous operations.

templatetypedef