I was working on linear probing, which hashes each value modulo the table size, and wrote some code for it.

public class LinearProbing
{
    private int table[];
    private int size;
    LinearProbing(int size)
    {
        this.size=size;
        table=new int[size];
    }
    public void hash(int value)
    {
        int key=value%size;
        while(table[key]!=0)
        {
            key++;
            if(key==size)
            {
                key=0;
            }
        }
        table[key]=value;
    }
    public void display()
    {
        for(int i=0;i<size;i++)
        {
            System.out.println(i+"->"+table[i]);
        }
    }
}

It works fine for every value except zero (0). In Java, every element of a new int array is initialized to zero, so checking against zero to see whether a slot is free breaks down when zero itself has to be hashed: the stored zero looks like an empty slot and can be overwritten. I also tried comparing against null, but that raises a type-mismatch error.

Does anyone have any suggestions?

  • In Java an array element is always instantiated with a default value. What if you used a special value for such elements? – AddeusExMachina Jul 25 '22 at 15:49
  • If you want your array to contain nullable elements, you can use the boxed form of the primitives: `Integer[] table` (note you typically avoid c-style array declarations in java, a la `Integer table[]`). This would initialize the array members to `null` instead of `0`, and then you don't need a sentinel value/number, at a potentially negligible performance cost. In java, `int` cannot be `null`, but `Integer` can (same for `double` and `Double`, and so on) – Rogue Jul 25 '22 at 15:57
  • @AddeusExMachina I have tried the approach you suggested, but in my case I am using an integer array as my hash table. We can reserve a specific value in the integer range as an indicator for an empty slot, but then we can't hash that reserved value: every slot where it is stored looks empty and can be overwritten, which is exactly the problem with zero in my question. – BhanuPrakashSakkuri Jul 26 '22 at 16:06

1 Answer

Computers don't work that way, at least, not without paying a rather great cost.

Specifically, new int[10] quite literally just creates a contiguous block of memory that is precisely large enough to hold 10 int variables, and not a bit more than that. Each int covers 32 bits, and those bits can represent precisely 2^32 different things. Think about it: if I give you a panel of 3 light switches, and all you get to do is walk in, flip some switches, and walk back out again, after which I walk in and look at what you have flipped, and that is the only communication channel we ever get, then we can pre-arrange 8 different signals. Why 8? Because that's 2^3. A bit is like that light switch: it's on or it's off. There is no other option, and there is no 'unset'. There is no way to represent 'oh, you have not been in the room yet' unless we 'spend' one of our 8 arrangements on that signal, leaving only 7.
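To make that concrete, here is a minimal Java sketch (the class and method names are made up for illustration). A slot that was never written and a slot that had 0 written to it hold the same 32 bits, so no comparison can tell them apart.

```java
// Hypothetical demo class; not part of the question's code.
class DefaultZeroDemo {
    // Number of distinct patterns n bits can represent (valid here for 0 <= n <= 30).
    static int patterns(int bits) {
        return 1 << bits;
    }

    // True when an untouched slot and a slot explicitly set to 0 compare equal.
    static boolean untouchedLooksLikeZero() {
        int[] fresh = new int[1];    // never written: Java zero-fills it
        int[] written = new int[1];
        written[0] = 0;              // explicitly stores the value 0
        return fresh[0] == written[0];
    }
}
```

Here patterns(3) is 8, matching the light switch panel, and untouchedLooksLikeZero() returns true.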

Thus, if you want each 'int' to also know 'whether it has been set or not', and for 'not set yet' to be different from any of the valid values, you need an entire extra bit, and given that modern CPUs don't like doing work on sub-word units, that one bit is excessively expensive. Either way, you have to program it yourself.

For example:

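private int[] table;
private int[] set;   // packed 'is it set?' flags: one bit per table slot
private int size;

LinearProbing(int size) {
  this.size = size;
  this.table = new int[size];
  this.set = new int[(size + 31) / 32];   // round up to whole 32-bit words
}

boolean isSet(int idx) {
  int setIdx = idx / 32;
  int bit = idx % 32;
  // shift the wanted bit down to position 0, then mask off every other bit
  return ((this.set[setIdx] >>> bit) & 1) != 0;
}

private void markAsSet(int idx) {
  int setIdx = idx / 32;
  int bit = idx % 32;
  this.set[setIdx] |= (1 << bit);
}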

This rather complex piece of machinery 'packs' that additional 'is it set?' bit into a separate array called set, which we can get away with making 1/32nd the size of the table, as each int contains 32 bits and we need just 1 bit to mark an index slot as set or unset. Unfortunately, this means we need to do all sorts of 'bit wrangling', and thus we're using the bitwise OR operator (|=) and bit shifts (<< and >>>) to isolate the right bit.
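Put together with the probing loop from the question, the idea might look like the sketch below. The class name, the contains method, the full-table check, and the use of Math.floorMod for negative values are my own additions for illustration, not something the question's code had.

```java
// Sketch: linear probing with a packed "occupied" bitmap, so 0 is an ordinary value.
class BitmapLinearProbing {
    private final int[] table;
    private final int[] set;   // one bit per slot; a 1 bit means "occupied"
    private final int size;
    private int count;         // number of occupied slots

    BitmapLinearProbing(int size) {
        this.size = size;
        this.table = new int[size];
        this.set = new int[(size + 31) / 32];
    }

    private boolean isSet(int idx) {
        return ((set[idx / 32] >>> (idx % 32)) & 1) != 0;
    }

    private void markAsSet(int idx) {
        set[idx / 32] |= (1 << (idx % 32));
    }

    // Inserts the value into the first free slot at or after its home index.
    void hash(int value) {
        if (count == size) {
            throw new IllegalStateException("table is full");
        }
        int key = Math.floorMod(value, size);   // non-negative even for negative values
        while (isSet(key)) {
            key = (key + 1) % size;             // wrap around at the end of the table
        }
        table[key] = value;
        markAsSet(key);
        count++;
    }

    // Probes from the home index until the value or an unoccupied slot is found.
    boolean contains(int value) {
        if (size == 0) {
            return false;
        }
        int key = Math.floorMod(value, size);
        for (int probes = 0; probes < size && isSet(key); probes++) {
            if (table[key] == value) {
                return true;
            }
            key = (key + 1) % size;
        }
        return false;
    }
}
```

With a table of size 5, hashing 0, 5 and 10 puts them in slots 0, 1 and 2, and contains(0) is true only after 0 has actually been inserted. The standard library's java.util.BitSet provides the same packed-bit storage through its get and set methods if you'd rather not hand-roll the shifts.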

This is why this is usually not the way to go: bit wrangling isn't cheap.

It's a much, much better idea to take away exactly one of the 2^32 different values a hash can be. You could choose 0, but you can also pick any other value; there is a very minor benefit to picking a large prime number. Let's say 7549.

Now all you need to do is decree a certain algorithm: The practical hash of a value is derived from this formula:

  • If the actual hash is 7549 specifically, we say the practical hash is 6961. Yes, that means 6961 will occur more often.
  • If the actual hash is anything else, including 6961, the practical hash is identical.

Tada: This algorithm means '7549' is free. No practical hash can ever be 7549. That means we can now use 7549 as the marker meaning 'unset'.
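A minimal sketch of that rule, assuming a table that stores hashes in an int[] (the class, constant, and method names are hypothetical). A real table would keep the actual keys or values alongside the hashes so that entries can still be compared exactly; the remapping applies to the stored hash, not to the stored value itself.

```java
import java.util.Arrays;

// Sketch: reserve one hash value (7549) as the 'unset' marker.
class SentinelHashTable {
    static final int UNSET = 7549;        // reserved: never a practical hash
    static final int REPLACEMENT = 6961;  // what an actual hash of 7549 becomes

    private final int[] hashes;

    SentinelHashTable(int size) {
        hashes = new int[size];
        Arrays.fill(hashes, UNSET);       // every slot starts as 'unset' instead of 0
    }

    // The rule from the list above: only 7549 is remapped, everything else is unchanged.
    static int practicalHash(int actualHash) {
        return actualHash == UNSET ? REPLACEMENT : actualHash;
    }

    // Stores the practical hash in the first unset slot found by linear probing.
    // Returns the slot index, or -1 when no slot is free.
    int insert(int actualHash) {
        if (hashes.length == 0) {
            return -1;
        }
        int h = practicalHash(actualHash);
        int start = Math.floorMod(h, hashes.length);
        for (int i = 0; i < hashes.length; i++) {
            int idx = (start + i) % hashes.length;
            if (hashes[idx] == UNSET) {
                hashes[idx] = h;
                return idx;
            }
        }
        return -1;
    }

    boolean isFree(int idx) {
        return hashes[idx] == UNSET;
    }
}
```

A hash of 0 now occupies its slot like any other value, because emptiness is signalled by 7549 rather than by the array's default contents.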

The fact that 6961 is now doubled up is technically not relevant: no hash bucket system can simply assume that equal hashes mean equal objects. After all, there are only 2^32 hashes, so collisions are mathematically impossible to avoid. That's why e.g. Java's own HashMap doesn't JUST compare hashes; it also calls .equals. If you shove 2 different (as in, not .equals) objects that happen to hash to the same value into the same map, HashMap is fine with it. Hence, having more conflicts around 6961 is not particularly relevant.
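A quick way to see this with the standard library: the strings "Aa" and "BB" have the same hashCode() (2112), yet HashMap keeps them as separate entries because they are not .equals. The class name below is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: two keys with identical hash codes coexist in one HashMap.
class CollisionDemo {
    static Map<String, Integer> build() {
        Map<String, Integer> map = new HashMap<>();
        map.put("Aa", 1);   // "Aa".hashCode() == 65 * 31 + 97 == 2112
        map.put("BB", 2);   // "BB".hashCode() == 66 * 31 + 66 == 2112
        return map;
    }
}
```

The resulting map has two entries, and get("Aa") and get("BB") return 1 and 2 respectively.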

The additional cost associated with the additional chance of collision on 6961 is vastly less than the additional cost associated with keeping track of which buckets have been set or not. After all, assuming good hash distribution, our transformation algorithm that frees up 7549 means that roughly 1 in 4 billion items becomes twice as likely to collide. That's... an infinitesimal occurrence on top of another infinitesimal; it's not going to matter.

NB: 6961 and 7549 are randomly chosen prime numbers. Prime numbers are merely slightly less likely to collide; it's not crucial that you pick primes here.

rzwitserloot
  • "not a _bit_ more than that": it's really pedantic, but arrays definitely have a negligible amount of bytes used in the object header (e.g. 24 I believe for an `int[N]`). Though the answer overall is on-point. – Rogue Jul 25 '22 at 16:11
  • True, but it is also doing things, and it doesn't get larger as the array gets larger - thus not enough 'room' to additionally store a 'is it set?' flag for each and every item in the array. – rzwitserloot Jul 25 '22 at 19:51
  • @rzwitserloot Thanks for your detailed explanation; it is generous of you. In my case, though, the solution Rogue suggested in the comments worked out: using wrapper class objects instead of primitives in Java, although the two approaches are similar in some cases. For other languages, your explanation is a master class. – BhanuPrakashSakkuri Aug 01 '22 at 09:02
  • See how my answer says 'at great cost'? @Rogue's answer (use `Integer`), is that. An Integer is orders of magnitude larger and slower, in order to give you that extra `null` option. – rzwitserloot Aug 01 '22 at 12:06