Computers don't work that way, at least, not without paying a rather great cost.
Specifically, a new int[10] quite literally just creates a contiguous block of memory that is precisely large enough to hold 10 int variables, and not a bit more than that. Each int covers 32 bits, and those bits can be used to represent precisely 2^32 different things. Think about it: if I give you a panel of 3 light switches, and all you get to do is walk in, flip some switches, and walk back out again, then I walk in and look at what you flipped, and that is all the communication channel we ever get, then we can pre-arrange 8 different signals. Why 8? Because that's 2^3. A bit is like that light switch: it's on, or off. There is no other option, and there is no 'unset'. There is no way to represent 'oh, you have not been in the room yet' unless we 'spend' one of our 8 arrangements on that signal, leaving only 7 for actual values.
Thus, if you want each int to also know whether it has been set or not, with 'not set yet' distinct from every valid value, you need an entire extra bit, and given that modern CPUs don't like doing work on sub-word units, that one bit is excessively expensive. Either way, you have to program it yourself.
For example:
private int size;
private int[] table;
private int[] set; // 1 bit per table slot: 'has this slot ever been written?'

LinearProbing(int size) {
    this.size = size;
    this.table = new int[size];
    // one int holds 32 flags; round up so every slot gets a bit
    this.set = new int[(size + 31) / 32];
}

boolean isSet(int idx) {
    int setIdx = idx / 32; // which int holds this slot's flag
    int bit = idx % 32;    // which bit within that int
    return ((this.set[setIdx] >>> bit) & 1) != 0;
}

private void markAsSet(int idx) {
    int setIdx = idx / 32;
    int bit = idx % 32;
    this.set[setIdx] |= (1 << bit);
}
This rather complex piece of machinery 'packs' that additional 'is it set?' bit into a separate array called set, which we can get away with making 1/32nd the size of the table, as each int contains 32 bits and we need just 1 bit per slot to mark it as 'set'. Unfortunately, this means we need to do all sorts of 'bit wrangling', and thus we're using the bitwise OR operator (|=) and bit shifts (<< and >>>) to isolate the right bit.
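For completeness, here is that machinery as a compilable sketch. The put helper and the main method are my own additions for illustration; the point of the demo is that storing the value 0 is now distinguishable from 'never stored anything':

```java
public class LinearProbing {
    private int size;
    private int[] table;
    private int[] set; // 1 bit per table slot

    LinearProbing(int size) {
        this.size = size;
        this.table = new int[size];
        this.set = new int[(size + 31) / 32]; // 32 flags per int, rounded up
    }

    boolean isSet(int idx) {
        return ((this.set[idx / 32] >>> (idx % 32)) & 1) != 0;
    }

    private void markAsSet(int idx) {
        this.set[idx / 32] |= 1 << (idx % 32);
    }

    // hypothetical helper: store a value and flag the slot as used
    void put(int idx, int value) {
        this.table[idx] = value;
        markAsSet(idx);
    }

    public static void main(String[] args) {
        LinearProbing lp = new LinearProbing(100);
        lp.put(5, 0); // storing 0 is now different from 'never stored'
        System.out.println(lp.isSet(5)); // true
        System.out.println(lp.isSet(6)); // false
    }
}
```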
This is why, usually, this is not the way: bit wrangling isn't free, either in code complexity or in runtime cost.
It's a much, much better idea to take away exactly one of the 2^32 different values a hash can be. You could choose 0, but you can also choose any arbitrary value; there is a very minor benefit to picking a large prime number. Let's say 7549.
Now all you need to do is decree a certain algorithm: The practical hash of a value is derived from this formula:
- If the actual hash is 7549 specifically, we say the practical hash is 6961. Yes, that means 6961 will occur more often.
- If the actual hash is anything else, including 6961, the practical hash is identical.
Tada: This algorithm means '7549' is free. No practical hash can ever be 7549, which means we can now use 7549 as a marker meaning 'unset'.
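That remapping is a one-liner. A minimal sketch (the names practicalHash, UNSET, and REMAP are mine, not any standard API):

```java
public class SentinelHash {
    static final int UNSET = 7549; // the value we reserve as the 'unset' marker
    static final int REMAP = 6961; // where the reserved value gets collapsed to

    // Collapse the one reserved hash onto another value, freeing UNSET as a marker.
    static int practicalHash(int actualHash) {
        return actualHash == UNSET ? REMAP : actualHash;
    }

    public static void main(String[] args) {
        System.out.println(practicalHash(7549)); // 6961 (the reserved value is remapped)
        System.out.println(practicalHash(6961)); // 6961 (unchanged, now doubled up)
        System.out.println(practicalHash(123));  // 123 (everything else is untouched)
    }
}
```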
The fact that 6961 is now doubled up is technically not relevant: any hash bucket system cannot just assume that equal hashes mean equal objects. After all, there are only 2^32 hashes, so once you have more than 2^32 distinct objects, collisions are mathematically impossible to avoid. That's why e.g. Java's own HashMap doesn't JUST compare hashes: it also calls .equals. If you shove 2 different (as in, not .equals) objects into the same map that happen to hash to the same value, HashMap is fine with it. Hence, having a few more conflicts around 6961 is not particularly relevant.
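You can watch HashMap shrug off a collision with a deliberately colliding key type (the Key class here is purely my own illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // Two distinct Keys always collide: hashCode is a constant.
    static final class Key {
        final String name;
        Key(String name) { this.name = name; }
        @Override public int hashCode() { return 6961; } // every Key collides
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).name.equals(this.name);
        }
    }

    public static void main(String[] args) {
        Map<Key, String> map = new HashMap<>();
        map.put(new Key("a"), "first");
        map.put(new Key("b"), "second"); // same hash, not .equals: no overwrite
        System.out.println(map.size());            // 2
        System.out.println(map.get(new Key("a"))); // first
    }
}
```

Both entries survive because HashMap falls back to .equals after the hashes match.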
The additional cost associated with the extra chance of collision on 6961 is vastly less than the cost of tracking which buckets have been set. After all, assuming good hash distribution, the transformation that frees up 7549 merely makes 6961 twice as likely: roughly 1 extra collision per 4 billion items. That's an infinitesimal on top of another infinitesimal; it's not going to matter.
NB: 6961 and 7549 are randomly chosen prime numbers. Prime numbers are merely slightly less likely to collide; it's not crucial that you pick primes here.