I have a custom hash set class that uses closed hashing/open addressing (i.e. no linked lists). It's very specific to my needs: it's not generic (it only stores positive long numbers), it needs the number of records to be inserted to be known in advance, and it doesn't support remove. In exchange, it is meant to consume as little space as possible.
Since it has so little functionality, it's a really small and simple class. However, for some reason, when I insert many entries, the number of collisions becomes far too high far too quickly.
Some code (Java):
    public class MyHashSet
    {
        private long[] _entries;

        public MyHashSet(int numOfEntries)
        {
            int neededSize = (int)(numOfEntries / 0.65D);
            _entries = new long[neededSize];
        }

        public void add(long num)
        {
            int cell = ((Long) (num % _entries.length)).intValue();
            while (_entries[cell] != 0)
            {
                if (++cell >= _entries.length)
                    cell = 0;
            }
            _entries[cell] = num;
        }
        ...
I have a main method which instantiates a MyHashSet object with 10 million as a parameter, then calls add() 10 million times, each time with a different randomly generated (but positive) Long number. While with the standard Java HashSet this insertion takes about a second in total, it takes about 13 seconds to finish with MyHashSet. I added a counter for collisions and indeed, the number of collisions is 3-6 billion, far more than expected (I'd guess about 30-40 million is to be expected).
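To make the measurement concrete, here is a minimal sketch of such a test, not the exact code I ran. `CountingHashSet`, `InsertBenchmark`, and the fixed seed are hypothetical names and values introduced for illustration; the set mirrors `MyHashSet` above with a counter added, and the sketch assumes a single `Random` reused for the whole run. A "collision" here means one occupied cell stepped over while probing.

```java
import java.util.Random;

// Hypothetical copy of MyHashSet with a collision counter added.
class CountingHashSet {
    private final long[] entries;
    long collisions; // occupied cells stepped over while probing

    CountingHashSet(int numOfEntries) {
        entries = new long[(int) (numOfEntries / 0.65D)];
    }

    void add(long num) {
        int cell = (int) (num % entries.length);
        while (entries[cell] != 0) {
            collisions++;
            if (++cell >= entries.length) {
                cell = 0; // wrap around to the start of the table
            }
        }
        entries[cell] = num;
    }
}

class InsertBenchmark {
    // Inserts n positive random longs drawn from ONE Random instance
    // and returns the total number of collisions.
    static long run(int n, long seed) {
        Random random = new Random(seed); // created once, outside the loop
        CountingHashSet set = new CountingHashSet(n);
        for (int i = 0; i < n; i++) {
            // Range [1, 2^62]: always positive and never 0 (0 marks an empty cell).
            long value = (random.nextLong() >>> 2) + 1;
            set.add(value);
        }
        return set.collisions;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        long start = System.nanoTime();
        long collisions = run(n, 42L);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(collisions + " collisions in " + millis + " ms");
    }
}
```

Timings from a sketch like this depend on the JVM, heap size, and warm-up, so only the collision count is meaningful to compare between runs.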
Am I doing something wrong? Is there something wrong with the hashing itself? Why would there be so many collisions, and what can I do about it?
Thank you!
P.S.: The number 0.65 in the code means that the table will only ever get 65% full, which I understand is supposed to work well for closed hash sets. For that matter, even if I set it to 20%, the insertion still takes more than 10 seconds.
-- EDIT --
This is quite embarrassing to admit, but my test code recreated the Random object (with System.currentTimeMillis() as a seed) in each iteration of the loop, rather than using the same one for the entire run.
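A minimal sketch of that mistake, assuming the loop looked roughly like this (`ReseedDemo` and its method names are hypothetical, not my actual test code). Every `Random` created with the same seed returns the same first value, so all iterations that fall within one millisecond generate the same number. Since the add() shown above never compares against stored values, each repeated number takes a new cell after probing past all earlier copies of itself, which is consistent with the huge collision counts.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.function.LongSupplier;

class ReseedDemo {
    // Buggy pattern: a new Random per iteration, seeded from the clock.
    // Returns how many distinct values n iterations actually produce.
    static int distinctWhenReseeding(int n, LongSupplier clock) {
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            Random random = new Random(clock.getAsLong()); // same seed => same first value
            seen.add(random.nextLong());
        }
        return seen.size();
    }

    // Fixed pattern: one Random shared by the whole loop.
    static int distinctWhenShared(int n, long seed) {
        Random random = new Random(seed);
        Set<Long> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            seen.add(random.nextLong());
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("reseeded: "
                + distinctWhenReseeding(n, System::currentTimeMillis) + " distinct values");
        System.out.println("shared:   "
                + distinctWhenShared(n, System.currentTimeMillis()) + " distinct values");
    }
}
```

On a typical machine the reseeded loop should report only a handful of distinct values (roughly one per elapsed millisecond), while the shared generator should report close to n.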
After fixing it, the insertion takes about 2-3 seconds. This still seems like too much in comparison: why would the default Java HashSet take only a second to insert into, when it is more 'complex' than MyHashSet? I now get only around 9 million collisions. I also tried removing the logging code to see if it helps, but that doesn't account for the difference. I'd appreciate any ideas, and sorry again for the earlier confusion.
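As a sanity check on that collision count, here is a rough estimate. It assumes the keys are spread uniformly over the table and uses Knuth's classical approximation for linear probing; the symbols are introduced here for illustration: \alpha is the load factor, C(\alpha) the expected number of probes for one insertion, and \bar{C} the average over a fill from empty to \alpha_{\max} = 0.65.

```latex
\begin{align*}
C(\alpha) &\approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^{2}}\right) \\
\bar{C} &= \frac{1}{\alpha_{\max}} \int_{0}^{\alpha_{\max}} C(\alpha)\,d\alpha
         = \frac{1}{2}\left(1 + \frac{1}{1-\alpha_{\max}}\right) \\
\bar{C}\,\big|_{\alpha_{\max}=0.65} &\approx \frac{1}{2}\,(1 + 2.857) \approx 1.93
\end{align*}
```

Counting a collision as every probe after the first, that is about 0.93 collisions per insertion, or roughly 9.3 million for 10 million insertions. So the count of around 9 million is about what linear probing should produce at this load factor, and the remaining slowdown is probably not caused by excess collisions.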