
I have a custom closed-hashing/open-addressing hash set class (i.e. no linked lists). It's very specific to my needs: it isn't generic (it stores only positive long numbers), it needs the number of records to be inserted to be known up front, and it doesn't support removal - but it is meant to consume as little space as possible.

Since it has so little functionality, it's a really small and simple class. However, for some reason, when I insert many entries, the number of collisions becomes far too high, far too fast.

Some code (Java):

public class MyHashSet
{
    private long[] _entries;

    public MyHashSet(int numOfEntries)
    {
        int neededSize = (int)(numOfEntries / 0.65D);
        _entries = new long[neededSize];
    }

    public void add(long num)
    {
        int cell = ((Long) (num % _entries.length)).intValue();

        while (_entries[cell] != 0)
        {
            // Linear probing: step to the next cell, wrapping around
            // to the start of the table when we fall off the end.
            if (++cell >= _entries.length)
                cell = 0;
        }

        _entries[cell] = num;
    }
...

I have a main which instantiates a MyHashSet object with 10 million as the parameter, then calls add() 10 million times, each time with a different randomly-generated (yet positive) long number. While this insertion takes about a second in total with the standard Java HashSet, it takes about 13 seconds to finish with MyHashSet. I added a counter for collisions, and indeed the number of collisions is 3-6 billion - way more than expected (I'd guess about 30-40 million would be expected).

Am I doing something wrong? Is there something wrong with the hashing itself? Why would there be so many collisions, and what can I do about it?

Thank you!

P.S.: The number 0.65 in the code means the table will only get 65% full - a load factor which, as far as I know, is supposed to work well in closed hash sets. For what it's worth, even if I set it to 20%, the insertion still takes > 10 seconds.

-- EDIT --

This is quite embarrassing to admit, but my test code recreated the Random object (with System.currentTimeMillis() as a seed) in each iteration of the loop, rather than reusing the same one for the entire run.

After fixing it, the insertion takes about 2-3 seconds to complete. This still seems like too much in comparison - why does the default Java HashSet take only a second to insert into, when it is more 'complex' than MyHashSet? I now get only around 9 million collisions. I also tried removing the logging code to see if it helps, but it still doesn't account for the difference. I'd appreciate any ideas, and sorry again for the confusion before.
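Roughly, the broken loop looked like this (a simplified reconstruction; the surrounding SeedBugDemo class and the value-generation expression are illustrative, not my exact test code):

import java.util.Random;

public class SeedBugDemo
{
    public static void main(String[] args)
    {
        MyHashSet set = new MyHashSet(10000000);

        // Broken: a new Random seeded with the current millisecond on
        // every iteration. All iterations that run within the same
        // millisecond get the same seed and thus produce the same
        // "random" value, so the same cells get hit over and over.
        //
        //     for (int i = 0; i < 10000000; i++)
        //     {
        //         Random r = new Random(System.currentTimeMillis());
        //         set.add((r.nextLong() >>> 1) | 1);
        //     }

        // Fixed: one Random instance reused for the entire run.
        Random r = new Random();
        for (int i = 0; i < 10000000; i++)
        {
            // Shift the sign bit away and set the lowest bit, so the
            // value is strictly positive (0 marks an empty cell).
            set.add((r.nextLong() >>> 1) | 1);
        }
    }
}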

user976850
  • Are you sure that your randomly generated numbers sufficiently cover the 10-million range? For instance, if you generate 10 million numbers in the 0..1 million range, there will be a LOT of duplicates, which is to be expected when simply using the modulo of the size. Can you show us the testing code? – pcalcao Mar 26 '12 at 14:11

2 Answers


The first thing I notice is gratuitous boxing on the line

int cell = ((Long) (num % _entries.length)).intValue();

which is much slower than

int cell = (int) (num % _entries.length);

(Note that num % _entries.length will always fit in an int, since _entries.length is itself an int.)

Admittedly, Java's HashSet would suffer from similar overhead anyway, but that's at least one obvious thing to fix.

Also, it's probably to your advantage to make sure that the table size is a prime number. The simplest way to do this is BigInteger.valueOf((int)(numOfEntries / 0.65)).nextProbablePrime().intValue(), and since it's a one-time cost it shouldn't affect overall performance too badly.
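Dropped into the constructor from the question, that would look something like this (a sketch; note the extra import):

import java.math.BigInteger; // at the top of the file

public MyHashSet(int numOfEntries)
{
    // Round the target capacity up to the next (probable) prime so
    // that num % _entries.length spreads entries across all cells even
    // when the inputs share bit patterns. One-time cost at construction.
    int neededSize = BigInteger.valueOf((int) (numOfEntries / 0.65D))
                               .nextProbablePrime()
                               .intValue();
    _entries = new long[neededSize];
}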

Alternately, Java's HashSet uses power-of-2 hash table sizes, so it can use a mask (value & (_entries.length - 1), basically) rather than %, which is frequently more expensive.
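A sketch of that variant (the round-up-to-power-of-two logic here is one common way to do it, not necessarily what HashSet does internally):

public MyHashSet(int numOfEntries)
{
    // Round the target capacity up to the next power of two so that
    // indexing can use a cheap bit mask instead of %.
    int target = (int) (numOfEntries / 0.65D);
    int size = Integer.highestOneBit(target);
    if (size < target)
        size <<= 1;
    _entries = new long[size];
}

public void add(long num)
{
    // length is a power of two, so (length - 1) is an all-ones bit
    // mask; for non-negative num this equals num % length.
    int cell = (int) (num & (_entries.length - 1));

    while (_entries[cell] != 0)
    {
        if (++cell >= _entries.length)
            cell = 0;
    }

    _entries[cell] = num;
}

Like %, the mask keeps only the low-order bits, so this works when the input's variation lives in the low bits (see the comments below).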

Louis Wasserman
  • I'd be less worried about such a small performance overhead (although it's useless and should be removed anyhow) than about the fact that we'd get array-out-of-bounds exceptions a bit over 50% of the time if the hash were uniformly distributed. Using prime numbers instead of powers of 2 is one possible implementation, but it's somewhat costly, so that's a tradeoff (we can get acceptable results without it too). The real problem is the linear next-position function (even worse, going in only one direction!) – Voo Mar 26 '12 at 14:49
  • Where would you get an out of bounds exception? – Louis Wasserman Mar 26 '12 at 14:51
  • I don't mind the linear probing technique so badly, especially when the inputs are supposed to be uniformly randomly distributed. – Louis Wasserman Mar 26 '12 at 14:53
  • @Voo: The OP specified that all inputs are strictly positive. ;) (I worried about that myself briefly before realizing that it wouldn't occur for this scenario.) – Louis Wasserman Mar 26 '12 at 14:53
  • Thanks for all the comments. About the boxing issue, while my tests use completely random Long numbers, my actual use will have long numbers with similar first 4 bytes (starting with the MSB that is), and so the result of (int) (num % _entries.length) would actually not be good for me since it will throw away the 4 LSBs which make for the diversity in the hashing... – user976850 Mar 26 '12 at 15:00
  • I missed that, right - though I'd still fix it; it's cheap, so why leave such a bug in. Linear probing (even his bugged version) should work just as well as any other solution with perfectly uniform input, but I can't think of many other places the problem could be. – Voo Mar 26 '12 at 15:02
  • @user976850 If I'm not completely out of my mind the code does exactly the same as the simpler one, except with a hefty performance penalty. If the higher 4 bits are important, use a prime for the table size or do some simple bit shifting. – Voo Mar 26 '12 at 15:04
  • @user976850, `(int) (num % _entries.length)` is exactly equivalent to your old code. It gives the exact same result for the exact same input. But `%` preserves lower-significance bits more than high-significant bits, which seems like what you want, right? – Louis Wasserman Mar 26 '12 at 15:07
  • Sorry, I guess I got it wrong - they do indeed give the same result. However, it didn't really seem to boost the performance at all (perhaps the compiler already sorted it out behind the scenes?). The prime-size table suggestion didn't help performance either. However, the & instead of % tip did help; it takes about 1.8 seconds now. Still slower than the regular HashSet though. – user976850 Mar 26 '12 at 15:16

First: fix your modulo function - otherwise you'll get ArrayIndexOutOfBoundsExceptions (for negative inputs), and it's easy to fix at no real performance cost (just an AND). Also, while you're at it, do what Louis proposes and get rid of the useless Long cast.

Anyway, the real problem is that you're using a horrible next-cell function when a cell is already taken. Linear probing is generally a bad idea, and you're making it even worse by probing in only one direction. If your numbers are not perfectly uniformly distributed, you'll get lots of clashes. Double hashing works pretty well in practice, but you could also fix your linear probing and test whether that helps.
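For illustration, a double-hashing probe could look roughly like this (a sketch assuming a prime table length, so every step size is co-prime to the length and the probe visits every cell; the second-hash formula is just one common choice):

public void add(long num)
{
    int len = _entries.length;
    int cell = (int) (num % len);

    // Second hash: derive a step size in [1, len - 1] from bits the
    // first hash ignores. The step must never be 0; with a prime
    // table length, any step in that range cycles through all cells.
    int step = 1 + (int) ((num / len) % (len - 1));

    while (_entries[cell] != 0)
    {
        cell += step;
        if (cell >= len)
            cell -= len;
    }

    _entries[cell] = num;
}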

Then you should either use a prime number for the table size, as Louis proposes (which has some theoretically provable advantages but is slower), or use the next power of 2. At the moment you're combining the disadvantages of both approaches.

Voo