
I have a problem I'd like to code. I have a process which generates numbers 0 through n-1, and I want to stop it when it generates the first repeated element.* I'm looking for a data structure that makes this fast; in particular, adding a new element and testing whether an element is in the structure need to be fast. The expected number of insertions is around sqrt(n) (birthday problem), or actually a bit worse (say sqrt(2n)) because the process slightly favors unique values. In other words, the set is rather sparse: working with numbers up to a billion, only about 30 to 50 thousand values will be used.

A hash set or some kind of self-balancing binary tree seems like the right approach, but maybe there's a better way? For small n a bit array would be superior, but I'm looking at n around 10^9, which I think is too large for that to be practical.

* Actually, it doesn't need to stop right away -- if it's more efficient you can generate elements in blocks and check every now and then.


Note: This was originally posted on math.se but they recommended that I repost here. It's not research-level and so not suitable for cstheory.se.

Charles
  • Hash set is the way to go; just use the modulo operator and put them in a set of reasonable size. – clcto Nov 05 '13 at 17:07

2 Answers


A hash table is indeed the way to go. A properly optimized hash set of integers can be almost as space-efficient as a plain array (you can't quite ignore the load factor) while retaining the high performance you'd expect. Use the key itself as the hash value, don't store the hash separately from the key, and keep the table size a power of two (so you can use a bit mask instead of modulo). If you use open addressing and need deletion, you can borrow a bit from the key to mark tombstones.
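For illustration, here is a minimal sketch of such a table in C (my assumptions, not requirements: keys are below 2^32 - 1, so a slot can store key + 1 with 0 meaning "empty", and the table is sized generously up front so there is no resizing):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint32_t *slots;  /* each slot holds key + 1; 0 means empty */
        uint32_t  mask;   /* capacity - 1, where capacity is a power of two */
    } intset;

    static intset intset_create(uint32_t capacity_pow2) {
        intset s;
        s.slots = calloc(capacity_pow2, sizeof *s.slots);  /* zeroed = all empty */
        s.mask  = capacity_pow2 - 1;
        return s;
    }

    /* Add key to the set. Returns 1 if it was already present, 0 if newly added. */
    static int intset_add(intset *s, uint32_t key) {
        uint32_t i = key & s->mask;            /* the key is its own hash */
        for (;;) {
            uint32_t v = s->slots[i];
            if (v == 0) {                      /* empty slot: not present, insert */
                s->slots[i] = key + 1;
                return 0;
            }
            if (v == key + 1)                  /* already in the set */
                return 1;
            i = (i + 1) & s->mask;             /* linear probing */
        }
    }

For the expected ~50,000 insertions, a table of 2^18 = 262,144 slots costs 1 MB and keeps the load factor under 0.2, so a probe for an absent key almost always hits an empty slot within a step or two.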

For 50k items, these optimizations are probably not worth writing your own hash table (though it's a fun exercise in its own right!). If you can use an existing hash set in your language of choice, use it. Otherwise, see Fast and Compact Hash Tables for Integer Keys for a survey and benchmark of various approaches, and consider Robin Hood hashing, which is very easy to implement, has decent worst-case guarantees, and, although it's not mentioned in the aforementioned paper, is quite fast in my experience (mostly because it's a simple modification of linear probing and inherits its advantages). My C implementation (unfortunately not public yet) is not even 250 lines of code including blanks and comments, none of which are tricky (in comparison to other hash tables). This is without any micro-optimizations.
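To make the idea concrete, here is a sketch of just the Robin Hood insert and lookup (not the implementation mentioned above; it reuses the key + 1 slot convention from the earlier sketch and again assumes a generously sized power-of-two table):

    #include <stdint.h>

    typedef struct {
        uint32_t *slots;  /* key + 1 per slot, 0 = empty */
        uint32_t  mask;   /* capacity - 1, capacity a power of two */
    } rhset;

    /* How far the element in slot i sits from its home bucket. */
    static uint32_t probe_dist(const rhset *s, uint32_t i) {
        uint32_t home = (s->slots[i] - 1) & s->mask;
        return (i - home) & s->mask;
    }

    static void rh_insert(rhset *s, uint32_t key) {
        uint32_t v = key + 1, i = key & s->mask, dist = 0;
        for (;;) {
            if (s->slots[i] == 0) { s->slots[i] = v; return; }
            if (s->slots[i] == v) return;          /* already present */
            uint32_t d = probe_dist(s, i);
            if (d < dist) {                        /* "rob the rich": swap and keep going */
                uint32_t tmp = s->slots[i];
                s->slots[i] = v;
                v = tmp;
                dist = d;
            }
            i = (i + 1) & s->mask;
            dist++;
        }
    }

    static int rh_contains(const rhset *s, uint32_t key) {
        uint32_t i = key & s->mask, dist = 0;
        while (s->slots[i] != 0) {
            if (s->slots[i] == key + 1) return 1;
            /* Early exit: a stored copy of key could never sit past a "richer" element. */
            if (probe_dist(s, i) < dist) return 0;
            i = (i + 1) & s->mask;
            dist++;
        }
        return 0;
    }

The early exit in rh_contains is what keeps unsuccessful lookups cheap: the search stops as soon as it meets an element that is closer to its home bucket than the query currently is, which is exactly the termination rule discussed in the comments below.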

  • Thank you. Can you recommend a hash table implementation in C, or do you suggest I roll my own? (I liked the article, by the way, but Robin Hood hashing is very inappropriate here -- lookups for missing elements happen on all but the last insertion.) – Charles Nov 05 '13 at 21:23
  • @Charles I can't recommend a C hash table: I've heard of a few, but never used any of them seriously. I think rolling your own is a reasonable choice, but not the first choice. I'm not sure why you think Robin Hood hashing is inappropriate? Lookups for missing elements are, as the article says, fast -- in contrast to other open addressing schemes. The search terminates once the probe count is greater than the probe count of the element currently being probed -- which happens after a few elements, due to the low average and maximum probe count. –  Nov 06 '13 at 10:18
  • ~6 probes to determine if an element is present is too high. Linear probing would be faster for my purposes (and perhaps other, more sophisticated versions would be faster still). – Charles Nov 06 '13 at 16:41
  • @Charles The ~6 is a worst-case figure (derived under the assumption of a good hash function). If linear probing can terminate earlier because of an empty slot, so can Robin Hood hashing -- either the two schemes need the same number of probes, or Robin Hood hashing needs fewer. It never needs more, by construction. That doesn't necessarily mean it's faster in total, as it needs more bookkeeping in other places, but claiming it needs more probes is just wrong. –  Nov 06 '13 at 17:19

I think the best data structure is a hash table. The most important part is the hash function: you can either write your own or use something like MurmurHash or CityHash for a uniform distribution.
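For integer keys, the finalizer step of MurmurHash3 (fmix64) alone already gives a uniform spread; as a sketch, it scrambles a key so that nearby inputs land in unrelated buckets:

    #include <stdint.h>

    /* fmix64, the finalizer step of MurmurHash3. */
    static uint64_t fmix64(uint64_t k) {
        k ^= k >> 33;
        k *= 0xff51afd7ed558ccdULL;
        k ^= k >> 33;
        k *= 0xc4ceb9fe1a85ec53ULL;
        k ^= k >> 33;
        return k;
    }

    /* Example: pick a bucket in a power-of-two sized table. */
    static uint32_t bucket_of(uint64_t key, uint32_t mask) {
        return (uint32_t)(fmix64(key) & mask);
    }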

Trying
  • His keys are already integers. There's no reason to hash them, or if there is, any suitable algorithm will be very different from hashes that process variable-length byte strings. –  Nov 05 '13 at 20:50