Performant Haskell hashed structure.

Question

I am writing a program that does a lot of table lookups. As such, I was perusing the Haskell documentation when I stumbled upon Data.Map (of course), but also Data.HashMap and Data.HashTable. I am no expert on hashing algorithms, and after inspecting the packages they all seem really similar. As such, I was wondering:

1: What are the major differences, if any?

2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?

You might also be interested in [this blog post](http://gregorycollins.net/posts/2011/06/11/announcing-hashtables). It shows a lot of interesting figures and compares different approaches to hash tables in Haskell. The post eventually concludes that you need impurity in order to make proper hashtables work fast. — fuz, Oct 25 '11 at 20:03
@JeremyW.Sherman Folks on SO often can provide insight that you could not discover on your own. For example, the author of the library or someone familiar could stumble upon the question and explain obscure configuration options, or a way to tweak something. Furthermore, someone could recommend a fourth option `SuperHashMap`. Finally, someone else in my position in the future can see this question and learn from it. — providence, Oct 25 '11 at 20:20
In general, don't identify code by module name. In the Haskell universe module names collide. For example, when you say `Data.HashMap` I am guessing you're talking about the `hashmap` package, which has worse performance than the `unordered-containers` HashMap implementation that I would suggest you use. — Thomas M. DuBuisson, Oct 25 '11 at 20:22
@FUZxxl Thank you for that link. That, combined with @mergeconflict 's answer has convinced me to try the cuckoo implementation in the `hashtables` package. Also, thank you @ThomasM.DuBuisson, I am still new to Haskell and need tips like these! — providence, Oct 25 '11 at 20:29
@ThomasM.DuBuisson It's not the case that `unordered-containers` is faster than `hashmap` for everything. I use `HashMap` from the former and `HashSet` from the latter for best performance in my particular application. — augustss, Oct 25 '11 at 23:54

score 65 · Accepted Answer · edited Mar 08 '20 at 18:45

1: What are the major differences, if any?

Data.Map.Map is a balanced binary tree internally, so its time complexity for lookups is O(log n). I believe it's a "persistent" data structure, meaning it's implemented such that mutative operations yield a new copy with only the relevant parts of the structure updated.
Data.HashMap.Map is a Data.IntMap.IntMap internally, which in turn is implemented as a Patricia tree; its time complexity for lookups is O(min(n, W)), where W is the number of bits in an integer. It is also "persistent". New versions (>= 0.2) of unordered-containers use hash array mapped tries. According to the documentation: "Many operations have a average-case complexity of O(log n). The implementation uses a large base (i.e. 16) so in practice these operations are constant time."
Data.HashTable.HashTable is an actual hash table, with time complexity O(1) for lookups. However, it is a mutable data structure -- operations are done in-place -- so you're stuck in the IO monad if you want to use it.
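
To make the "persistent" point concrete, here is a minimal sketch that uses only `Data.Map.Strict` from the `containers` package. The names `before` and `after` and the sample data are made up for illustration.

```haskell
import qualified Data.Map.Strict as Map

-- A small table; the keys and values are arbitrary illustration data.
before :: Map.Map String Int
before = Map.fromList [("a", 1), ("b", 2)]

-- Inserting does not modify 'before'. It returns a new map that shares
-- the untouched parts of the tree with the old one.
after :: Map.Map String Int
after = Map.insert "c" 3 before

main :: IO ()
main = do
  print (Map.lookup "c" before)  -- Nothing: the old version is unchanged
  print (Map.lookup "c" after)   -- Just 3: the new version sees the insert
```

A mutable hash table would instead overwrite its contents in place, which is why it has to live in `IO` or `ST`.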

2: Which would be the most performant with a high volume of lookups on maps/tables of ~4000 key-value pairs?

The best answer I can give you, unfortunately, is "it depends." If you take the asymptotic complexities literally, you get O(log 4000) = about 12 for Data.Map, O(min(4000, 64)) = 64 for Data.HashMap and O(1) = 1 for Data.HashTable. But it doesn't really work that way... You have to try them in the context of your code.
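
One rough way to "try them" is a crude timing loop like the sketch below. It compares only `Data.Map` and `Data.IntMap`, because both ship with GHC in the `containers` package; other structures could be added by passing their lookup function in the same way. The helper names (`sumLookups`, `timed`), the lookup count, and the `Int` keys are assumptions for illustration, and a proper benchmarking library would give more trustworthy numbers.

```haskell
import Control.Exception (evaluate)
import qualified Data.IntMap.Strict as IntMap
import Data.List (foldl')
import qualified Data.Map.Strict as Map
import System.CPUTime (getCPUTime)

-- Table size from the question; the number of lookups is an arbitrary choice.
tableSize, lookups :: Int
tableSize = 4000
lookups   = 1000000

-- Sum the values found by looking up keys in a repeating cycle.
-- The strict fold forces every lookup to actually happen.
sumLookups :: (Int -> Maybe Int) -> Int
sumLookups look = foldl' step 0 [0 .. lookups - 1]
  where
    step acc i = case look (i `mod` tableSize) of
      Just v  -> acc + v
      Nothing -> acc

-- Force a pure result and report the CPU time that took, in milliseconds.
timed :: String -> Int -> IO ()
timed label result = do
  start <- getCPUTime
  r     <- evaluate result
  end   <- getCPUTime
  let ms = (end - start) `div` 1000000000  -- getCPUTime is in picoseconds
  putStrLn (label ++ ": " ++ show ms ++ " ms, checksum " ++ show r)

main :: IO ()
main = do
  let pairs  = [(k, k * 2) | k <- [0 .. tableSize - 1]]
      ordMap = Map.fromList pairs
      intMap = IntMap.fromList pairs
  -- Build both tables before timing so that only lookups are measured.
  _ <- evaluate (Map.size ordMap + IntMap.size intMap)
  timed "Data.Map   " (sumLookups (`Map.lookup` ordMap))
  timed "Data.IntMap" (sumLookups (`IntMap.lookup` intMap))
```

The checksum is printed so the lookups cannot be optimized away and so the two runs can be checked against each other. Compile with optimizations and use your real key type and access pattern before drawing conclusions.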

answered Oct 25 '11 at 20:18 by mergeconflict · edited Mar 08 '20 at 18:45 by Gal
Thanks, I'm in the IO monad at this point anyways due to lots of file IO. The result of all of these IO and hash lookups is yielded from a coroutine, so it'll be easy to escape IO afterwards. I think I'll give Data.HashTable a shot and see how it does. Thank you very much for your help~ – providence Oct 25 '11 at 20:25
Sure thing; also check out `Data.HashTable.ST` from the [hashtables](http://hackage.haskell.org/package/hashtables) package, noted first above by @FUZxxl. – mergeconflict Oct 25 '11 at 20:31
the time complexity `O(min(n, W))` for a Patricia tree doesn't seem right ... might as well use a list then, no? – gatoatigrado Aug 07 '13 at 18:31
oops, I'm wrong, the worst case is linear. Also, the Patricia tree comment is in Data.HashMap.Strict (type HashMap) it seems ... – gatoatigrado Aug 07 '13 at 22:09
@mergeconflict Hi, where can I read about the internal implementation of data structures like hashmap in Haskell in detail? – rohitwtbs Mar 25 '19 at 05:30
@rohitwtbs, you can check out most sources on either Hackage or GitHub. For example, [unordered-containers-0.2.10.0](https://hackage.haskell.org/package/unordered-containers-0.2.10.0) – Gal Mar 08 '20 at 18:47

score 14 · Answer 2 · answered Oct 25 '11 at 20:17

The obvious difference between Data.Map and Data.HashMap is that the former needs keys in Ord, the latter Hashable keys. Most of the common keys are both, so that's not a deciding criterion. I have no experience whatsoever with Data.HashTable, so I can't comment on that.

The APIs of Data.HashMap and Data.Map are very similar, but Data.Map exports more functions. Some, like alter, are absent in Data.HashMap; others are provided in strict and non-strict variants, whereas Data.HashMap (I assume you meant the one from unordered-containers) provides lazy and strict APIs in separate modules. If you are using only the common part of the API, switching is really painless.
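
As a sketch of how close the common part of the two APIs is, the snippet below runs the same operations against both. It assumes the `HashMap` from the `unordered-containers` package (which must be installed separately) and `Data.Map` from `containers`; the table contents and names are invented for illustration.

```haskell
import qualified Data.HashMap.Strict as HashMap  -- from unordered-containers
import qualified Data.Map.Strict as Map          -- from containers

-- Data.Map requires an Ord instance on the key type.
ages :: Map.Map String Int
ages = Map.fromList [("alice", 30), ("bob", 25)]

-- Data.HashMap requires Eq and Hashable instead; the calls look the same.
agesH :: HashMap.HashMap String Int
agesH = HashMap.fromList [("alice", 30), ("bob", 25)]

main :: IO ()
main = do
  print (Map.lookup "alice" ages)                               -- Just 30
  print (HashMap.lookup "alice" agesH)                          -- Just 30
  print (Map.member "carol" (Map.insert "carol" 41 ages))       -- True
  print (HashMap.member "carol" (HashMap.insert "carol" 41 agesH))  -- True
  print (Map.size ages, HashMap.size agesH)                     -- (2,2)
```

With qualified imports like these, switching implementations for code that sticks to the shared functions is mostly a matter of changing the import and the type signature.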

Concerning performance, Data.HashMap from unordered-containers has pretty fast lookups. Last I measured, it was clearly faster than Data.IntMap or Data.Map; that holds in particular for the (not yet released) HAMT branch of unordered-containers. I think for inserts it was more or less on par with Data.IntMap and somewhat faster than Data.Map, but I'm a bit fuzzy on that.

Both are sufficiently performant for most tasks; for those tasks where they aren't, you'll probably need a tailor-made solution anyway. Since you ask specifically about lookups, I would give Data.HashMap the edge.

Be _sure_ you are talking about the same `Data.HashMap`. Unordered-containers was measurably faster than other packages' `HashMap` offerings, last I checked. — Thomas M. DuBuisson, Oct 25 '11 at 20:21
Good point. I had completely forgotten there was a previous `Data.HashMap`. — Daniel Fischer, Oct 25 '11 at 20:31
In 2015, there is no longer a Data.HashMap in base. There is one in the hashtables package. — sinelaw, Jun 24 '15 at 19:24

score 3 · Answer 3 · edited Jun 29 '15 at 14:10

Data.HashTable's documentation now says "use the hashtables package". There's a nice blog post explaining why hashtables is a good package here. It uses the ST monad.
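
For a feel of what using the `hashtables` package in `ST` looks like, here is a minimal sketch. It assumes the `hashtables` package is installed and picks the `Data.HashTable.ST.Cuckoo` variant; the helper name `lookupIn` and the sample data are made up for illustration.

```haskell
import Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Cuckoo as C  -- from the hashtables package

-- Build a mutable table inside ST, look one key up, and return a pure result.
-- The mutation never escapes runST, so callers see an ordinary pure function.
lookupIn :: [(String, Int)] -> String -> Maybe Int
lookupIn pairs key = runST $ do
  table <- C.new
  mapM_ (\(k, v) -> C.insert table k v) pairs
  C.lookup table key

main :: IO ()
main = do
  print (lookupIn [("a", 1), ("b", 2)] "b")  -- Just 2
  print (lookupIn [("a", 1), ("b", 2)] "z")  -- Nothing
```

Rebuilding the table on every call, as this toy function does, would defeat the purpose in real code; the usual pattern is to keep all the lookups inside one `ST` (or `IO`) block that owns the table.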

answered Aug 07 '13 at 22:11 by gatoatigrado · edited Jun 29 '15 at 14:10 by Pubby

The `hashtables` package uses the `ST` monad not the `State` monad. – is7s Aug 08 '13 at 00:09
right, `ST` is just the transformer (i.e. useful) version of `State`. – gatoatigrado Aug 08 '13 at 00:28
No it's not. There's a huge conceptual difference. The `ST` monad carries the state at the type level while the `State` monad carries the state at the value level :) – is7s Aug 08 '13 at 00:39
