
I am reading about the trie data structure and found two ways to implement the children of a trie node. Here are the details of the two implementations:

1) An array of length 26 is used to store the children of the trie node.

2) A HashMap is used to store the children of the trie node, with the character as the key and the trie node as the value.

Which implementation is better, and why?
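
To make this concrete, the two variants would look roughly like this in Java (a sketch, assuming a lowercase a-z alphabet; the names are mine):

    import java.util.HashMap;
    import java.util.Map;

    // Variant 1: fixed-size array indexed directly by letter.
    class ArrayTrieNode {
        ArrayTrieNode[] children = new ArrayTrieNode[26]; // one slot per letter a-z
        boolean isWord;

        ArrayTrieNode child(char c) {
            return children[c - 'a']; // direct index, no hashing
        }
    }

    // Variant 2: map from character to child node.
    class MapTrieNode {
        Map<Character, MapTrieNode> children = new HashMap<>();
        boolean isWord;

        MapTrieNode child(char c) {
            return children.get(c); // stores entries only for letters actually present
        }
    }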

ashisahu
  • I suggest you implement them both and compare them. – wildplasser Sep 15 '16 at 19:58
  • NOTE: the memory required by `unordered_map` is about 5 times, and by `map` 6 to 8 times, the memory required by an `array`. See https://stackoverflow.com/questions/59480439/unordered-map-vs-map-vs-array-memory-analysis – Manohar Reddy Poreddy Jan 12 '22 at 00:52

3 Answers


It depends: this is the usual trade-off between memory and speed.

If your strings are short and you don't have memory constraints, then of course go for the array; this way you make your searches faster. The array is also a good choice if letters are distributed fairly evenly among your words.

If your strings can be long and some letters occur very rarely, then go for the hash map; this way you don't occupy memory with unused slots. The hash map is also better if your alphabet is much larger than 26 letters.

The array is faster, but it only potentially consumes more memory than the HashMap, not necessarily. Imagine that your bag of words contains all 26^N possible words of length N that can be made from 26 letters: then every slot of every child array is actually used, and the HashMap would be both slower and consume more memory, since it pays hash-table overhead on top of the same number of entries.
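
To make the trade-off concrete, here is a rough sketch of insertion for both variants (Java, reusing the `ArrayTrieNode` and `MapTrieNode` sketches from the question; the method names are illustrative):

    // Array-backed insert: every step is a direct array index, so the
    // cost of a lookup is just the length of the word...
    static void insert(ArrayTrieNode root, String word) {
        ArrayTrieNode node = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) {
                // ...but 26 slots are allocated even if only one is ever used
                node.children[i] = new ArrayTrieNode();
            }
            node = node.children[i];
        }
        node.isWord = true;
    }

    // Map-backed insert: swaps the direct index for a hash lookup, which
    // is slower per step, but a node with k children stores only k entries.
    static void insert(MapTrieNode root, String word) {
        MapTrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new MapTrieNode());
        }
        node.isWord = true;
    }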

dreamzor
  • I felt the following was vague: `Imagine that your bag of words contain all the possible 26^N words of length N that can be made of 26 letters. Then HashMap would be both slower and consume more memory.` What are the scenario's best and worst cases? What relative slowdown and memory cost have you actually observed in practice? – Manohar Reddy Poreddy Jan 12 '22 at 00:33

There are two very common structures used for trie nodes:

CharNode
    char letter
    CharNode[26] children

CharNode
    char letter
    Dictionary<char, CharNode> children

Both work, but they waste a huge amount of memory: the fixed child array is remarkably sparse, and the dictionary pays per-entry overhead. In my opinion, neither gives a performance benefit that offsets the memory cost. I prefer to use:

CharNode
    char letter
    CharNode[] children

or

CharNode
    char letter
    CharNode* firstChild
    CharNode* sibling

In the first case, the children array is variably sized to hold just the number of children actually used, and the children are arranged with the most frequently used letter first. Sequential search finds the required child.

In the second case, you have a linked list of children, and each child has a sibling pointer. Again, children are arranged in the list based on frequency.
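
In Java terms, the second structure might look something like this (a rough sketch of the idea):

    class CharNode {
        char letter;
        CharNode firstChild;  // head of this node's child list
        CharNode sibling;     // next child of this node's parent
        boolean isWord;

        // Sequential search through the child list. Because children are
        // ordered by letter frequency, the loop usually stops after the
        // first two or three nodes.
        CharNode findChild(char c) {
            for (CharNode child = firstChild; child != null; child = child.sibling) {
                if (child.letter == c) return child;
            }
            return null;
        }
    }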

I prefer the second, because in many runtime environments the cost of allocating an array is pretty high. In .NET, for example, array allocation overhead is on the order of 50 bytes. Considering that a trie node often has fewer than five children, the array allocation overhead is larger than the data that the array holds. With the linked list arrangement, you don't waste any memory.

Sequential search of the small child list is very fast, because the list of children to search is usually very short and the distribution of letter frequencies is usually very skewed. That is, the first two children usually are used far more frequently than the rest. So on average you'll only have to search two or three child nodes.

Either of these saves a huge amount of memory, which can make for a faster program. My tests haven't shown an appreciable performance hit in going with these alternate structures.

Jim Mischel
  • I'm not sure what language/implementation you're using, but why would a dictionary waste memory? Doesn't it only fill up as you add new entries? In your proposed schemes, it's not clear how you keep track of most frequently used letters. Would you need an extra data structure or additional fields to track this? – Lorenz Forvang Mar 25 '19 at 18:46
  • @LorenzForvang A dictionary has large overhead. In C#, for example, each key/value pair in a dictionary has a minimum of 16 bytes of overhead. When you start talking about hundreds of millions of nodes, that gets expensive. The most frequently used letters are computed when you build the trie; my typical use of tries is building a static trie. But you could also do it at runtime: you'd just need to keep track of the character frequency (a rough sketch of that idea follows below). – Jim Mischel Mar 25 '19 at 19:04
  • @JimMischel please share an implementation of `But you could also do it at runtime: you'd just need to keep track of the character frequency.` – Manohar Reddy Poreddy Jan 12 '22 at 00:27
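
For what it's worth, that runtime frequency idea might look roughly like this in Java (a hypothetical sketch, not Jim's code): each node keeps a hit counter, and a child bubbles one position toward the front of its parent's child list whenever its count overtakes its predecessor's.

    class CharNode {
        char letter;
        long hits;            // how many times this child has been traversed
        CharNode firstChild;  // head of this node's child list
        CharNode sibling;     // next node in the parent's child list

        // Find the child for c, count the hit, and swap the child one
        // position forward when its count overtakes the node before it,
        // keeping the list roughly ordered by observed frequency.
        CharNode findChild(char c) {
            CharNode prevPrev = null, prev = null;
            for (CharNode child = firstChild; child != null;
                    prevPrev = prev, prev = child, child = child.sibling) {
                if (child.letter != c) continue;
                child.hits++;
                if (prev != null && child.hits > prev.hits) {
                    prev.sibling = child.sibling;   // unlink child
                    child.sibling = prev;           // reinsert it before prev
                    if (prevPrev == null) firstChild = child;
                    else prevPrev.sibling = child;
                }
                return child;
            }
            return null;
        }
    }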

The array is the classic textbook implementation and the default choice.

The hash map costs less memory when the alphabet is large and the number of keys actually in use is relatively small, but the hash map structure itself costs more memory than an array. So it's a trade-off that depends on the actual keys stored in the trie.

The access speed per child link is almost the same: O(1) in both cases.

leo leo