
I've been studying tries and looking into their advantages and disadvantages. They're quite useful in many practical applications like dictionaries and spell checkers because of their O(m) look-ups (where m is the length of the string being looked up), and they offer other advantages such as ordered retrieval of strings and finding common prefixes. So the advantages are pretty clear to me, but the limitations are a bit confusing.
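For concreteness, here is a minimal trie sketch (just my own toy illustration, not taken from any library) of why a lookup costs O(m): each character of the key is one step down the tree.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:                          # one node per character: O(m)
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:                          # follows m pointers: O(m)
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
for w in ["car", "cart", "care"]:
    t.insert(w)
print(t.contains("cart"))   # True
print(t.contains("ca"))     # False ("ca" is only a prefix)
```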

I'm following this link : https://en.wikipedia.org/wiki/Trie

Drawbacks listed here are:

  1. Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random-access time is high compared to main memory.

Follow-up question: Why is there a scenario involving secondary storage? Aren't tries supposed to be stored in main memory? If they're stored in secondary storage, then there's no point in using a trie anyway, since disk access will always dominate the lookup time.

  2. Some tries can require more space than a hash table, as memory may be allocated for each character in the search string, rather than a single chunk of memory for the whole entry, as in most hash tables.

Follow-up question: Is it because a trie contains more references/pointers for connecting each character to the next, and those consume more bytes than storing the whole string contiguously? (I got this reason from one of the answers here.) Can anyone elaborate on this too?
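A rough back-of-the-envelope illustration of what I mean (the byte counts are CPython-specific and only approximate, so treat them as assumptions):

```python
import sys

word = "gaurav"

# Stored in a hash table: one contiguous string object.
print(sys.getsizeof(word))      # ~55 bytes for the whole 6-character string

# Stored in a naive trie: one node (here, a dict of child pointers) per character.
node_cost = sys.getsizeof({})   # ~64 bytes of dict overhead per node
print(len(word) * node_cost)    # ~384 bytes of per-node overhead alone
```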

I'd really appreciate some help here. Thanks.

gaurav jain

2 Answers


First, "constant O(m) look-ups" is meaningless. Lookup time in a trie is O(m): it depends on the length of the string you're looking up.

A well constructed hash table (i.e. a good hash function and a reasonable load factor) has O(1) lookup time.

Assuming competent construction, looking up a string in a hash table will be much faster than looking it up in a trie.

Tries and hash tables are used for different things. If all you want is the ability to look up a word, then a hash table will be faster. If you want to find common prefixes, ordered retrieval, or do similar things, then you want a trie.

A hash table can look up individual strings very quickly. It's like a thoroughbred racehorse. That's all it can do. A trie, on the other hand, is a workhorse that can do a lot of things. It'll never be as fast at lookups as a hash table, but it can do lots of things that the hash table can't do.

For example, finding all the words that start with "pre" takes O(n) time with a hash table (where n is the number of words) because you have to examine every entry. With a trie, it takes three probes to find the subtree that contains all of those words, and then all you have to do is traverse that subtree. Sure, the worst case is O(n), but that's only if all the words in your trie start with "pre".
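Here's a quick sketch of that prefix search, using nested dicts as trie nodes (the "$" end-of-word marker is just an illustrative convention, not a standard):

```python
def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def words_with_prefix(root, prefix):
    node = root
    for ch in prefix:                 # len(prefix) probes, e.g. 3 for "pre"
        node = node.get(ch)
        if node is None:
            return                    # nothing starts with this prefix
    def walk(node, path):             # traverse only the subtree under the prefix
        if "$" in node:
            yield prefix + path
        for ch, child in node.items():
            if ch != "$":
                yield from walk(child, path + ch)
    yield from walk(node, "")

root = {}
for w in ["prefix", "present", "press", "price", "apple"]:
    insert(root, w)

print(sorted(words_with_prefix(root, "pre")))
# ['prefix', 'present', 'press']  ("price" and "apple" are never visited)
```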

While it's true that going to disk will be slower than if the entire trie were in memory, it's wrong to say that a disk-based trie offers no advantage over the alternatives. If the data won't fit in memory, then no matter what data structure you use, you'll need some external (i.e. non-memory) storage. The fact that your data access is slower when it's on disk does not fundamentally change the advantages or disadvantages of a trie vs. a hash table. For example, a disk-based trie will still be faster than a disk-based hash table when it comes to finding all the words with a particular prefix.

A hash table's overhead is typically a constant multiple of the number of words it contains. That is, in addition to the memory required to store the strings, there is per-string overhead to store the mapping between hash code and string.

Memory for a trie is a little more involved. In the worst case, there is one node per character. All those little node allocations start adding up. Imagine a dictionary that contains 200,000 words, and the average word length is five characters. That's a million nodes of overhead.

Fortunately, there are ways to greatly compress a trie, without losing much, if any, performance. The resulting data structure is much smaller and more cache-friendly than a naively constructed trie.
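One common form of compression (a radix / Patricia-style trie; the code below is only a sketch of the idea, not a production implementation) merges chains of single-child nodes into multi-character edge labels:

```python
def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True

def compress(node):
    out = {}
    for ch, child in node.items():
        if ch == "$":
            out["$"] = True
            continue
        label, cur = ch, child
        # fuse chains of single-child nodes into one multi-character edge
        while len(cur) == 1 and "$" not in cur:
            (nxt, cur), = cur.items()
            label += nxt
        out[label] = compress(cur)
    return out

root = {}
for w in ["romane", "romanus", "romulus"]:
    insert(root, w)

print(compress(root))
# {'rom': {'an': {'e': {'$': True}, 'us': {'$': True}}, 'ulus': {'$': True}}}
```

The node count now depends on the number of branch points rather than the total number of characters, which is where most of the space savings come from.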

Jim Mischel
  • Hi Jim, thanks for your answer. Yes, it's wrong to say constant O(m) lookups. Other than that, doesn't a hash table take O(m) time to compute the hash, so the total time for a hash lookup should also be O(m)? (Otherwise the hashes for "gaur" and "gaurav" would be the same.) Can you clarify this part a bit more? – gaurav jain Sep 29 '15 at 15:27

It's been a while since this was asked, but I'd like to add, if anyone is wondering, that a good hashing function should take O(1) time for fixed memory values such as primitive types or fixed-length lists of primitive types. The same logical operations are often applied on all values to be hashed (logical shift left and right, bitwise operations, etc.). These operations take the same time regardless of what value they're used on. This makes hash tables far quicker, and relatively reliable, at storing values that use up a predictable amount of space. Hashing a string can also be done in O(1) time if you traverse the underlying character array and only pick out characters at intervals to ensure that you're always hashing the same amount of memory.

For example, for a string of length 10, you may hash the 10 characters in the underlying character array, whereas for a string of length 100, you hash based on every tenth character.
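A sketch of that interval-sampling idea (an illustrative hash function I made up for this answer, not the actual implementation of any particular library): the work is capped at roughly a fixed number of sampled characters, so hashing stays O(1) regardless of string length.

```python
def sampled_hash(s, samples=10):
    step = max(1, len(s) // samples)
    h = 0
    for i in range(0, len(s), step):   # examines roughly `samples` characters
        h = (h * 31 + ord(s[i])) & 0xFFFFFFFF
    return h

print(sampled_hash("a" * 10))    # hashes all 10 characters
print(sampled_hash("a" * 100))   # hashes every 10th character
```

The trade-off, of course, is that strings which agree on the sampled positions collide, so the sampling interval has to be chosen with the expected data in mind.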

So, to answer your question, hashing is usually completed in constant time, whereas insertion into or retrieval from a trie is O(n), where n is the length of the value being inserted or retrieved. Even if there is little difference in practice, constant time has the advantage of being predictable: all operations on a hash table take roughly the same time, give or take. But with a trie (say, one representing a dictionary of Welsh place names), searching for Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch with one character at the end changed will take far more time than searching for "a": the trie will walk through almost the whole string before realising it is not in the dictionary. Google and other tech companies tend to prefer predictable (but evenly distributed) hashing, partly to avoid such security concerns.