15

I am confused about how a trie saves space and stores data in the most compact form.

Look at the tree below. When you store a character at a node, you also need to store a reference to that node, so every character of every string costs a reference. We do save some space when strings share a common character, but we lose more space storing a reference to each character node.

So isn't there a lot of structural overhead in maintaining the tree itself? If a TreeMap were used instead, say to implement a dictionary, wouldn't it save a lot more space, since each string would be kept in one piece and no space would be wasted on references?

[image: example trie]

Rajat Gupta
  • If a node takes 16 bytes but is reused in more than 16 strings (8 in Java), it saves space. Then it is simply a question of whether you save more space than you are wasting. Assuming that the blue numbers in your example are repeat counts, the savings do turn out to be larger than the wasted space, compared to a simple array of strings. However in this case it would be even better to store complete strings with repeat counts. – han Nov 25 '11 at 06:50

5 Answers

16

To save space when using a trie, you can use a compressed trie (also known as a Patricia trie or radix tree), in which one node can represent multiple characters:

In computer science, a radix tree (also patricia trie or radix trie) is a space-optimized trie data structure where each node with only one child is merged with its child. The result is that every internal node has at least two children. Unlike in regular tries, edges can be labeled with sequences of characters as well as single characters. This makes them much more efficient for small sets (especially if the strings are long) and for sets of strings that share long prefixes.

Example of a radix tree:

radix tree or patricia trie

Note that a trie is usually used as an efficient data structure for prefix matching on a set of strings. A trie can also be used as an associative array (like a hash table) where the key is a string.
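
As a rough illustration of how much merging single-child nodes can save, here is a minimal Java sketch. The class name `TrieVsRadix`, its methods, and the word list are assumptions made for this example and do not come from any library. It only counts the nodes that would survive compression; it does not build real edge labels.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrieVsRadix {
    /** One node of a plain trie: one child per next character. */
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfWord;
    }

    /** Builds a plain (uncompressed) trie from the given words. */
    static Node build(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.endOfWord = true;
        }
        return root;
    }

    /** Counts every node of the plain trie, root included. */
    static int countTrieNodes(Node n) {
        int total = 1;
        for (Node child : n.children.values()) {
            total += countTrieNodes(child);
        }
        return total;
    }

    /**
     * Counts the nodes left after radix compression: a non-root node that
     * has exactly one child and does not end a word is merged with its child.
     */
    static int countRadixNodes(Node n, boolean isRoot) {
        boolean merged = !isRoot && !n.endOfWord && n.children.size() == 1;
        int total = merged ? 0 : 1;
        for (Node child : n.children.values()) {
            total += countRadixNodes(child, false);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical word list with long shared prefixes.
        List<String> words = List.of("romane", "romanus", "romulus",
                "rubens", "ruber", "rubicon", "rubicundus");
        Node root = build(words);
        System.out.println("plain trie nodes: " + countTrieNodes(root));
        System.out.println("radix tree nodes: " + countRadixNodes(root, true));
    }
}
```

For this word list the plain trie has 28 nodes (root included) and the compressed form has 14, so half of the per-node overhead goes away, at the price of storing a short string on each edge.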

David Hu
  • I had a look at the Patricia trie implementation, but is it part of any popular library such as Guava or Apache Commons, as they claim? I couldn't find an implementation in Guava or Apache Commons Collections – Rajat Gupta Nov 25 '11 at 07:10
  • 3
    @Marcos There's no trie implementation in Guava, though there's a long running issue to add one so it may happen eventually. – ColinD Nov 25 '11 at 08:07
  • @David do the numbers indicate the values? – Pacerier Dec 05 '11 at 17:43
  • @DavidHu: I am also working on a Patricia trie problem [here](http://stackoverflow.com/questions/22801857/how-to-implement-insert-method-in-a-trie-data-structure), and I am currently stuck. If you can help me there, it would be a great help. Thanks. –  Apr 03 '14 at 04:48
7

Space is saved when the tree represents a lot of words, because many words share the same path in the tree; the more words you have, the more space you save.

But there is a better data structure if you want to save space. A trie doesn't save as much space as a directed acyclic word graph (DAWG) does, because a DAWG shares common nodes throughout the structure, whereas a trie shares nodes only along common prefixes. The wiki entry explains this in detail, so have a look at it.

Here is the difference (graphically) between Trie and DAWG:

The strings "tap", "taps", "top", and "tops" stored in a trie (left) and a DAWG (right); EOW stands for end-of-word.

The tree on the left is a trie, and the graph on the right is a DAWG. Compare them and see how efficiently the DAWG saves space. The trie has duplicate nodes that represent the same letter or subword, while the DAWG has exactly one node for each letter or subword.
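
To make the comparison concrete, here is a minimal Java sketch that builds a plain trie for the four words and then counts how many nodes remain when identical subtrees are shared, which is the idea behind a DAWG. The class `TrieVsDawg` and its methods are hypothetical names for this example, and merging by a string signature is just one simple way to detect identical subtrees, not a production DAWG construction.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TrieVsDawg {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfWord;
    }

    /** Builds a plain trie from the given words. */
    static Node build(List<String> words) {
        Node root = new Node();
        for (String w : words) {
            Node cur = root;
            for (char c : w.toCharArray()) {
                cur = cur.children.computeIfAbsent(c, k -> new Node());
            }
            cur.endOfWord = true;
        }
        return root;
    }

    /** Counts every node of the plain trie, root included. */
    static int countTrieNodes(Node n) {
        int total = 1;
        for (Node child : n.children.values()) {
            total += countTrieNodes(child);
        }
        return total;
    }

    /**
     * Returns an id for the subtree rooted at n. Two subtrees receive the
     * same id exactly when they have the same end-of-word flag and the same
     * labelled children (compared by id), i.e. when they accept the same
     * set of suffixes and can be shared.
     */
    static int canonicalId(Node n, Map<String, Integer> ids) {
        StringBuilder sig = new StringBuilder(n.endOfWord ? "1" : "0");
        // TreeMap iterates in key order, so the signature is canonical.
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            sig.append('|').append(e.getKey().charValue())
               .append(':').append(canonicalId(e.getValue(), ids));
        }
        String key = sig.toString();
        Integer existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        int id = ids.size();
        ids.put(key, id);
        return id;
    }

    /** Number of distinct subtrees, i.e. nodes left after sharing. */
    static int countDawgNodes(Node root) {
        Map<String, Integer> ids = new HashMap<>();
        canonicalId(root, ids);
        return ids.size();
    }

    public static void main(String[] args) {
        Node root = build(List.of("tap", "taps", "top", "tops"));
        System.out.println("trie nodes: " + countTrieNodes(root));
        System.out.println("DAWG nodes: " + countDawgNodes(root));
    }
}
```

In this model the four words need 8 trie nodes but only 5 shared nodes, counting the root in both. The saving depends entirely on how many branches end in the same suffixes, so a word list with few common endings would gain much less.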

Nawaz
  • This is what I don't understand. For each character we save, we pay the price of a pointer.. so isn't that worse? – Pacerier Dec 05 '11 at 17:48
  • @Pacerier: How many times do you pay for the pointer? Once you have paid for it, you can reuse it for as many repetitions of the same character as you want. – Nawaz Dec 05 '11 at 17:49
  • Separately, I don't see how a DAWG can save space. What's the probability that two different branches have the same tail? E.g., `topsman` is a word but obviously not `tapsman`; so you need two tails anyway for the typical problem statement (an English dictionary in memory), no? – Pacerier Apr 08 '23 at 22:56
5

It's not about cheap space in memory, it's about precious space in a file or on a communications link. With an algorithm that builds that trie, we can send 'ten' in three bits, left-right-right. Compared to the 24 bits 'ten' would take up uncompressed, that's a huge savings of valuable disk space or transfer bandwidth.

David Schwartz
  • that's really a great advantage! – Rajat Gupta Nov 25 '11 at 07:04
  • so, just for in memory structures with no need to transfer data but for a performant and space efficient solution for getting search suggestions for a telephone names directory of around 10,000 names, would using Trie be recommended over TreeMap ? – Rajat Gupta Nov 25 '11 at 07:08
  • @David, re "left-right-right"; That's patricia instead of trie isn't it? – Pacerier Apr 09 '23 at 03:39
3

You might deduce that it saves space on an ideal machine where every byte is allocated efficiently. However, real machines allocate aligned blocks of memory (8 bytes in Java and 16 bytes in some C++ implementations), so it may not save any space.

Java Strings and collections add a relatively high amount of overhead, so the percentage difference can be very small.

Unless your structure is very large, the value of your time outweighs the memory cost so much that using the simplest, most standard, and easiest-to-maintain collection is far more important. For example, your time can very easily be worth 1000x or more the value of the memory you are trying to save.

For example, say you have 10,000 names and can save 16 bytes on each by using a trie (assuming this can be proven without taking more time). That comes to 160 KB, which at today's prices is worth 0.1 cents. If your time costs your company $30 per hour, the cost of writing one line of tested code might be $1.

If you have to think about it for a blink of an eye longer to save 160 KB, it's unlikely to be worth it on a PC. (Mobile devices are a different story, but the same argument applies IMHO.)

EDIT: You have inspired me to add an update http://vanillajava.blogspot.com/2011/11/ever-decreasing-cost-of-main-memory.html

Peter Lawrey
  • The trie would be faster and save space. For 15K entries it could save you 0.2 cents of memory and CPU. If you saw what could be 0.2 cents on the other side of the road, would you cross to pick it up? I would only do this if it took about a second of my time. Given that TreeMap is built in, well tested, documented, and understood by anyone who has to support your code, it will save you far, far, far more in time than it costs in memory (unless you are using many devices with limited memory) – Peter Lawrey Nov 25 '11 at 09:21
  • 1
    If you are writing a library being deployed to thousands or millions of consumers, that 0.2 cents has a multiple, and when being deployed to servers that charge by usage, that 0.2 cents has another multiple. "Performance doesn't matter" is not a solution, it's an ideology. – Ajax Jan 24 '13 at 13:06
  • If you are saving 0.2 cents on one million machines, that is $2000 in total. This is worth spending a few days on, or even a week. If it's only 100K machines, you are looking at a few hours or even a day. If it's only 10K machines, you are looking at a few minutes. If it's only a thousand machines or fewer, you could be wasting your time worrying about it at all. Scale does matter, and most projects don't get deployed to enough machines for worrying about small amounts of resources to be a good idea. – Peter Lawrey Jan 24 '13 at 13:13
  • 2
    I prefer a more optimistic approach of always choosing the most performant solution, even if it takes a little longer. So long as you benchmark and know which method gives the best results in which situation, you'll always know where the bottlenecks are, and avoid them out of habit. Every time I see someone use ArrayList.add(0, item), I leave comments to look at LinkedList instead. If you don't know what your tools are doing under the hood, you will make mistakes that lead to a sluggish app. Paying for server costs is one thing, but a potential user's first impression is priceless. – Ajax Jan 26 '13 at 17:53
  • @PeterLawrey, Correct answer but to wrong question. – Pacerier Apr 08 '23 at 23:01
  • @Pacerier possibly, though the OP thought it was the correct answer even though I answered after the top rating answer. Sometimes I answered what the OP likely meant to ask, rather than literally what they did ask. – Peter Lawrey Apr 12 '23 at 08:32
1

Guava may indeed store the key at each level, but the point to realize is that the key does not really need to be stored, because the path to a node completely defines the key for that node. Apart from the child links, all that actually needs to be stored at each node is a single boolean indicating whether a key ends at that node.

Tries, like any other structure, excel at storing certain types of data. Specifically, tries are best at storing strings that share a common root. Think of storing full-path directory listings for example.
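
A minimal Java sketch of that idea follows. The class `PathTrie` and its methods are hypothetical names chosen for this example. Each node holds only its child links and one boolean, and the key text is rebuilt from the path when it is needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PathTrie {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        boolean endOfKey; // the key text itself is never stored
    }

    private final Node root = new Node();

    /** Adds a key by walking (and creating) one node per character. */
    public void add(String key) {
        Node cur = root;
        for (char c : key.toCharArray()) {
            cur = cur.children.computeIfAbsent(c, k -> new Node());
        }
        cur.endOfKey = true;
    }

    /** True only if exactly this key was added, not merely a prefix of one. */
    public boolean contains(String key) {
        Node n = find(key);
        return n != null && n.endOfKey;
    }

    /** Returns all stored keys that start with the prefix, in sorted order. */
    public List<String> keysWithPrefix(String prefix) {
        List<String> out = new ArrayList<>();
        Node start = find(prefix);
        if (start != null) {
            collect(start, new StringBuilder(prefix), out);
        }
        return out;
    }

    /** Follows the characters of s from the root; null if the path is missing. */
    private Node find(String s) {
        Node cur = root;
        for (int i = 0; i < s.length() && cur != null; i++) {
            cur = cur.children.get(s.charAt(i));
        }
        return cur;
    }

    /** Depth-first walk that rebuilds each key from the path taken. */
    private void collect(Node n, StringBuilder path, List<String> out) {
        if (n.endOfKey) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : n.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
    }

    public static void main(String[] args) {
        PathTrie trie = new PathTrie();
        trie.add("/usr/bin");
        trie.add("/usr/lib");
        trie.add("/var/log");
        System.out.println(trie.keysWithPrefix("/usr"));
    }
}
```

With directory-style keys like these, the shared leading segments are stored once, which is where a trie tends to do well. The `TreeMap` per node is chosen here for readability; it is itself a fairly heavy object in Java, so a space-conscious implementation would likely use a more compact child representation.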

OldCurmudgeon