
Is there anything better than a Trie for this situation?

  • Storing a list of ~100k English words
  • Needs to use minimal memory
  • Lookups need to be reasonable, but don't have to be lightning fast

I'm working with Java, so my first attempt was to just use a Set<String>. However, I'm targeting a mobile device and already running low on memory. Since many English words share common prefixes, a trie seems like a decent bet to save some memory -- anyone know some other good options?

EDIT - More info - The data structure will be used for two operations

  • Answering: Is some word XYZ in the list?
  • Generating the neighborhood of words around XYZ with one letter different
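For concreteness, the second operation against the current Set<String> can be sketched like this (the `neighbors` helper name is just illustrative, assuming lowercase words):

```java
import java.util.HashSet;
import java.util.Set;

public class Neighborhood {
    // All dictionary words that differ from `word` in exactly one
    // letter position (same length). Tries each substitution and
    // checks it against the set.
    static Set<String> neighbors(String word, Set<String> dictionary) {
        Set<String> result = new HashSet<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;
                chars[i] = c;
                String candidate = new String(chars);
                if (dictionary.contains(candidate)) result.add(candidate);
            }
            chars[i] = original;  // restore before moving on
        }
        return result;
    }
}
```

Whatever structure replaces the Set needs to support this membership test efficiently, since the neighborhood of a length-n word makes 25n lookups.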

Thanks for the good suggestions

allclaws

6 Answers


One structure I saw for minimizing space in a spelling dictionary was to encode each word as:

  • the number of characters (a byte) in common with the last; and
  • the new ending.

So the word list

HERE            would encode as    THIS
sanctimonious                      0,sanctimonious
sanction                           6,on
sanguine                           3,guine
trivial                            0,trivial

You're saving 7 bytes straight up there (19%), I suspect the saving would be similar for a 20,000 word dictionary just due to the minimum distances between (common prefixes of) adjacent words.

For example, I ran a test program over a sorted dictionary and calculated the old storage cost as the word length plus one (for a terminator), and the new cost as one byte for the common length, plus the uncommon suffix, plus a terminator. Here's the final part of that test program's output, showing that you could save well over 50%:

zwiebacks   -> zygote      common=        old=1044662 new=469762 55.0%
zygote      -> zygotes     common=zygote  old=1044670 new=469765 55.0%
zygotes     -> zygotic     common=zygot   old=1044678 new=469769 55.0%
zygotic     -> zymase      common=zy      old=1044685 new=469775 55.0%
zymase      -> zymogenic   common=zym     old=1044695 new=469783 55.0%
zymogenic   -> zymology    common=zymo    old=1044704 new=469789 55.0%
zymology    -> zymolysis   common=zymol   old=1044714 new=469795 55.0%
zymolysis   -> zymoplastic common=zymo    old=1044726 new=469804 55.0%
zymoplastic -> zymoscope   common=zymo    old=1044736 new=469811 55.0%
zymoscope   -> zymurgy     common=zym     old=1044744 new=469817 55.0%
zymurgy     -> zyzzyva     common=zy      old=1044752 new=469824 55.0%
zyzzyva     -> zyzzyvas    common=zyzzyva old=1044761 new=469827 55.0%

To speed lookup, there was a 26-entry table in memory which held the starting offsets for words beginning with a, b, c, ..., z. The words at these offsets always had 0 as the first byte as they had no letters in common with the previous word.
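A sketch of that front-coding scheme in Java (the question's language); the method names and the string-pair representation are my own, chosen to match the table above rather than a packed byte layout:

```java
import java.util.ArrayList;
import java.util.List;

public class FrontCoding {
    // Encode a sorted word list as (shared-prefix length, new suffix) pairs.
    static List<String> encode(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            int common = 0;
            int max = Math.min(prev.length(), w.length());
            while (common < max && prev.charAt(common) == w.charAt(common)) common++;
            out.add(common + "," + w.substring(common));
            prev = w;
        }
        return out;
    }

    // Decode by replaying each shared prefix from the previous word.
    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int comma = e.indexOf(',');
            int common = Integer.parseInt(e.substring(0, comma));
            String word = prev.substring(0, common) + e.substring(comma + 1);
            out.add(word);
            prev = word;
        }
        return out;
    }
}
```

Note that decoding is inherently sequential, which is why the 26-entry offset table matters: it lets a lookup start decoding from the nearest 0-prefix entry instead of from the top of the list.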

This seems to be sort of a trie but without the pointers, which would surely get space-expensive if every character in the tree had a 4-byte pointer associated with it.

Mind you, this was from my CP/M days where memory was much scarcer than it is now.

paxdiablo
  • +1 - thanks for sharing a clever algorithm. BTW: back then, my memory's reliability more than compensated for scarcity.... :-) – Adam Liss Dec 11 '08 at 02:35

A Patricia trie may be more appropriate:

http://en.wikipedia.org/wiki/Patricia_tree

My (fuzzy) memory tells me they were used in some of the early full-text search engines ...
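A Patricia (radix) trie collapses chains of single-child nodes into string-labelled edges, so common prefixes cost one node rather than one node per character. A minimal Java sketch of insert and lookup, not a production implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class RadixTree {
    private static class Node {
        String label;      // edge label leading into this node
        boolean terminal;  // true if a word ends here
        Map<Character, Node> children = new HashMap<>();
        Node(String label) { this.label = label; }
    }

    private final Node root = new Node("");

    void insert(String word) {
        Node node = root;
        int i = 0;
        while (i < word.length()) {
            Node child = node.children.get(word.charAt(i));
            if (child == null) {                 // no edge: hang the rest as a leaf
                Node leaf = new Node(word.substring(i));
                leaf.terminal = true;
                node.children.put(word.charAt(i), leaf);
                return;
            }
            String label = child.label;
            int j = 0;                           // match along the edge label
            while (j < label.length() && i + j < word.length()
                   && label.charAt(j) == word.charAt(i + j)) j++;
            if (j == label.length()) {           // consumed the whole edge
                node = child;
                i += j;
            } else {                             // split the edge at the mismatch
                Node split = new Node(label.substring(0, j));
                node.children.put(word.charAt(i), split);
                child.label = label.substring(j);
                split.children.put(child.label.charAt(0), child);
                if (i + j == word.length()) {
                    split.terminal = true;
                } else {
                    Node leaf = new Node(word.substring(i + j));
                    leaf.terminal = true;
                    split.children.put(leaf.label.charAt(0), leaf);
                }
                return;
            }
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        int i = 0;
        while (i < word.length()) {
            Node child = node.children.get(word.charAt(i));
            if (child == null) return false;
            if (!word.startsWith(child.label, i)) return false;
            i += child.label.length();
            node = child;
        }
        return node.terminal;
    }
}
```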

Paul.

Paul W Homer

What are you doing? If it's spell checking, you could use a bloom filter - see this code kata.
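A minimal Bloom-filter sketch in Java, assuming `BitSet` storage and a derived double-hashing scheme (illustrative only; a real filter would size the bit array and hash count from the word count and target false-positive rate):

```java
import java.util.BitSet;

public class SpellBloom {
    private final BitSet bits;
    private final int size;    // number of bits
    private final int hashes;  // hash functions per word

    SpellBloom(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = Integer.reverse(h1) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String word) {
        for (int i = 0; i < hashes; i++) bits.set(index(word, i));
    }

    // May return a false positive, but never a false negative.
    boolean mightContain(String word) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(word, i))) return false;
        }
        return true;
    }
}
```

The memory win is that only the bit array is stored, not the words themselves; the trade-off is the occasional false positive and, as noted in the comments, no way to enumerate a neighborhood directly.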

Mike Scott
  • I was going to suggest a Bloom filter, too, but given his goals, I don't think a Bloom filter would work. Bloom filters won't answer with a definitive yes/no if a word is in the list, and it won't allow for the generation of a neighborhood. – mipadi Dec 12 '08 at 15:04
  • A bloom filter *will* answer a definitive no if the word *isn't* in the list. Yeah, the neighbourhood requirement was added later :) – Mike Scott Dec 12 '08 at 16:00

You still have to maintain the tree structure itself with a trie. Huffman encoding the alphabet or N-letter groups (for common forms like "tion", "un", "ing") can take advantage of the occurrence frequencies in your dictionary and compress the entries down to bits.
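One way to sketch the N-letter idea in Java: substitute common fragments with single otherwise-unused code points before storage. The codebook here is hypothetical, and a true Huffman coder working at the bit level would compress further:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NgramPacker {
    // Hypothetical codebook: common English fragments mapped to
    // control characters that never appear in dictionary words.
    // A real implementation would pick fragments by measured frequency.
    private static final Map<String, Character> CODEBOOK = new LinkedHashMap<>();
    static {
        CODEBOOK.put("tion", '\u0001');
        CODEBOOK.put("ing",  '\u0002');
        CODEBOOK.put("un",   '\u0003');
    }

    static String pack(String word) {
        String out = word;
        for (Map.Entry<String, Character> e : CODEBOOK.entrySet()) {
            out = out.replace(e.getKey(), e.getValue().toString());
        }
        return out;
    }

    static String unpack(String packed) {
        String out = packed;
        for (Map.Entry<String, Character> e : CODEBOOK.entrySet()) {
            out = out.replace(e.getValue().toString(), e.getKey());
        }
        return out;
    }
}
```

So "nation" packs to three code units and "unending" to five, and unpacking restores the original word exactly.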

Eugene Yokota

Completely wild idea... (i.e. most likely very wrong)

How about storing the words as a tree of all possible letter combinations?

Then each "word" only costs a single char and two pointers (one to the char and one to a terminator.) This way the more letters they have in common the less the cost for each word.

c-a-r-.
      +-p-.
      |    \-s-.
      +-s-.
      \-t-.
           \-s-.

car carp carps cars cart carts

So for 9 chars and 14 pointers we get 6 "words" totalling 25 letters.

Searches would be quick (pointer lookups instead of char comparisons) and you could do some stemming optimisations to save even more space...?

EDIT: Looks like I reinvented the wheel. ;-)
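The structure described is essentially a character trie. A minimal Java sketch, assuming lowercase ASCII words; note the 26-reference child array per node, which is exactly the per-character pointer cost the other answers are trying to avoid:

```java
public class Trie {
    private static class Node {
        Node[] children = new Node[26];  // one slot per letter 'a'..'z'
        boolean terminal;                // the "." terminator in the diagram
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Node();
            node = node.children[i];
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children[c - 'a'];
            if (node == null) return false;
        }
        return node.terminal;
    }
}
```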

Chris Nava

Related to Paul's post:

Any reason why you can't consider a Trie in your case? If it's just an implementation issue, here is a tight implementation of Patricia trie insert and search in C (from NIST):

Patricia Insert in C

Patricia Search in C

Rich