2

What is the leanest way to store a (big) full-text index that supports lookup of incomplete words? For example, an index lookup for colo should return Colorado (among other things). For context, I am indexing about 60,000 geographical entities (countries, regions/states, metro areas, and cities).

In my first attempt, I indexed every prefix of each word, from two characters in length up to the full word. For example, for the word "Colorado", I created the following index entries:

co
col
colo
color
colora
colorad
colorado
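
(Roughly, the generation just takes every prefix; a simplified sketch, not my actual code:)

def prefixes(word, min_len=2):
    # All prefixes of `word`, from min_len characters up to the full word.
    word = word.lower()
    return [word[:i] for i in range(min_len, len(word) + 1)]

print(prefixes("Colorado"))
# ['co', 'col', 'colo', 'color', 'colora', 'colorad', 'colorado']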

But that resulted in 160,000 index entries. I'm trying to reduce this to something more reasonable while retaining the ability to match on incomplete words. What optimizations should I consider to make the index smaller?

Chris Calo
  • Are you looking for solutions based on other packages (not MySQL)? – Chen Li Nov 18 '11 at 07:58
  • I'm trying to roll my own quick and dirty solution using static text files, but if there's a straightforward and lightweight way to do this using a database package, I'd love to hear it. I guess I would be most interested in a Node.js or Python AppEngine solution. – Chris Calo Nov 19 '11 at 14:29
  • Look up Tries. A Trie is possibly the best structure for your needs. – Mikos Nov 20 '11 at 00:27

2 Answers

3

My recommendation is to use a space-compact version of a trie, e.g., a radix tree. There is a good Python implementation here.

[diagram: a radix tree]

Web service

You can set up a separate web server to provide this lookup service, e.g., using Flask.
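
For example (a minimal sketch only, assuming Flask and the same RadixTree API as in the sample below; the /complete endpoint and its query parameters are made up for illustration):

from flask import Flask, jsonify, request
from radix_tree import RadixTree

app = Flask(__name__)

# Build the trie once at startup.
locations = ["los angeles", "san diego", "san francisco", "san marino", "santa monica"]
trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

@app.route("/complete")
def complete():
    # e.g. GET /complete?q=san&limit=10
    prefix = request.args.get("q", "")
    limit = int(request.args.get("limit", 10))
    return jsonify(matches=trie.search_prefix(prefix, limit))

if __name__ == "__main__":
    app.run()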

Sample code

Some sample code that will

  • load predefined place names using python-radix-tree,
  • complete a prefix to the point where ambiguity starts, and
  • find all prefix matches, returning up to 10 records,

is below:

from radix_tree import RadixTree

locations = [
    "los angeles",
    "san diego",
    "san francisco",
    "san marino",
    "santa monica",
]

# Insert each place name, using the name itself as both key and value.
trie = RadixTree()
for loc in locations:
    trie.insert(loc, loc)

# Complete "s" to the longest unambiguous prefix ("san"),
# then list up to 10 entries that start with "san".
print(trie.complete("s"))
print(trie.search_prefix("san", 10))

Result of sample code

san
['santa monica', 'san diego', 'san francisco', 'san marino']
greeness
0

I think you should only branch a node when it has at least two children, e.g. there should be no separate node for 'colorad'.
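
That is essentially what a path-compressed trie (radix tree) does. A toy sketch of the idea (not the python-radix-tree implementation) could look like this:

class Node:
    def __init__(self):
        self.children = {}   # edge label (a string) -> child Node
        self.is_word = False

def insert(root, word):
    node = root
    while word:
        for label, child in node.children.items():
            # Length of the common prefix of this edge label and the remaining word.
            common = 0
            while common < min(len(label), len(word)) and label[common] == word[common]:
                common += 1
            if common == 0:
                continue
            if common < len(label):
                # Two words diverge inside this edge, so split it here --
                # this is the only place a new branch node gets created.
                mid = Node()
                mid.children[label[common:]] = child
                del node.children[label]
                node.children[label[:common]] = mid
                child = mid
            node, word = child, word[common:]
            break
        else:
            # No edge shares a prefix: hang the whole remaining suffix off this node.
            leaf = Node()
            leaf.is_word = True
            node.children[word] = leaf
            return
    node.is_word = True

With "colorado" and "colombia" inserted, the only internal node is 'colo', with edges 'rado' and 'mbia' hanging off it; 'colorad' never exists on its own.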

I think you should also be able to keep it all in one file, to avoid paying 4 KB of overhead for every few bytes you store. Even 60,000 objects are not going to be very large: an average of 30 bytes per line gives you ~1.8 MB :)

abcde123483
  • I'm not sure what you mean by no branching on 'colorad'. If a user types [colorad], I want the index lookup to return things that have the word Colorado in them. How would that work without an entry for 'colorad'? Are you suggesting sending the original objects to the browser or the generated index? The generated index has 160,000 entries. – Chris Calo Nov 17 '11 at 21:30