
I'm thinking of implementing a small search engine, but I'm not sure how search engines do word segmentation.

My thoughts are like this:

  1. Build a word dictionary containing popular words
  2. For each sentence in the HTML document, split it into words on spaces
  3. Do a linear search to check whether any of the words are in the dictionary. If they are, they are keywords of that page.
  4. Let each keyword be a DB table. Store the URL in all corresponding keyword tables

So let's say we have the sentence "I invited her to have dinner in a local restaurant near downtown." The words, excluding stop words, are: {invited, dinner, local, restaurant, downtown}

The dictionary contains only the words {invite, dinner, restaurant}.
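For concreteness, here is a minimal in-memory sketch of steps 1-4 applied to that example. The URL, the stop-word list, and the HashMap standing in for the per-keyword DB tables are all my own illustrative assumptions, not how a real engine stores things:

```java
import java.util.*;

public class TinyIndexer {
    // Step 1: a dictionary of "popular" words (toy example from the question)
    static final Set<String> DICTIONARY = Set.of("invite", "dinner", "restaurant");
    static final Set<String> STOP_WORDS = Set.of("i", "her", "to", "have", "in", "a", "near");

    // keyword -> URLs; stands in for the per-keyword DB tables of step 4
    static final Map<String, Set<String>> INDEX = new HashMap<>();

    static void indexPage(String url, String text) {
        // Step 2: split the text into words on whitespace/punctuation
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            // Step 3: keep only words found in the dictionary
            if (DICTIONARY.contains(word)) {
                INDEX.computeIfAbsent(word, k -> new HashSet<>()).add(url);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical URL, for illustration only
        indexPage("http://example.com/page1",
                  "I invited her to have dinner in a local restaurant near downtown.");
        System.out.println(INDEX); // only "dinner" and "restaurant" match exactly
    }
}
```

Note that with exact matching "invited" is silently dropped, because the dictionary only has "invite"; that is exactly problem 2 below.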

Here are the problems:

  1. How do I handle words outside the dictionary? (e.g. "downtown")
  2. How do I deal with past tense, plural forms, etc.? Should I store all words with a certain prefix together? (e.g. "invite" would cover "invites", "invited", "invitation"...) Then what about words like "back" and "backwards"? (See the stemming part of the sketch after this list.)
  3. How do I handle queries like "local restaurant"? Simply combining results from "local" and "restaurant" does not seem to be a good solution, while storing "local restaurant" as another keyword table may result in many more duplicates and bring difficulties in word segmentation. (See the phrase-query part of the sketch after this list.)
  4. Are there any better ways than my thoughts?
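Problems 2 and 3 are usually attacked with stemming and a positional index rather than per-phrase tables. Below is a rough sketch of both ideas, assuming a naive suffix-stripping stemmer and an in-memory index; both are illustrative stand-ins, not how a production engine such as Lucene actually implements them. It also sidesteps problem 1 by indexing every word (no dictionary filter), so out-of-dictionary words like "downtown" remain searchable:

```java
import java.util.*;

public class PhraseSearchSketch {
    // word -> url -> positions of that word in the document
    static final Map<String, Map<String, List<Integer>>> POSITIONAL_INDEX = new HashMap<>();

    // Naive suffix-stripping "stemmer", assumed here for illustration only;
    // a real engine would use a proven algorithm (e.g. Porter) instead.
    static String stem(String word) {
        for (String suffix : new String[] {"ations", "ation", "ings", "ing", "ed", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    static void indexPage(String url, String text) {
        String[] words = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) continue;
            String term = stem(words[pos]);
            POSITIONAL_INDEX
                .computeIfAbsent(term, t -> new HashMap<>())
                .computeIfAbsent(url, u -> new ArrayList<>())
                .add(pos);
        }
    }

    // Phrase query: a URL matches only if the two terms occur at adjacent
    // positions, so "local restaurant" is stricter than "has both words".
    static Set<String> phraseQuery(String first, String second) {
        Map<String, List<Integer>> a = POSITIONAL_INDEX.getOrDefault(stem(first), Map.of());
        Map<String, List<Integer>> b = POSITIONAL_INDEX.getOrDefault(stem(second), Map.of());
        Set<String> hits = new HashSet<>();
        for (String url : a.keySet()) {
            List<Integer> positionsB = b.get(url);
            if (positionsB == null) continue;
            for (int p : a.get(url)) {
                if (positionsB.contains(p + 1)) { hits.add(url); break; }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical URL, for illustration only
        indexPage("http://example.com/page1",
                  "I invited her to have dinner in a local restaurant near downtown.");
        System.out.println(phraseQuery("local", "restaurant")); // [http://example.com/page1]
        System.out.println(phraseQuery("restaurant", "local")); // [] (wrong word order)
    }
}
```

The stem() above is deliberately crude; real engines use a proper stemmer or lemmatizer, which (to my understanding) folds "invite"/"invited"/"invites" together while still keeping pairs like "back" and "backwards" distinct.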

Any comments are welcome. Thanks!

  • There are numerous search engines implemented in Java. Lucene and Solr as a bundle are well documented now, have a very active support community, and the source code is available too. Why not study a well-working solution? Good luck. – shellter Feb 22 '13 at 22:51
  • Nice meaty challenge. Keep us posted!!! – Caffeinated Feb 22 '13 at 23:04
  • @shellter Thanks. Just learned that. Since it contains so many packages, could you kindly tell me where the core parts are and which modules I should mainly look at (since it might take a long time to read all the packages of code from beginning to end)? – NSF Feb 22 '13 at 23:30
  • I'd highly recommend the Manning Publications 'Lucene in Action Ed 2' (no affiliation with that company). Lucene is the core engine that parses your data and solves the issues you're interested in. Adding in the Solr components is about making the search engine work in larger contexts (i.e. enterprise-level search). Good luck. – shellter Feb 23 '13 at 01:00
