I'm thinking of implementing a small search engine. However, I'm not sure how search engines handle word segmentation.
Here is my current plan:
- Build a word dictionary containing popular words
- For each sentence in the HTML document, split the text into words on spaces
- Do a linear search to check which of the words are in the dictionary. Those that are become the keywords of that page.
- Create one DB table per keyword, and store the page's URL in the table of every keyword found on that page
For example, take the sentence "I invited her to have dinner in a local restaurant near downtown." After removing the stop words, the remaining words are: {invited, dinner, local, restaurant, downtown}
The dictionary only contains the words {invite, dinner, restaurant}
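To make the plan concrete, here's a rough Python sketch of what I have in mind. The stop-word list, the function names, and the example URL are placeholders I made up, a dict of sets stands in for the per-keyword DB tables, and I used a set lookup rather than a real linear search just to keep it short.

```python
import re
from collections import defaultdict

# Placeholder data: a real stop-word list and dictionary would be much larger.
STOP_WORDS = {"i", "her", "to", "have", "in", "a", "near"}
DICTIONARY = {"invite", "dinner", "restaurant"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def extract_keywords(text):
    """Return (keywords, unknown): non-stop words in / not in the dictionary."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    keywords = [w for w in words if w in DICTIONARY]
    unknown = [w for w in words if w not in DICTIONARY]
    return keywords, unknown


def index_page(index, url, text):
    """Record the URL under every dictionary keyword found in the text."""
    keywords, _ = extract_keywords(text)
    for word in keywords:
        index[word].add(url)


index = defaultdict(set)  # stands in for one DB table per keyword
sentence = "I invited her to have dinner in a local restaurant near downtown."
index_page(index, "http://example.com/page", sentence)

keywords, unknown = extract_keywords(sentence)
print(keywords)  # ['dinner', 'restaurant']
print(unknown)   # ['invited', 'local', 'downtown']
```

With this sketch only "dinner" and "restaurant" get indexed; "invited" is not matched to "invite", and "local" and "downtown" are dropped entirely.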
Here are the problems:
- How should I handle words that are not in the dictionary (e.g. "downtown")?
- How should I deal with past tense, plural forms, etc.? Should I store all words sharing a certain prefix together (e.g. "invite" would cover "invites", "invited", "invitation", ...)? Then what about words like "back" and "backwards"?
- How should I handle queries like "local restaurant"? Simply combining the results for "local" and "restaurant" does not seem like a good solution, but storing "local restaurant" as another keyword table could create a lot of duplicates and make word segmentation harder.
- Are there better approaches than what I've described?
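To illustrate the prefix question, here's a tiny sketch of the naive prefix grouping I was considering. The function name `group_by_prefix` and the root list are made up for this example. It misses "invitation", because "invite" is not literally a prefix of it, and it lumps "backwards" in with "back".

```python
def group_by_prefix(word, roots):
    """Naively map a word to the longest dictionary root that is its prefix."""
    matches = [r for r in roots if word.startswith(r)]
    return max(matches, key=len) if matches else None


roots = {"invite", "back"}  # hypothetical dictionary roots

print(group_by_prefix("invited", roots))     # invite
print(group_by_prefix("invitation", roots))  # None
print(group_by_prefix("backwards", roots))   # back
```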
Any comments are welcome. Thanks!