I need to create simple search engine for my application. Let's simplify it to the following: we have some texts (a lot) and i need to search and show relevant results.
I've based on this great article extend some things and it works pretty well for me.
But i have problem with stemming words to terms. For example words "annotation", "annotations" etc. will be stemmed to "annot", but imagine you try search something, and you will see unexpected results:
- "anno" - nothing
- "annota" - nothing etc.
Only word "annot" will give relevant result. So, how should i improve my search to give expected results? Because "annot" contains "anno" and "annota" is slightly more than "annot". Using contains all the time obviously isn't the solution
If in first case i can use some Ternary search tree, in second case i don't know what to do.
Any ideas would be very helpful.
UPDATE
oleksii has pointed me to n-grams here, which may works for me, but i don't know how to properly index n-grams.
So the Question:
- Which data structure would be the best for my needs
- How properly index my n-grams