I'm implementing a search that combines phrase and keyword matching (this kind of search most likely has a name, but I don't know it). For example, the query "I like turtles" should match:
I like turtles
He said I like turtles
I really like turtles
I really like those reptiles called turtles
Turtles is what I like
In short, a string must contain all keywords to match.
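To make the matching rule concrete, here is a minimal Python sketch of it. The helper names (tokenize, matches) are hypothetical, and it assumes matching is case-insensitive and done on whole words, which is what the examples above suggest but not something I have pinned down:

```python
import re


def tokenize(text: str) -> list[str]:
    # Assumption: words are runs of letters, digits or apostrophes,
    # compared case-insensitively.
    return re.findall(r"[a-z0-9']+", text.lower())


def matches(result: str, query: str) -> bool:
    """Return True if every keyword of the query occurs as a word in the result."""
    return set(tokenize(query)) <= set(tokenize(result))


if __name__ == "__main__":
    query = "I like turtles"
    for candidate in ["Turtles is what I like", "I like cats"]:
        print(candidate, "->", matches(candidate, query))
```

Word order and distance are deliberately ignored here; they only matter for ranking.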
Then comes the problem of sorting the search results.
Naively, I'm assuming that the closer the matched keywords are to the beginning of the result AND to their arrangement in the original query, the better the result. How can I express this in code?
My first approach was to assign each keyword in each result a score based on how close the keyword is to its expected position, derived from the original query. In pseudo-code:
score(result, query) {
    keywords = query.split(" ")
    total = 0
    for i = 0 to keywords.length() - 1 {
        total += keywordScore(result, query, keywords, i)
    }
    return total
}

keywordScore(result, query, keywords, i) {
    // position of the i-th keyword in the result
    index = result.indexOf(keywords[i])
    // the first keyword is scored by its distance from the start of the result
    if (i == 0) return index
    previousIndex = result.indexOf(keywords[i-1])
    indexInSearch = query.indexOf(keywords[i])
    previousIndexInSearch = query.indexOf(keywords[i-1])
    // expected position: same offset from the previous keyword as in the query
    expectedIndex = previousIndex + (indexInSearch - previousIndexInSearch)
    return abs(index - expectedIndex)
}
The lower the score, the better the result. The scores for the above examples seem decent enough:
I like turtles = 0
I really like turtles = 7
He said I like turtles = 8
I really like those reptiles called turtles = 38
Turtles is what I like = 39
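For anyone who wants to try it, the pseudo-code above can be sketched as runnable Python. This is only a rough translation under a few assumptions of mine: comparison is case-insensitive (both strings are lower-cased), keywords are located by plain substring search with str.find, and every keyword is present in the result (str.find returns -1 otherwise, which would distort the score). The function names are just illustrative.

```python
def keyword_score(result: str, query: str, keywords: list[str], i: int) -> int:
    # Position of the i-th keyword in the result (first occurrence, as a substring).
    index = result.find(keywords[i])
    if i == 0:
        # The first keyword is scored by its distance from the start of the result.
        return index
    previous_index = result.find(keywords[i - 1])
    # Offset between this keyword and the previous one in the original query.
    offset_in_query = query.find(keywords[i]) - query.find(keywords[i - 1])
    expected_index = previous_index + offset_in_query
    return abs(index - expected_index)


def score(result: str, query: str) -> int:
    # Assumption: case-insensitive comparison.
    result, query = result.lower(), query.lower()
    keywords = query.split()
    return sum(keyword_score(result, query, keywords, i) for i in range(len(keywords)))


if __name__ == "__main__":
    query = "I like turtles"
    results = [
        "Turtles is what I like",
        "I really like turtles",
        "I like turtles",
    ]
    # Lower score first, i.e. best match first.
    for text in sorted(results, key=lambda r: score(r, query)):
        print(score(text, query), text)
```

Because the lookup is a substring search, a short keyword such as "I" can also hit inside another word (for example the "i" in "said"), which is one of the weak spots of this scoring.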
Is this a viable approach to sort search results?
Leaving any kind of semantic analysis aside, what else could I be considering to improve it?