
I'm implementing a search that combines phrase and keyword matching (this kind of search most likely has a name, but I don't know it). For example, the search "I like turtles" should match:

I like turtles
He said I like turtles
I really like turtles
I really like those reptiles called turtles
Turtles is what I like

In short, a string must contain all keywords to match.
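
As a minimal sketch of that matching rule: the helper names `tokenize` and `matches` are hypothetical, and case-insensitive whole-word comparison is an assumption (the question does not say how keywords are compared).

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def matches(result, query):
    """Return True if every keyword of the query occurs as a word in the result."""
    result_words = set(tokenize(result))
    return all(keyword in result_words for keyword in tokenize(query))

candidates = [
    "I like turtles",
    "He said I like turtles",
    "Turtles is what I like",
    "I like reptiles",
]
# Keep only the candidates that contain all keywords, in any order.
hits = [c for c in candidates if matches(c, "I like turtles")]
```

Here `hits` keeps the first three candidates and drops "I like reptiles", because the keyword "turtles" is missing from it.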

Then comes the problem of sorting the search results.

Naively, I'm assuming that the closer the matches are to the beginning of the result AND to their positions in the original query, the better the result. How can I express this in code?

My first approach was to assign a score to each keyword in each result, based on how close the keyword is to its expected position, which is derived from the original query. In pseudo-code:

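// Total score of a result: the sum of its per-keyword scores.
score(result, query) {
    keywords = query.split(" ")
    total = 0
    for i = 0 to keywords.length() - 1 {
        total += keywordScore(result, query, keywords, i)
    }
    return total
}

// Score of the i-th keyword: its distance from the expected position.
keywordScore(result, query, keywords, i) {
    index = result.indexOf(keywords[i])
    // The first keyword is scored by its distance from the start of the result.
    if (i == 0) return index

    previousIndex = result.indexOf(keywords[i-1])
    indexInSearch = query.indexOf(keywords[i])
    previousIndexInSearch = query.indexOf(keywords[i-1])

    // Expect the same gap after the previous keyword as in the query.
    expectedIndex = previousIndex + (indexInSearch - previousIndexInSearch)

    return abs(index - expectedIndex)
}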
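
For anyone who wants to experiment, here is a rough Python port of the pseudo-code above. It is a sketch under two assumptions that the pseudo-code does not state: comparison is case-insensitive, and every result passed in already contains all keywords. The name `keyword_score` is only illustrative.

```python
def keyword_score(result, query, keywords, i):
    """Distance between where keyword i occurs and where it is expected."""
    index = result.find(keywords[i])  # str.find returns -1 when not found
    if i == 0:
        # The first keyword is scored by its distance from the start.
        return index

    previous_index = result.find(keywords[i - 1])
    index_in_query = query.find(keywords[i])
    previous_index_in_query = query.find(keywords[i - 1])

    # Expect the same character gap after the previous keyword as in the query.
    expected_index = previous_index + (index_in_query - previous_index_in_query)
    return abs(index - expected_index)

def score(result, query):
    """Sum of the per-keyword scores; lower is better."""
    # Assumption: lowercase both strings so that matching ignores case.
    result, query = result.lower(), query.lower()
    keywords = query.split()
    return sum(keyword_score(result, query, keywords, i)
               for i in range(len(keywords)))

query = "I like turtles"
print(score("I like turtles", query))         # 0
print(score("I really like turtles", query))  # 7

# Sort ascending, since a lower score means a better result.
ranked = sorted(["I really like turtles", "I like turtles"],
                key=lambda r: score(r, query))
```

Note that `find` matches substrings rather than whole words, so a short keyword such as "i" can be found inside another word (for example in "said"), which shifts the expected positions of the keywords that follow it.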

The lower the score the better the result. The scores for the above examples seem decent enough:

I like turtles = 0
I really like turtles = 7
He said I like turtles = 8
I really like those reptiles called turtles = 38
Turtles is what I like = 39

Is this a viable approach to sort search results?

Leaving any kind of semantic analysis aside, what else could I be considering to improve it?

hpique
  • There are various measures of string similarity; see e.g. [the Wikipedia category](http://en.wikipedia.org/wiki/Category:String_similarity_measures). – jonrsharpe Aug 18 '14 at 13:34
  • @jonrsharpe Wouldn't string distance algorithms penalize longer search results? Or are you thinking of a particular string similarity algorithm from that list? – hpique Aug 18 '14 at 13:56
  • This seems open to a lot of incidental variation. "I don't like turtles" would score higher than "I really like turtles". – Jerry Coffin Aug 20 '14 at 21:23
  • @JerryCoffin Well, from a syntactical point of view, "I don't like turtles" is 1 character closer to the original query than "I really like turtles". Of course, a search engine should consider semantics, but I prefer to leave that out of the scope of the question. – hpique Aug 21 '14 at 08:19
  • I think if I were doing it, at the very least I'd look only at positions of entire words, not the number of letters in an individual word. I'd probably also do some preprocessing like removing noise words and stemming the words that remained. – Jerry Coffin Aug 21 '14 at 11:51
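
The suggestion in the last comment (score by word positions, after dropping noise words and stemming) could be sketched roughly as follows. Everything here is an assumption made for illustration: the tiny `NOISE_WORDS` list, the crude `stem` helper that only strips a trailing "s", and the names `normalize` and `word_score`. A real implementation would more likely use an established stop-word list and stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny noise-word list.
NOISE_WORDS = {"a", "an", "the", "is", "of", "those", "what"}

def stem(word):
    """Crude stand-in for a stemmer: strip one trailing 's' from longer words."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def normalize(text):
    """Lowercase, split into words, drop noise words, stem what remains."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [stem(w) for w in words if w not in NOISE_WORDS]

def word_score(result, query):
    """Score by word positions instead of character positions; lower is better.

    Assumes the result contains every keyword (list.index raises ValueError
    otherwise).
    """
    result_words = normalize(result)
    positions = [result_words.index(w) for w in normalize(query)]
    # The first keyword is scored by its distance from the start of the result.
    total = positions[0]
    # Each later keyword is expected exactly one word after the previous one.
    for previous, current in zip(positions, positions[1:]):
        total += abs((current - previous) - 1)
    return total

query = "I like turtles"
print(word_score("I really like turtles", query))  # 1
print(word_score("I don't like turtles", query))   # 1
```

With word positions, "I don't like turtles" and "I really like turtles" get the same score, so the length of the inserted word no longer affects the ranking.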
