
I have a text file with multiple lines, each containing details of an object. I want to find the score of every string and check which string is most relevant to the user input. E.g. the text file contains

 This is not a blue car
 Blue or black car is here
 This is red car
 Red car is here

User input is "red car".

How do I find the most relevant string, so that the output is ordered by relevance and looks like this:

 This is red car
 Red car is here
 This is not a blue car
 Blue or black car is here
mike

2 Answers


In order to determine a relevance score for any string out of a given set of strings against a query string, in your case 'red car', you need an information retrieval similarity measure.

Okapi BM25 is such a similarity measure. Since this delves fairly deep into the field of text indexing, you'll probably have to do some studying before you can implement it yourself.

Below is the definition of the algorithm:

 score(D, Q) = sum over all q_i in Q of IDF(q_i) * f(q_i, D) * (k1 + 1) / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))

D is the document, i.e. in your case a single line. Q is the query, which consists of the terms q_i. f(q_i, D) is how often q_i occurs in D, |D| is the length of D in words, avgdl is the average document length, IDF is the inverse document frequency, and k1 and b are free parameters.

The intuition behind this algorithm is to create a score for each term q_i in Q. The score is based on the term's total occurrences across all strings, i.e. terms that occur in many strings get a low weight, since they carry little information (in large English texts these would normally be words like be, have, etc.), and it is based on how often the term occurs within the string you are searching. That means if a small text contains a given term, e.g. rocket, often, the term is more significant to that small text than it would be to a text ten times the length, even if the term occurs twice as often there.
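If you just want to play with the formula, here is a rough sketch in plain Python. The whitespace tokenization, the lower-casing, and the parameter values k1 = 1.5 and b = 0.75 are my own simplifications for this example, not something prescribed by BM25 itself:

import math

def bm25_scores(lines, query, k1=1.5, b=0.75):
    """Score every line against the query using the Okapi BM25 formula."""
    docs = [line.lower().split() for line in lines]
    avgdl = sum(len(d) for d in docs) / float(len(docs))
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)           # number of lines containing the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms get more weight
            tf = doc.count(term)                             # occurrences of the term in this line
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

lines = ["This is not a blue car", "Blue or black car is here",
         "This is red car", "Red car is here"]
for score, line in sorted(zip(bm25_scores(lines, "red car"), lines), reverse=True):
    print(line)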


If you want more information, you can read the linked wiki article, or read the following paper for a start: Inverted files for text search engines.


If you don't want to do the search yourself, you can use a library, e.g. Whoosh. As it says on its website:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python

Furthermore, it has a

Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.

That means you can change the similarity measure that determines the relevance in order to get the behavior you want for your application, at least to some degree.


To perform a search, you have to create an index first; this is described here. Afterwards you can query the index as you desire. Refer to the documentation for more information and help with the library.
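For example, a minimal sketch with Whoosh could look like this. The index directory name indexdir and the OrGroup query grouping (so that lines matching only one of the query words are still ranked) are my own choices for this example:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser, OrGroup

lines = [u"This is not a blue car", u"Blue or black car is here",
         u"This is red car", u"Red car is here"]

# One stored text field per line; Whoosh ranks matches with BM25F by default.
schema = Schema(line=TEXT(stored=True))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
for line in lines:
    writer.add_document(line=line)
writer.commit()

with ix.searcher() as searcher:
    # OrGroup: a line only needs to contain one of the query terms to be scored.
    query = QueryParser("line", ix.schema, group=OrGroup).parse(u"red car")
    for hit in searcher.search(query):
        print(hit.score, hit["line"])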

mike

For this particular problem, I would use simple Levenshtein distance. I have recently used it for exactly this kind of application (grouping similar queries together) and it worked well:

import editdistance

def normalized_edit_similarity(a, b):
    return 1.0 - editdistance.eval(a, b) / (1.0 * max(len(a), len(b)))

I used the https://pypi.python.org/pypi/editdistance package. Note: editdistance.eval is the plain Levenshtein distance, so I normalize it by dividing by the length of the longer string (a standard way of normalizing Levenshtein distance).
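Applied to your example, it could be used like this (the lower-casing is my own addition, so that case differences don't inflate the edit distance):

lines = ["This is not a blue car", "Blue or black car is here",
         "This is red car", "Red car is here"]
query = "red car"

# Print the lines ordered from most to least similar to the query.
for line in sorted(lines, key=lambda l: normalized_edit_similarity(query, l.lower()), reverse=True):
    print(line)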

LetMeSOThat4U