
I have a text file with multiple lines, each containing details of an object. I want to find the score of every string and check which string is most relevant to the user input. E.g. the text file contains

 This is not a blue car
 Blue or black car is here
 This is red car
 Red car is here

User input is "red car".

How do I find the most relevant string, so that the output is ordered by relevance and looks like this:

 This is red car
 Red car is here
 This is not a blue car
 Blue or black car is here
mike

2 Answers


In order to determine a relevance score for any string out of a given set of strings against a query string, in your case 'red car', you need an information retrieval similarity measure.

Okapi BM25 is such a similarity measure. Since this delves fairly deep into the field of text indexing, you'll probably have to do some studying before you can implement it yourself.

Below is the definition of the algorithm:

 score(D, Q) = sum over all q_i in Q of IDF(q_i) * f(q_i, D) * (k1 + 1) / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))

D is the document, i.e. in your case a single line. Q is the query, which consists of the terms q_i. f(q_i, D) is how often q_i occurs in D, |D| is the length of D in words, avgdl is the average document length, IDF is the inverse document frequency, and k1 and b are free parameters.

The intuition behind this algorithm is to create a score for each term q_i in Q. The score is based on the term's total occurrences across all strings, i.e. terms that occur in many strings get a low weight, since they carry little information (in large English texts these would normally be words like be, have, etc.), and it is based on how often the term occurs within the string you are searching. That means if a small text contains a given term, e.g. rocket, often, the term is more significant to that small text than it would be to a text ten times the length, even if the term occurs twice as often there.
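If you just want to play with the formula, here is a rough sketch in plain Python. The whitespace tokenization, the lower-casing, and the parameter values k1 = 1.5 and b = 0.75 are my own simplifications for this example, not something prescribed by BM25 itself:

import math

def bm25_scores(lines, query, k1=1.5, b=0.75):
    """Score every line against the query using the Okapi BM25 formula."""
    docs = [line.lower().split() for line in lines]
    avgdl = sum(len(d) for d in docs) / float(len(docs))
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)           # number of lines containing the term
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms get more weight
            tf = doc.count(term)                             # occurrences of the term in this line
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

lines = ["This is not a blue car", "Blue or black car is here",
         "This is red car", "Red car is here"]
for score, line in sorted(zip(bm25_scores(lines, "red car"), lines), reverse=True):
    print(line)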


If you want more information, you can read the linked wiki article, or read the following paper for a start: Inverted files for text search engines.


If you don't want to do the search yourself, you can use a library, e.g. Whoosh. As it says on its website:

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python

Furthermore, it has a

Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.

That means you can change the similarity measure that determines the relevance in order to get the behavior you want for your application, at least to some degree.


To perform a search, you have to create an index first; this is described here. Afterwards you can query the index as you desire. Refer to the documentation for more information and help with the library.
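For example, a minimal sketch with Whoosh could look like this. The index directory name indexdir and the OrGroup query grouping (so that lines matching only one of the query words are still ranked) are my own choices for this example:

import os

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser, OrGroup

lines = [u"This is not a blue car", u"Blue or black car is here",
         u"This is red car", u"Red car is here"]

# One stored text field per line; Whoosh ranks matches with BM25F by default.
schema = Schema(line=TEXT(stored=True))

os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
for line in lines:
    writer.add_document(line=line)
writer.commit()

with ix.searcher() as searcher:
    # OrGroup: a line only needs to contain one of the query terms to be scored.
    query = QueryParser("line", ix.schema, group=OrGroup).parse(u"red car")
    for hit in searcher.search(query):
        print(hit.score, hit["line"])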

mike

For this particular problem, I would use simple Levenshtein distance. I have recently used it for exactly this kind of application (grouping similar queries together) and it worked well:

import editdistance

def normalized_edit_similarity(a, b):
    return 1.0 - editdistance.eval(a, b) / (1.0 * max(len(a), len(b)))

I used the https://pypi.python.org/pypi/editdistance package. Note: editdistance.eval is the plain Levenshtein distance, so I normalize it by dividing by the length of the longer string (a standard way of normalizing Levenshtein distance).
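Applied to your example, it could be used like this (the lower-casing is my own addition, so that case differences don't inflate the edit distance):

lines = ["This is not a blue car", "Blue or black car is here",
         "This is red car", "Red car is here"]
query = "red car"

# Print the lines ordered from most to least similar to the query.
for line in sorted(lines, key=lambda l: normalized_edit_similarity(query, l.lower()), reverse=True):
    print(line)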

LetMeSOThat4U