I have a database with 1 million rows, and based on a user's input I need to find the most relevant matches for them.
The code was originally written using the fuzzywuzzy library: a ratio between two strings was calculated to show how similar they were.
The problem is that we have to run the ratio function for every row in the database, meaning 1 million function calls per search, and the performance is really bad. We never thought we'd get to the point of having this much data.
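To give an idea, the current logic is essentially this (a simplified sketch; the real table and column names are different):

```python
from fuzzywuzzy import fuzz

def search(user_input, rows, limit=10):
    # Score every row against the user's input -- ~1 million calls per search
    scored = [(row, fuzz.ratio(user_input, row)) for row in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]
```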
I am looking for a better algorithm or approach to handle the search in this case. I've stumbled upon something called TF-IDF (Term Frequency-Inverse Document Frequency), which was described as a way to do "fuzzy matching at scale", much faster.
Unfortunately I couldn't wrap my mind around it and completely understand how it works, and the more I read about it, the more I think it's not what I need, since all the examples I've seen try to find matches between 2 lists, not between 1 string and 1 list.
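For reference, the pattern I keep running into in those examples looks roughly like this (my own paraphrase using scikit-learn; I don't see how to adapt it to a single query string against my table):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names_a = ["Acme Corp", "Globex Corporation"]   # first list
names_b = ["ACME Corporation", "Globex Inc."]   # second list

# Character n-grams make TF-IDF behave more like fuzzy matching
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
matrix_a = vectorizer.fit_transform(names_a)
matrix_b = vectorizer.transform(names_b)

# Similarity of every name in list A against every name in list B
similarity = cosine_similarity(matrix_a, matrix_b)
print(similarity)
```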
So, am I on the wrong path? And if so, could you give me some ideas on how I can handle this scenario? Unfortunately, Full Text Search only handles exact matches, so in our case fuzzy matching is definitely the way we want to go.
And before anyone proposes a separate search engine: we don't want to add a new tool to our infrastructure just for this.