3

I have a very large list of Strings stored in a NoSQL DB. An incoming query is a string, and I want to check whether that String is in the list or not. The Exact match case is very simple: the NoSQL DB can have the String as the primary key, and I just check whether there is any record with that string as its primary key. But I need to check for Fuzzy matches as well.

One approach is to traverse every String in the list and compute the Levenshtein distance between the input String and each entry, but this is O(n) per query, and the list is very large (10 million entries) and may grow further. This approach would give my solution high latency.
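For illustration, the brute-force scan described above could look like this minimal Python sketch (the candidate list and distance threshold here are hypothetical, not from my actual system):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def fuzzy_scan(query: str, candidates, max_dist: int = 1):
    # The O(n) approach: compare the query against every stored string.
    return [c for c in candidates if levenshtein(query, c) <= max_dist]
```

With 10 million entries, every query pays the full scan cost, which is exactly the latency problem.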

Is there a better way to solve this problem?

Devil
  • Searching for a fuzzy String is always complicated. It results in high complexity, and I don't think there is a really good solution to avoid that. Is it possible to correct fuzzy Strings before searching? Also, which NoSQL database do you use? Some of them provide a search function for fuzzy strings. Or you could try a search engine like [ElasticSearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html) – GAlexMES Mar 13 '17 at 09:36
  • 1
    why don't you use phonetic algorithm like Soundex or Metaphone.You may give a try for it. – Abu Sufian Mar 13 '17 at 09:39
  • The Apache commons-text library provides some routines for this, such as cosine distance, but it does sound like you want to at least use an embedded Lucene for this. Even with Lucene, Levenshtein distance searches have a high cost, although Lucene has improved this. – David George Mar 13 '17 at 10:53
  • The standard approach is to use n-grams. See the answer below for more details. – rghome Mar 13 '17 at 12:36

3 Answers

2

Fuzzy matching is complicated for the reasons you have discovered. Calculating a distance metric for every combination of search term against database term is impractical for performance reasons.

The solution to this is usually to use an n-gram index. This can either be used standalone to give a result, or as a filter to cut down the size of possible results so that you have fewer distance scores to calculate.

So basically, if you have the word "stack" you break it into n-grams (commonly trigrams): "sta", "tac", and "ack". You index those in your database against the database row. You then do the same for the input and look for the database rows that share matching n-grams.
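As a rough sketch of that idea (the function names and the `min_shared` cutoff are illustrative, not from any particular library):

```python
from collections import defaultdict

def trigrams(word: str):
    # All contiguous 3-character substrings of the word.
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(words):
    # Inverted index: trigram -> set of words containing it.
    index = defaultdict(set)
    for w in words:
        for g in trigrams(w):
            index[g].add(w)
    return index

def candidates(index, query, min_shared=2):
    # Count shared trigrams per word and keep words sharing enough of them;
    # these survivors are the only ones needing an exact distance score.
    counts = defaultdict(int)
    for g in trigrams(query):
        for w in index.get(g, ()):
            counts[w] += 1
    return {w for w, c in counts.items() if c >= min_shared}
```

In practice implementations often pad words with boundary markers so that edits near the start or end of a word still share grams with the original.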

This is all complicated, and your best option is to use an existing implementation such as Lucene/Solr which will do the n-gram stuff for you. I haven't used it myself as I work with proprietary solutions, but there is a stackoverflow question that might be related:

Return only results that match enough NGrams with Solr

Some databases seem to implement n-gram matching. Here is a link to a Sybase page that provides some discussion of that:

Sybase n-gram text index

Unfortunately, a full discussion of n-grams would make this a very long post, and I don't have time. It is probably discussed elsewhere on stackoverflow and other sites; I suggest Googling the term and reading up on it.

rghome
1

First of all, if searching is what you're doing, then you should use a search engine (ElasticSearch is pretty much the default). They are good at this, and you are not reinventing the wheel.

Second, the technique you are looking for is called stemming. Along with the original String, save a normalized string in your DB. Normalize the search query with the same mechanism. That way you will get much better search results. Obviously, this is one of the techniques a search engine uses under the hood.
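A minimal sketch of the idea, using a hand-rolled normalizer (a real system would use a proper stemmer such as Porter's; the suffix rules here are deliberately crude):

```python
def normalize(s: str) -> str:
    # Toy normalizer: lowercase, strip punctuation, crude plural stripping.
    s = "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace())
    words = []
    for w in s.split():
        if w.endswith("es") and len(w) > 4:
            w = w[:-2]
        elif w.endswith("s") and len(w) > 3:
            w = w[:-1]
        words.append(w)
    return " ".join(words)

# Store normalize(original) alongside each record, and run the search
# query through the same function before looking it up.
```

Because both sides pass through the same function, superficial variants collapse to the same key.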

Sean Patrick Floyd
  • He wants to do a Levenshtein distance, so stemming is not going to help there. It is more complicated than that. – rghome Mar 13 '17 at 09:40
  • @rghome I read that he's tried that approach, but not that it's a requirement – Sean Patrick Floyd Mar 13 '17 at 09:42
  • Thank you for your suggestion. My original approach is to use Levenshtein distance, but I am open to any other approach as well if it is better. For the original approach, I need to scan the complete list. I am wondering whether this could be optimized, as the list is very big and I don't want to scan it for each query. – Devil Mar 13 '17 at 10:59
  • Stemming doesn't help to return results with a one-letter mistake like `water/woter` – mishadoff Mar 13 '17 at 13:38
  • @mishadoff true, but it's better than what he has now. Feel free to add your own more complete answer – Sean Patrick Floyd Mar 13 '17 at 13:39
1

Could Solr (or Lucene) be a suitable solution for you?

Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, "~", symbol at the end of a Single word Term. For example to search for a term similar in spelling to "roam" use the fuzzy search:

roam~

This search will find terms like foam and roams.

Starting with Lucene 1.9, an additional (optional) parameter can specify the required similarity. The value is between 0 and 1; with a value closer to 1, only terms with a higher similarity will be matched. For example:

roam~0.8 

https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
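To give a feel for what that similarity threshold means, here is an illustrative Python sketch. It approximates the classic Lucene scoring as 1 - edit_distance / min(term lengths); the exact formula is Lucene's internal business, so treat the numbers as indicative only:

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # Approximation of how classic Lucene scored fuzzy terms.
    return 1.0 - edit_distance(a, b) / min(len(a), len(b))

def fuzzy_match(query, terms, min_similarity=0.8):
    # Keep only terms at or above the threshold, like roam~0.8.
    return [t for t in terms if similarity(query, t) >= min_similarity]
```

So `roam~0.8` is stricter than plain `roam~`: "foam" (one edit from "roam") scores about 0.75 and would be dropped at 0.8.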

freedev
  • Just a side tip: ElasticSearch and Solr both use Lucene at heart. @Devil – Bagus Tesa Mar 13 '17 at 09:43
  • Thank you for your suggestion! I see from the comments that Lucene provides both Exact and Fuzzy match and Solr and Elastic Search are both Lucene based. Is there a timeout feature as well in either Solr or Elastic Search which deletes the record after a fixed time? Also, I hope latency is not a problem in both – Devil Mar 13 '17 at 11:13
  • @Devil AFAIK there isn't any "delete records after fixed time" feature, but you could easily add a creation_timestamp field to your documents and filter results earlier than a specified date-time or/and delete periodically all the documents older than... – freedev Mar 13 '17 at 11:18