Improve performance while using levenshtein and soundex algorithms on search in MySQL

Question

We are trying to upload data from Excel to Database. Before uploading, we would like to preview the data with the count of Match status(Eg: No match, Similar match, Exact match) while comparing with our database.

The below query is taking 3 minutes of time to fetch the information from the database for 100 rows. We will have a case where the user can upload more than 5K rows of data. Please let us know your suggestions to improve the performance of the below query.

select IF(
    count(distinct ID) <= 0, (
        select case when count(ID) > 0 then 'Similar Match' else 'No Match' end as MatchType from masterTable where (
            soundex(BarCode) like soundex('12069B0') or soundex(ProductName) like soundex('FreezerZX')
        ) and (
            levenshtein(BarCode,'12069B0') < 3 or (levenshtein(ProductName,'FreezerZX') < 3)
        )
    ), 
    'Exact Match'
) as MatchType from masterTable where BarCode= '12069B0' and ProductName= 'FreezerZX';

Why would you use Soundex on unpronounceable things like '12069B0'? — Matt Gibson, May 28 '18 at 06:07
I really wouldn't bother with SOUNDEX. Also, it's obviously meaningless for one soundex to be *like* another. Either they are equal, or they are not. — Strawberry, May 28 '18 at 06:30
Performance is improved a bit by removing soundex for Barcode, but to fetch results from 5K records we require more performance. And we cannot use equal as the expected searches could be '12069 B0' or '120-69B0'. Kindly let know another alternative/ suggestions for the query above. — Alekhya Revalla, May 28 '18 at 06:50
SOUNDEX '120-69B0' is no more (or less) *'like'* '12069 B0' than it is *'equal'* to it. — Strawberry, May 28 '18 at 07:16

score 0 · Answer 1 · answered May 29 '18 at 15:26

Without a simple algorithm for "similar match", I recommend you abandon doing the entire task with the computer. The human brain is very good at similarity tests, so let's add it in...

Sort the data by a column that you want to check for dups or near-dups in. Display that list vertically. Visually skim the list.

Repeat with REVERSE(col). This will spot pairs that had radical differences early in the string, but match up better at the end.

Repeat for the other columns, as needed.

Improve performance while using levenshtein and soundex algorithms on search in MySQL

1 Answers1