0

I have a database table for storing restaurant names and the city they are located in. Example:

name | city

eleven madison park | NYC

gramercy tavern | NYC

Lotus of Siam | TOK

The Modern | LA

ABC Kitchen | LA

Now when there is an incoming entry before INSERT, if there is no similar restaurant name in the same city, I want to go ahead and perform the insert.

But if the entry is like, say { name: "Eleven Madison", city: "NYC" }, I want to find similar entries in "name" column with the same city, in this example "eleven madison park" in "NYC", I want to do the insert and store a new row in 'conflicts' table - with the IDs of these restaurants (last insert id and similar row id)

I used the Levenshtein distance algorithm, with the following SQL query:

SELECT id, levenshtein_ratio(name, 'Eleven Madison') AS levsh from restaurants
where
city_name = 'NYC'
order by levsh asc
limit 0, 1

Then I set a threshold of 8, and if levsh is less than 8, then I mark it as a conflict i.e. insert a new record in 'conflicts' table. This query was working fine until the table grew to 1000 records. Now this query takes 2 seconds to finish.

I understand that this is because I am calculating levenshtein_ratio for all the restaurants in the city - and I only need to apply the ratio function only on similar names for ex. the ones containing 'Eleven' , 'Madison',.. or even better would be if i can do something like

WHERE city_name = 'NYC' AND SOUNDEX(any word in `name`) = SOUNDEX(any word in 'Eleven Madison')

Please help with suggestions on how to improve and optimize this query, and if possible any better approach to what I am doing.

Thanks

Community
  • 1
  • 1
Ananth
  • 4,227
  • 2
  • 20
  • 26
  • soundex to filter "somewhat similar",t hen apply levenshtein to the results of the soundex query. soundex() is dead cheap (in cpu cycle terms) to apply compared to levenshtein, so anything you can do to reduce the number of lev() comparisons the better. e.g. `select * from (select soundex query here....) where levenshtein(...)` – Marc B Jul 07 '15 at 17:14
  • Are you using the code in the link you provided? Perhaps that is not the most efficient one. Here is another link you can try: https://github.com/jmcejuela/Levenshtein-MySQL-UDF. Also you may want to be careful about UTf-8 and varchar implicit conversion. Make sure you are using the same character set throughout. – dan b Jul 07 '15 at 19:46

0 Answers0