I have a database table for storing restaurant names and the city they are located in. Example:
name | city
eleven madison park | NYC
gramercy tavern | NYC
Lotus of Siam | TOK
The Modern | LA
ABC Kitchen | LA
Now when there is an incoming entry before INSERT, if there is no similar restaurant name in the same city, I want to go ahead and perform the insert.
But if the entry is like, say { name: "Eleven Madison", city: "NYC" }, I want to find similar entries in "name" column with the same city, in this example "eleven madison park" in "NYC", I want to do the insert and store a new row in 'conflicts' table - with the IDs of these restaurants (last insert id and similar row id)
I used the Levenshtein distance algorithm, with the following SQL query:
SELECT id, levenshtein_ratio(name, 'Eleven Madison') AS levsh from restaurants
where
city_name = 'NYC'
order by levsh asc
limit 0, 1
Then I set a threshold of 8, and if levsh is less than 8, then I mark it as a conflict i.e. insert a new record in 'conflicts' table. This query was working fine until the table grew to 1000 records. Now this query takes 2 seconds to finish.
I understand that this is because I am calculating levenshtein_ratio for all the restaurants in the city - and I only need to apply the ratio function only on similar names for ex. the ones containing 'Eleven' , 'Madison',.. or even better would be if i can do something like
WHERE city_name = 'NYC' AND SOUNDEX(any word in `name`) = SOUNDEX(any word in 'Eleven Madison')
Please help with suggestions on how to improve and optimize this query, and if possible any better approach to what I am doing.
Thanks