Given the following tables (each containing appr. 2 mio datasets):
movie(title, genre, price)
ground_truth_movie(title, genre)
movie:
| title | genre | price |
|***************************|***********|*******|
| Bria nsau r us | History | 10.2 |
| BoJch o BillyOut6 tt e d | Animation | 10.2 |
|M on l igh t on th6 R ange| Fantasy | 10.2 |
| Pret a Teleport er | Animation | 10.2 |
| ... | ... | ... |
ground_truth_movie:
| title | genre |
|***************************|***********|
| Briansaurus | History |
| Broncho Billy Outwitted | Animation |
| Moonlight on the Range | Fantasy |
| Pret a Teleporter | Animation |
| ... | ... |
My task is to clean the titles in the movie table based on the titles in the ground_truth_movie table.
As the titles in the movie table are written completely wrong the best comparing function I found for this case was the levenshtein. The following SQL script I have so far:
SELECT m.title, (
SELECT g.title
FROM ground_truth_movie g
ORDER BY levenshtein(g.title, m.title) ASC, g.title
LIMIT 1
)
FROM movie m
As this script takes extremely long on a reduced dataset I thought about speeding it up in some way...?
Are there any other concepts of comparing columns with ground truths?
I'm a newbie in this field and I think my first try was not the best ;)
Thanks for help