
Given the following tables (each containing approximately 2 million rows):

movie(title, genre, price)
ground_truth_movie(title, genre)

movie:

|           title           |   genre   | price |
|---------------------------|-----------|-------|
|       Bria nsau r us      |  History  | 10.2  |
|  BoJch o BillyOut6 tt e d | Animation | 10.2  |
|M on l igh t on th6  R ange|  Fantasy  | 10.2  |
|    Pret  a  Teleport er   | Animation | 10.2  |
|           ...             |    ...    |  ...  |

ground_truth_movie:

|           title           |   genre   |
|---------------------------|-----------|
|       Briansaurus         |  History  |
|  Broncho Billy Outwitted  | Animation |
|  Moonlight on the Range   |  Fantasy  |
|     Pret a Teleporter     | Animation |
|           ...             |    ...    |

My task is to clean the titles in the movie table based on the titles in the ground_truth_movie table.

Since the titles in the movie table are badly corrupted, the best comparison function I found for this case was the Levenshtein distance. This is the SQL script I have so far:

SELECT m.title, (
  SELECT g.title
  FROM ground_truth_movie g
  ORDER BY levenshtein(g.title, m.title) ASC, g.title
  LIMIT 1
) AS cleaned_title
FROM movie m;

As this script takes extremely long even on a reduced dataset, I thought about speeding it up in some way...?
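For reference, the same nearest-neighbour lookup can be written as a `LATERAL` join (a sketch, assuming PostgreSQL 9.3+ with the `fuzzystrmatch` extension for `levenshtein`; the alias `cleaned_title` is just illustrative). It does not remove the quadratic cost by itself, but it makes the per-row subquery explicit and easier to inspect with `EXPLAIN`:

```sql
-- Equivalent nearest-match query as a LATERAL join.
-- Still compares every movie row against every ground-truth row,
-- so it does not by itself fix the O(n * m) cost.
SELECT m.title, g.title AS cleaned_title
FROM movie m
CROSS JOIN LATERAL (
  SELECT g.title
  FROM ground_truth_movie g
  ORDER BY levenshtein(g.title, m.title), g.title
  LIMIT 1
) g;
```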

Are there any other concepts of comparing columns with ground truths?
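One such concept, raised in the comments below, is "blocking": partition the ground truth so each corrupted title is only compared against a small candidate set, e.g. titles sharing the first character (assuming that character is usually intact). This is just an illustrative Python sketch, not part of the original question; the function and variable names are made up for the example.

```python
# Sketch of "blocking": instead of comparing each corrupted title against
# all ground-truth titles, compare only against candidates that share the
# first character (heuristic assumption: the first character is intact).
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def build_blocks(ground_truth):
    """Group ground-truth titles by lower-cased first character."""
    blocks = defaultdict(list)
    for title in ground_truth:
        blocks[title[0].lower()].append(title)
    return blocks

def best_match(corrupted, blocks, all_titles):
    """Nearest ground-truth title; fall back to the full list if the block is empty."""
    candidates = blocks.get(corrupted[0].lower()) or all_titles
    return min(candidates, key=lambda t: (levenshtein(corrupted, t), t))

ground_truth = ["Briansaurus", "Broncho Billy Outwitted",
                "Moonlight on the Range", "Pret a Teleporter"]
blocks = build_blocks(ground_truth)
print(best_match("Bria nsau r us", blocks, ground_truth))  # → Briansaurus
```

With 2 million rows per table, blocking on the first character cuts each lookup from 2 million distance computations to roughly the size of one bucket, at the cost of missing matches whose first character was also corrupted.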

I'm a newbie in this field and I think my first try was not the best ;)

Thanks for help

Dario Behringer
  • By '2 mio datasets' do you mean 2 million rows in each table? If not, could you say approx size of datasets? Do you have the option of working with the datasets outside of the DB and then updating it separately? Being able to load everything into memory could allow for both more flexibility in your matching logic and greater performance. Assuming you could at least count on the first character being accurate, you could sort on the sub-selection of movies starting with that character (having an index would help there, obviously). – barry-johnson Apr 04 '14 at 20:18
  • Could you give some representative data example? – wumpz Apr 04 '14 at 20:19
  • Die you try a proper join? `select m.title, g.title from movie m join ground_truth_movie g on levenshtein(g.title, m.title) < 2` (replace `< 2` with whatever threshold you need) –  Apr 04 '14 at 20:41
  • @wumpz: Just added some demo data to the question above – Dario Behringer Apr 04 '14 at 20:46
  • @barry: yes, 2 million rows in each table. Exporting the data would not be a good idea, because it would be too big for my memory. Thanks anyway – Dario Behringer Apr 04 '14 at 20:48
  • I have also tried the proper join, but didn't find a good solution. As I have to replace all the titles in the movie table it is not sufficient to define a constant threshold, because in this case I would not be able to correct the movies with a bigger one. – Dario Behringer Apr 04 '14 at 20:55
  • Are you aware of trigraphs and trigrams in PostgreSQL: http://www.postgresonline.com/journal/archives/169-Fuzzy-string-matching-with-Trigram-and-Trigraphs.html – wumpz Apr 04 '14 at 21:08
  • already used trigrams and trigraphs but unfortunately the data are so corrupt that the results are not satisfactory in this case – Dario Behringer Apr 04 '14 at 21:59
  • What about the full text search? Maybe the "broken" values are still good enough for that. –  Jul 07 '14 at 11:11
