
I have an interesting SQL task and thought I would ask the community if anyone knows a fast way to accomplish it. I have two slow solutions, but I'm wondering if I am missing something faster.

Here is the task:

Given a list of records in a table, table A, with a column that references the primary key of another table, table B (logically speaking only, since this is MyISAM without foreign keys), we want to dedupe table B, update table A to use the canonical deduped value from table B, and then delete all but the canonical id records from table B.

This might be easier to illustrate with a small example. Let's say table A is a person table, and table B is a city table. Let's also say that there are records in the city table that are duplicates and need deduping; for instance, row 1 and row 2 of table B both refer to Los Angeles.

Then in the person table, we want to update all persons in Los Angeles with city id 2 to have city id 1, and delete the duplicate row from the city table with city id 2.
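In SQL terms, for this two-row example (assuming city id 1 is kept as the canonical row), that comes down to:

update person set city_id = 1 where city_id = 2

delete from city where city_id = 2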

There may be many such rows representing the duplicated value, not just two; you get the point. Right now, I am querying all the cities out of the city table, grouping them into equivalence classes, looping over each equivalence class, nominating the canonical version (in this case I just choose the first), and performing two queries, the update and the delete:

update person set city_id = $canonical_city_id where city_id in ($list_of_dupes)

Then

delete from city where city_id in ($list_of_dupes) and city_id != $canonical_city_id
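For reference, the equivalence classes come from a grouping query along these lines (a sketch that assumes a name column is what defines a duplicate; substitute the real matching criteria):

-- one row per duplicated city, with the full list of ids in its class
select name, group_concat(city_id) as ids
from city
group by name
having count(*) > 1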

I think there may be a faster way, since we don't care which id is canonical: it could be the first, the min, or a random one, it doesn't matter. Can you think of a way to do this whole job in one SQL statement? What do you think is the fastest way?
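One set-based possibility I've been considering (a sketch, again assuming a name column defines the duplicates): MySQL won't combine an update and a delete into one statement, but each half can be a single multi-table statement, which avoids the per-class loop entirely:

-- repoint every person at the lowest city_id in its city's duplicate group
update person p
join city c on p.city_id = c.city_id
join (
    select name, min(city_id) as canonical_id
    from city
    group by name
) canon on c.name = canon.name
set p.city_id = canon.canonical_id;

-- then drop every city row that is not the lowest id for its name
delete c1
from city c1
join city c2 on c1.name = c2.name and c1.city_id > c2.city_id;

This uses min(city_id) as the canonical pick, which is fine given that we don't care which id survives.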

David Williams
  • This sounds like a one-time data migration problem. Why do you care if you can achieve this in a single query as opposed to a multi-query script? What you should care about is 1) how to achieve this without loss of data accuracy and 2) how can you do this while limiting impact to production use of the database. – Mike Brant Mar 14 '14 at 19:14
  • Yeah, those are good points. – David Williams Mar 14 '14 at 20:30

0 Answers