I have a relational SQL database, derived from data returned by a Tensorflow image classifier. I believe that there are a huge number of duplicate entries in both the image table and relational table and I would like to remove them while ensuring that I don't break any relationships.
For example, this relationships table:
image_id | term_id | weight
1 18 0.95336
2 18 0.49187
2 25 0.35451
3 18 0.84148
3 18 0.84148
3 18 0.84148
4 18 0.84148
5 7 0.25471
Should become:
image_id | term_id | weight
1 18 0.95336
2 18 0.49187
2 25 0.35451
3 18 0.84148
5 7 0.25471
And this image table:
ID | file_path | file_name
1 /folder/ imageDog.jpg
2 /folder/ imageMouse.jpg
3 /folder/subfolder/ imageCat.jpg
4 /folder/subfolder/ imageCat.jpg
5 /folder/subfolder/ imageSnake.jpg
Should become:
ID | file_path | file_name
1 /folder/ imageDog.jpg
2 /folder/ imageMouse.jpg
3 /folder/subfolder/ imageCat.jpg
5 /folder/subfolder/ imageSnake.jpg
Notice that that image with ID 4 is a duplicate of 3, so the image AND it's relationships need to be deleted. I think that in some cases, relationships might be missing for some duplicated images, in which case, only the one image with relationships should be kept. I hope this makes sense.