-1

I have a relational SQL database, derived from data returned by a Tensorflow image classifier. I believe that there are a huge number of duplicate entries in both the image table and relational table and I would like to remove them while ensuring that I don't break any relationships.

For example, this relationships table:

image_id | term_id | weight
1 18 0.95336
2 18 0.49187
2 25 0.35451
3 18 0.84148
3 18 0.84148
3 18 0.84148
4 18 0.84148
5 7 0.25471

Should become:

image_id | term_id | weight
1 18 0.95336
2 18 0.49187
2 25 0.35451
3 18 0.84148
5 7 0.25471

And this image table:

ID | file_path | file_name
1 /folder/ imageDog.jpg
2 /folder/ imageMouse.jpg
3 /folder/subfolder/ imageCat.jpg
4 /folder/subfolder/ imageCat.jpg
5 /folder/subfolder/ imageSnake.jpg

Should become:

ID | file_path | file_name
1 /folder/ imageDog.jpg
2 /folder/ imageMouse.jpg
3 /folder/subfolder/ imageCat.jpg
5 /folder/subfolder/ imageSnake.jpg

Notice that that image with ID 4 is a duplicate of 3, so the image AND it's relationships need to be deleted. I think that in some cases, relationships might be missing for some duplicated images, in which case, only the one image with relationships should be kept. I hope this makes sense.

1 Answers1

1

You could use an aggregation function for obtain a single image_id eg: min(image_id) group by term_id, weight

  select min(image_id ) min_id
    , term_id 
    , weight
  from my_table  
  group by term_id, weight

and if you want delete the path not related to the min_id result you could use

delete  from your_file_path_table 
where ID  not in (
  select min(image_id ) 
     from my_table  
    group by term_id, weight
)
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107