I want to find similar images in a very large dataset (at least 50,000 images, potentially many more). I have already implemented several "distance" functions (e.g., hashes compared with L2 or Hamming distance, image features with a percentage of similarity, etc.); each one returns a single "double" value.
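For concreteness, this is the kind of comparison I mean; a minimal sketch, assuming 64-bit hashes (the function name and hash width are just an illustration):

```cpp
#include <bit>     // std::popcount (C++20)
#include <cstdint>

// Hamming distance between two 64-bit hashes: the number of differing bits,
// returned as a double to match the other distance functions.
double hammingDistance(std::uint64_t hashA, std::uint64_t hashB)
{
    return static_cast<double>(std::popcount(hashA ^ hashB));
}
```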
What I want now is to group (cluster?) the images by similarity. I have already achieved some pretty good results, but the groups are not perfect: some images that could be grouped with others are left aside, so my method is not good enough.
I have been looking for a solution for the last 3 days, but things are not so clear in my head; maybe I have overlooked a possible method?
I already have image pairs with distances: [image A (index, int), image B (index, int), distance (double)], and a list of duplicates (image X is similar to images Y, Z, T; image Y is similar to X, T, G, F; etc.).
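In code, the data I already have looks roughly like this (the names are mine, just for illustration):

```cpp
#include <unordered_map>
#include <vector>

// One scored pair: the indexes of two images plus their distance.
struct ImagePair {
    int a;           // index of image A
    int b;           // index of image B
    double distance; // result of one of the distance functions
};

std::vector<ImagePair> pairs; // all pairs considered "similar enough"

// Duplicates list: image index -> indexes of the images similar to it.
std::unordered_map<int, std::vector<int>> duplicates;
```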
My problem:
- find a suitable and efficient algorithm to group images by similarity from the list of duplicates and the pairwise distances. For me the problem is not really spatial, because the image indexes A and B are NOT coordinates; rather, there is a 1-n relation between images. One method I found interesting is DBSCAN, or maybe hierarchical solutions would work?
- use an efficient structure that is not too memory-hungry, so full matrices of doubles are excluded: 50K x 50K, 100K x 100K, or worse 1M x 1M is not reasonable, and the more images there are, the more memory the matrix eats. What's more, the matrix would be symmetric ("image A is similar to image B" is the same as "B is similar to A"), so half of that space would be wasted anyway (see the sketch after this list).
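Regarding the memory point above: what I have in mind is a sparse adjacency structure that only stores the pairs that pass a similarity threshold. A rough sketch (my own naming, not a library API):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Sparse storage: only thresholded pairs are kept. Each pair appears in the
// adjacency lists of both endpoints, i.e. O(2E) entries in total, instead of
// an O(N^2) full symmetric matrix.
class SparseDistances {
public:
    void addPair(int a, int b, double distance)
    {
        m_neighbors[a].push_back({b, distance});
        m_neighbors[b].push_back({a, distance});
    }

    // All images within "eps" of image i: the region query DBSCAN needs.
    std::vector<int> neighborsWithin(int i, double eps) const
    {
        std::vector<int> result;
        const auto it = m_neighbors.find(i);
        if (it == m_neighbors.end())
            return result;
        for (const auto& [j, d] : it->second)
            if (d <= eps)
                result.push_back(j);
        return result;
    }

private:
    std::unordered_map<int, std::vector<std::pair<int, double>>> m_neighbors;
};
```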
I'm coding in C++, using Qt6 for the interface and OpenCV 4.6 for some image functions, some hashing methods, etc.
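For example, one of the hash-based distances could look like this (a sketch, assuming the img_hash module from opencv_contrib is available in the build):

```cpp
#include <opencv2/img_hash.hpp>

// Perceptual-hash distance between two already-loaded images.
// For PHash, compare() returns the Hamming distance as a double.
double phashDistance(const cv::Mat& imgA, const cv::Mat& imgB)
{
    const auto hasher = cv::img_hash::PHash::create();
    cv::Mat hashA, hashB;
    hasher->compute(imgA, hashA);
    hasher->compute(imgB, hashB);
    return hasher->compare(hashA, hashB);
}
```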
Any idea/library/structure to propose? Thanks in advance.
EDIT - to better explain what I want to achieve
Images are the yellow circles. Image 1 is similar to image 4 with a score of 3, and to image 5 with a score of 2, etc.
The problem is that image 4 is also similar to image 5, and image 4 is more similar to image 1 than to image 5. The example here is very simple because no image has more than 2 similar images; with a bigger sample, image 4 could be similar to n images... And what about equal scores?
So, is there an algorithm to create groups of images such that no image is listed twice?
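To illustrate why this is not trivial: the naive approach is to treat similar pairs as graph edges and take connected components (a union-find sketch below, my own code, not from a library), but that merges images 1, 4 and 5 into a single group by transitivity and never uses the scores at all, which is exactly the problem:

```cpp
#include <numeric>
#include <vector>

// Naive grouping by connected components: every image lands in exactly one
// group, but chains of pairs merge everything transitively and the
// similarity scores are ignored.
class UnionFind {
public:
    explicit UnionFind(int n) : m_parent(n)
    {
        std::iota(m_parent.begin(), m_parent.end(), 0);
    }
    int find(int x)
    {
        while (m_parent[x] != x)
            x = m_parent[x] = m_parent[m_parent[x]]; // path halving
        return x;
    }
    void merge(int a, int b) { m_parent[find(a)] = find(b); }
private:
    std::vector<int> m_parent;
};

// With the pairs from the diagram, merge(1, 4), merge(1, 5) and merge(4, 5)
// put 1, 4 and 5 in the same group, regardless of the scores 3 and 2.
```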