
I want to find similar images in a very large dataset (at least 50K+ images, potentially much more). I have already implemented several "distance" functions (hashes compared with L2 or Hamming distance, image features with a percentage of similarity, etc.); the result is always a double.

What I want now is to group (cluster?) images by similarity. I have already achieved some pretty good results, but the groups are not perfect: some images that could be grouped with others are left aside, so my method is not good enough.

I have been looking for a solution for the last 3 days, but things are not clear in my head; maybe I have overlooked a possible method?

I already have image pairs with a distance: [image A (index, int), image B (index, int), distance (double)], and a list of duplicates (image X is similar to images Y, Z, T; image Y is similar to X, T, G, F; etc.).

My problem:

  • find a suitable and efficient algorithm to group images by similarity from the list of duplicates and the pairwise distances. For me the problem is not really spatial, because image indexes A and B are NOT coordinates, but there is a 1-n relation between images. One method I found interesting is DBSCAN, or maybe hierarchical solutions would work?
  • use an efficient structure that is not too memory-hungry, so full matrices of doubles are excluded: 50K x 50K, 100K x 100K, or worse 1M x 1M is not reasonable, and the more images there are, the more memory the matrix eats. What's more, the matrix would be symmetric, because "image A is similar to image B" is the same as "B is similar to A", so half the space would be a terrible waste.
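One memory-friendly alternative to a full symmetric matrix is to keep only the pairs you actually computed, in a hash map keyed on the ordered pair of indexes. This is just a sketch of that idea; the struct and method names (`SparseDistances`, `set`, `get`) are illustrative, not from the question:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>

// Store only the known pairs instead of a full N x N matrix of doubles.
// The key packs the two image indexes (smaller one first) into one 64-bit
// integer, so (A, B) and (B, A) share a single entry.
struct SparseDistances {
    std::unordered_map<std::uint64_t, double> d;

    static std::uint64_t key(std::uint32_t a, std::uint32_t b) {
        if (a > b) std::swap(a, b);                     // canonical order a <= b
        return (static_cast<std::uint64_t>(a) << 32) | b;
    }

    void set(std::uint32_t a, std::uint32_t b, double dist) {
        d[key(a, b)] = dist;
    }

    // Returns true and fills 'dist' if the pair was stored.
    bool get(std::uint32_t a, std::uint32_t b, double& dist) const {
        auto it = d.find(key(a, b));
        if (it == d.end()) return false;
        dist = it->second;
        return true;
    }
};
```

Memory usage then grows with the number of similar pairs rather than with N², which is usually tiny in comparison when a threshold already filters the candidates.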

I'm coding with C++, using Qt6 for the interface and OpenCV 4.6 for some image functions, some hashing methods, etc.

Any idea/library/structure to propose? Thanks in advance.

EDIT - to better explain what I want to achieve

Example

Images are the yellow circles. Image 1 is similar to image 4 with a score of 3, and to image 5 with a score of 2, etc.

The problem is that image 4 is also similar to image 5, and image 4 is more similar to image 1 than to image 5. The example I put here is very simple because there are no more than 2 similar images for each image. With a bigger sample, image 4 could be similar to n images... And what about equal scores?

So is there an algorithm to create groups of images, so that no image is listed twice?
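One common way to guarantee that each image ends up in exactly one group is to take the transitive closure of "is similar to" with a disjoint-set union (union-find): merge every pair that passes the threshold, and each image's root identifies its group. This is a minimal sketch, not necessarily the best clustering for this data (it will chain groups together through borderline pairs):

```cpp
#include <vector>

// Disjoint-set union (union-find). After merging all similar pairs, every
// image belongs to exactly one group, identified by find(i), so no image
// is listed twice.
struct DSU {
    std::vector<int> parent;
    explicit DSU(int n) : parent(n) {
        for (int i = 0; i < n; ++i) parent[i] = i;
    }
    int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];   // path halving
            x = parent[x];
        }
        return x;
    }
    void merge(int a, int b) { parent[find(a)] = find(b); }
};
```

With the example above: merging (1, 4) and (4, 5) puts images 1, 4 and 5 in the same group, regardless of the order of the scores.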

  • The problem is that clustering will probably always be imperfect, because some images will fall into several clusters. Have a look at my comment to a related issue: https://stackoverflow.com/questions/58462991/clustering-images-based-on-their-similarity/58564569#58564569 – Similar pictures Nov 06 '22 at 13:51
  • Thank you for the answer, but these special cases could be addressed with some extra computation, and that's exactly what I am doing for the moment, except that the method I am using could be better to begin with... Seeing your repo on github.io, you seem to have some experience on the subject. Would you have a clue to share with me? I'd like to test it myself. – AbsurdePhoton Nov 07 '22 at 18:07
  • It would be nice to see some screenshots of what you want and what you get. I do not fully understand your problem from the description. – Similar pictures Nov 08 '22 at 20:58
  • I am thinking about the problem. It is not trivial. In my clustering demo linked from GitHub, I simply add to a cluster all images that are similar to at least one image already in the cluster. The similarity criterion is a threshold value. This approach means that one image ends up in only one cluster, but it could potentially have ended up in another. I considered a better solution, which I abandoned for some reasons. But maybe I will reconstruct it here later. – Similar pictures Nov 11 '22 at 00:32
  • Just to make sure. Do you think you fully understand what image similarity is? Many people tend to mistakenly use pixel similarity for semantic similarity. Here is a great example to get an idea: https://code.flickr.net/2017/03/07/introducing-similarity-search-at-flickr/ – Similar pictures Nov 11 '22 at 00:46
  • No, this is not semantic similarity. The "distance" values are expressed as percentage of similarity between images, from hamming distance between image hashes or "good" matches between image features – AbsurdePhoton Nov 11 '22 at 10:31
  • I group the images by threshold like you do (first pass). It works so-so, because similarity algorithms are not perfect; for me false positives ARE the problem. Without them, grouping would be natural... So I'm mainly working on combining the results of several algorithms to lower the probability of false positives. Converting distances to percentages of similarity helps a lot, because you can combine percentages, whereas Hamming and L2 distances are very different. I tried a second and third pass, trying to match the images inside a group for example, but couldn't find a way that works better. – AbsurdePhoton Nov 11 '22 at 11:03
  • But remember: my question was about clustering from DISTANCES (or percentages, the result is the same), not about the way to obtain them. – AbsurdePhoton Nov 11 '22 at 11:05
  • Now I remember my own struggles. The problem is likely related to the constraints of Euclidean Metric Space. When trying to compare or cluster as in your picture with yellow circles, we would do better by using hyperbolic space. Have a look: https://nishantkumar94.medium.com/curious-case-of-hyperbolic-embeddings-part-i-2886d8a12e39 – Similar pictures Nov 12 '22 at 00:54
  • Also a video: https://www.youtube.com/watch?v=-ksbWExpWis – Similar pictures Nov 12 '22 at 00:59

1 Answer


The answers to my own question:

  • about the structure itself: it is called an "undirected weighted graph". English is not my native language and I had a hard time first finding the right words; once I did, the solution was quickly found!
  • clustering: there are several algorithms associated with graphs, so I'll try some of them
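To make the answer concrete, here is one way the structure and the simplest graph clustering could look: an undirected weighted graph stored as adjacency lists, clustered into connected components over edges whose similarity passes a threshold (BFS traversal). The names (`Edge`, `components`) and the threshold semantics are illustrative assumptions, not the question's actual code:

```cpp
#include <queue>
#include <vector>

// Undirected weighted graph as adjacency lists: adj[u] holds the neighbors
// of image u with their similarity score. Each edge must appear in both
// endpoints' lists. components() labels every image with a cluster id,
// following only edges whose weight is >= threshold.
struct Edge { int to; double w; };

std::vector<int> components(const std::vector<std::vector<Edge>>& adj,
                            double threshold) {
    std::vector<int> comp(adj.size(), -1);
    int c = 0;
    for (int s = 0; s < static_cast<int>(adj.size()); ++s) {
        if (comp[s] != -1) continue;       // already labeled
        comp[s] = c;
        std::queue<int> q;
        q.push(s);
        while (!q.empty()) {               // BFS over strong-enough edges
            int u = q.front(); q.pop();
            for (const Edge& e : adj[u]) {
                if (comp[e.to] == -1 && e.w >= threshold) {
                    comp[e.to] = c;
                    q.push(e.to);
                }
            }
        }
        ++c;
    }
    return comp;
}
```

Connected components are only the baseline; community-detection algorithms on the same structure (e.g. Louvain-style methods) can split a component when weak edges chain unrelated groups together.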

Many thanks to @Similar_Pictures for taking the time to answer me, and for opening my eyes to the fact that the better the similarity algorithm(s), the less need there is for complicated clustering techniques...

I am actually testing how to combine several similarity techniques: each one has its flaws, but together some work best, using refined thresholds.
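The combination idea can be sketched very simply: once every method's distance is converted to a percentage of similarity (0..100), the scores become comparable and can be averaged with per-method weights before applying a single threshold. The weights below are made-up placeholders, not tuned values:

```cpp
#include <cstddef>
#include <vector>

// Weighted average of similarity percentages from several methods
// (e.g. pHash, feature matching, ...). Each score is assumed to already
// be normalized to the same 0..100 scale; the weights express how much
// each method is trusted.
double combined_similarity(const std::vector<double>& scores,
                           const std::vector<double>& weights) {
    double sum = 0.0, wsum = 0.0;
    for (std::size_t i = 0; i < scores.size() && i < weights.size(); ++i) {
        sum  += scores[i] * weights[i];
        wsum += weights[i];
    }
    return wsum > 0.0 ? sum / wsum : 0.0;
}
```

A pair would then be kept as "similar" only if the combined percentage passes the refined threshold, which is what lowers the false-positive rate compared to trusting any single method.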