1

Let's say we have MySQL table Image with following columns

  • id
  • user_id
  • p_hash

I know how to calculate hamming distance (to reveal similar images) between newly inserted row's perceptual hash and all existing data in table. SQL query looks like this:

SELECT `Image`.*, BIT_COUNT(`p_hash` ^ :hash) as `hamming_distance`
FROM `Image`
HAVING `hamming_distance` < 5

I want to do the same to every existing image.(to check if there are similar images in the database) So, I have go through every row of the Image table, do the same process as above and find similar images from the table.

Now the question is, after whole procedure I want to get similar image groups only if elements of each group has at least one different user_id id?

So if, found group of similar images belongs to one user, then skip it. But if it belongs to multiple different users then return it as one of results.

Please help to figure out.

Donnie Sparko
  • 59
  • 1
  • 1
  • 5

1 Answers1

0

Sounds like you want a self-join.

SELECT i1.id, GROUP_CONCAT(i2.id) AS similar_images
FROM Image AS i1
JOIN Image AS i2 ON i1.user_id != i2.user_id AND BIT_COUNT(i1.`p_hash` ^ i2.p_hash) < 5
GROUP BY i1.id

DEMO

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • 1
    I don't think this is quite what the OP was after, but a sample data set would doubt clarify the matter – Strawberry Jun 15 '18 at 06:34
  • How about skipping similar images if its belongs to same user? I need similars only if one of image belongs to one user, another one belongs to different user. – Donnie Sparko Jun 15 '18 at 08:51
  • I changed query to query below but it doesn't return any result `SELECT i1.id, GROUP_CONCAT(i2.id) AS similar_images FROM Image AS i1 JOIN Image AS i2 ON i1.id != i2.id AND i1.user_id != i2.user_id AND BIT_COUNT(i1.`p_hash` ^ i2.p_hash) < 5 GROUP BY i1.id` – Donnie Sparko Jun 15 '18 at 09:50
  • No need for `i1.id != i2.id` if the user IDs are different. But if I understand what you're looking for I think this should work. Are you sure you have images from different users whose hashes are similar? I don't think the datatype matters. – Barmar Jun 15 '18 at 15:47
  • Can I apply this query also to p_hash column with bin(64) type? example values are: 1000000000000000000010100000000000101010000000000010101010000000 – Donnie Sparko Jun 16 '18 at 00:50
  • I don't see why it would make a difference. – Barmar Jun 16 '18 at 00:53
  • Try modifying my sqlfiddle with 64-bit data and see what happens. – Barmar Jun 16 '18 at 00:53
  • I did. It takes very long time: around 10 minutes. Is there any way to make it faster? – Donnie Sparko Jun 16 '18 at 23:21
  • Probably not. It's comparing every row to almost every other row. There's no way to make use of indexes for something like this. – Barmar Jun 16 '18 at 23:49
  • Here's an idea. Add another column containing the bit count of the hash, and index it (if you're using MySQL 5.7 or newer, this can be a generated column). Then you only need to test compare the hashes of rows that have similar bit counts. it might be able to use an index if you do `i1.hash_bit_count BETWEEN i2.hash_bit_count - 5 AND i2.hash_bit_count + 5` – Barmar Jun 16 '18 at 23:52
  • and one more question: is perceptual hash applicable to icons also? or there are chances for mistake on icons – Donnie Sparko Jun 18 '18 at 20:22
  • I don't know much about perceptual hash technology. But I guess with a small image, it only takes a few changes to distort it significantly, but this might not change the hash very much. – Barmar Jun 18 '18 at 20:36