We ingest a lot of images from external sources. I would like to ensure that images that have already been ingested are not re-ingested in the backend. For this I was thinking of generating a GUID based on the image's byte stream, obtained as follows:

File.ReadAllBytes()

or

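// Re-encodes the decoded image as GIF and returns the encoded bytes.
public byte[] imageToByteArray(System.Drawing.Image imageIn)
{
    // Dispose the stream once the bytes have been copied out.
    using (MemoryStream ms = new MemoryStream())
    {
        imageIn.Save(ms, System.Drawing.Imaging.ImageFormat.Gif);
        return ms.ToArray();
    }
}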

I was then thinking of turning this into a CLR function (if that is necessary at all) and saving the GUID with the image's metadata in SQL Server. I'm not sure how reliably unique that GUID would be.

Any input?

Thanks

Buju
    You don't create a GUID based on input data, it's more like a random number. I think you're looking for some kind of hash. – Mark Ransom Oct 03 '12 at 03:44

2 Answers

0

As @Mark Ransom suggested, you're confusing a GUID with a hash. A GUID is an identifier that is supposed to be unique; it is generated independently of any input. A hash is computed from its input: identical inputs always produce identical hashes, and different inputs produce different hashes in the vast majority of cases.

A common hash algorithm to use is MD5. Here's a link to a similar question on SO.

Alternatively, you could avoid writing code by using existing command-line utilities, such as md5sum, sort and uniq.
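
If you do want to stay in C#, the idea can be sketched roughly as below. This is a minimal sketch, not a definitive implementation: the class name ImageHasher and its method names are made up for illustration. It hashes the raw file bytes (what File.ReadAllBytes returns), so it only detects byte-identical files; the same picture saved in another format or size will hash differently. Since an MD5 digest is exactly 16 bytes, it also fits the Guid(byte[]) constructor if a GUID-shaped column is wanted, though the result is just a 128-bit value in GUID formatting, and Guid reorders some bytes, so its string form will not match the hex digest.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; the names are not from the question.
public static class ImageHasher
{
    // Lowercase hex MD5 of the given bytes. Identical input always
    // yields the identical string.
    public static string Md5Hex(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(data);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Packs the 16-byte MD5 digest into a Guid. This is only a
    // GUID-formatted hash, not a randomly generated GUID.
    public static Guid Md5AsGuid(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return new Guid(md5.ComputeHash(data));
        }
    }

    // Convenience overload: hash a file on disk by its raw bytes.
    public static string Md5HexOfFile(string path)
    {
        return Md5Hex(File.ReadAllBytes(path));
    }
}
```

Storing the result in a column with a unique index would let the database reject a second ingest of the same file. If the external sources cannot be trusted, swapping MD5.Create() for SHA256.Create() gives a collision-resistant digest, but that one is 32 bytes and no longer fits in a Guid.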

mpenkov
  • A hash (generally speaking) is *not* guaranteed to be unique. – Damien_The_Unbeliever Oct 03 '12 at 07:28
  • Generally speaking, you're correct. With current hash algorithms like MD5, collisions do occur, but are pretty rare in practice. So it's pretty close to being guaranteed, practically speaking. I've changed the wording of my answer to avoid misleading people. – mpenkov Oct 03 '12 at 07:33
  • While this is true, is there a way to convert an MD5 (or whatever other hash function result) into a GUID reliably? Or, to be more specific - convert an MD5 into a *128 bit value formatted into blocks of hexadecimal digits separated by a hyphen*? – macwier Oct 14 '21 at 10:19
0

Here's one solution for a "fingerprint string" algorithm.

As the post says, you will often want the same visual to map to the same string even if the file formats or the image sizes differ. So this algorithm squashes the image into an 8x8 thumbnail with a 62-color palette (you could probably achieve the same thing with ImageMagick).

This transform leaves you with an image of 64 values ranging from 1 to 62.

In other words, a short base-62 string.
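
The last step, turning the 64 thumbnail values into a string, might look something like the sketch below. It rests on assumptions that are not from the linked post: the thumbnail is taken to be already downscaled to 8x8 and reduced to one grayscale byte per pixel, the 62 levels come from evenly quantizing that byte rather than from a real color palette, and levels are numbered 0 to 61 so they index directly into a 0-9, A-Z, a-z alphabet. The class name Fingerprint and method FromThumbnail are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not the linked post's code.
public static class Fingerprint
{
    // 10 digits + 26 uppercase + 26 lowercase = 62 symbols.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // pixels: 64 grayscale values (0-255) of an 8x8 thumbnail, row by row.
    // Returns a 64-character base-62 string.
    public static string FromThumbnail(byte[] pixels)
    {
        if (pixels == null || pixels.Length != 64)
            throw new ArgumentException("Expected 64 grayscale values (8x8).", nameof(pixels));

        var sb = new StringBuilder(64);
        foreach (byte p in pixels)
        {
            // Evenly quantize 0..255 down to a level in 0..61.
            int level = p * Alphabet.Length / 256;
            sb.Append(Alphabet[level]);
        }
        return sb.ToString();
    }
}
```

Because of the coarse quantization, near-identical images tend to produce the same or nearly the same string, which is the point of a fingerprint, but it also means that distinct images can collide far more easily than with a cryptographic hash.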

mahemoff