0

I came across a sql code, which creates primary keys with hashbytes function and md5 algorithm. The code looks like this:

SELECT
     CONVERT(VARBINARY(32),
        CONVERT( CHAR(32),
            HASHBYTES('MD5', 
            (LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))
            ),
        2)
    ) 
FROM database.schema.table

I find it hard to understand for what is the result from hashbytes function is converted to char and then to varbinary, when we get directly varbinary from hashbytes function. Is there any good reason to do so?

ulie
  • 15
  • 3
  • 1
    There's no good reason for this code in the first place. A primary key that changes when a column value changes is useless. Never mind the *guaranteed* collisions - hashes by definition produce identical values for different data. – Panagiotis Kanavos Oct 12 '22 at 14:50
  • *"I came across a sql code"* Where did you come accross this? In your own environment? Did you speak to the author if so? On the internet? What was the source? – Thom A Oct 12 '22 at 14:50
  • 2
    @Larnu I wouldn't want to talk to the author of that code. It wouldn't go very well – Panagiotis Kanavos Oct 12 '22 at 14:53
  • I'm not sure I would want to either, @PanagiotisKanavos , but if the code *is* from their own environment and they *have* spoken to them, then we might get some insight into that person's (mis)understandings. If they've copy pasta'd it from a website, then if nothing else the OP isn't following the [referencing guidelines](https://stackoverflow.com/help/referencing). – Thom A Oct 12 '22 at 14:55
  • @Larnu, I have got this code from two different sources: from a student's project, which was curated by University professor during Database course, with the company where I work at, and from a colleague from another company, whom I asked for a sample of code, how they generate hash keys. I contacted a student from this project and he said, that they took this sample from previous student's projects and he had no idea why there is conversion into CHAR. The colleague from another company also has no idea, since he also copied it from someone from his company. – ulie Oct 12 '22 at 16:31
  • Because this "logic" comes from two different sources and one was even approved (overseen many times) by a University professor, I was wondering if there is something that I'm missing – ulie Oct 12 '22 at 16:34
  • 3
    *"was even approved by a University professor"* this honestly holds no value without knowing the professor; some of the quality of content we see here from university professors can be awful. – Thom A Oct 12 '22 at 16:52
  • @Larnu, yes, he could also miss it. But that the same thing came from another company. That made me concerned. Maybe they had students from the same University as well. However, I could not find the true author of this code to get his explanation and no one, who used it, could clarify it to me as well. – ulie Oct 12 '22 at 17:07
  • @ulie so now you know you can't get away with blindly trusting others. You have to understand the code. What this code does is pointless, even if it wasn't uses as a primary key. The only thing it does reliably is generate warnings ,because `MD5` is deprecated in SQL Server. `HASHBYTES` [always returns a varbinary(16)](https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver16#arguments) when MD5 is used. Converting this to `CHAR(32)` will pad it with 16 spaces, which means the end result will be the original padded with space bytes (0x20). – Panagiotis Kanavos Oct 13 '22 at 06:50
  • @ulie `how they generate hash keys.` why do you want a hash for a key in the first place? That's the actual answer you need. Why not use the columns as part of a PK or unique index instead? Or use a database-generated IDENTITY or Sequence ID values? – Panagiotis Kanavos Oct 13 '22 at 07:31

1 Answers1

0

Short Version

This code pads a hash with 0x20 bytes which is rather strange and most likely due to misunderstandings by the initial author. Using hashes as keys is a terrible idea anyway

Long Version

Hashes are completely inappropriate for generating primary keys. In fact, since the same hash can be generated from different original data, this code is guaranteed to produce duplicate values, causing collisions at best.

Worst case, you end up updating or deleting the wrong row, resulting in data loss. In fact, given that MD5 was broken over 20 years ago, one can calculate the values that would result in collisions. This has been used to hack systems in the past and even generate rogue CA certificates as far back as 2008.

And even worse, the concatenation expression :

(LTRIM(RTRIM(COALESCE(column1,'')))+';'+LTRIM(RTRIM(COALESCE(column2,''))))

Will create the same initial string for multiple different column values.

On top of that, given the random nature of hash values, this results in table fragmentation and an index that can't be used for range queries. Primary keys most of the time are clustered keys as well, which means they specify the order rows are stored on disk. Using essentially random values for a PK means new rows can be added at the middle or even the start of a table's data pages.

This also harms caching, as data is loaded from disk in pages. With a meaningful clustered key, it's highly likely that loading a specific row will also load rows that will be needed very soon. Loading eg 50 rows while paging may only need to load a single page. With an essentially random key, you could end up loading 50 pages.

Using a GUID generated with NEWID() would provide a key value without collisions. Using NEWSEQUENTIALID() would generate sequential GUID values eliminating fragmentation and once again allowing range searches.

An even better solution would be to just create a PK from the two columns :

ALTER TABLE ThatTable ADD PRIMARY KEY (Column1,Column2);

Or just add an IDENTITY-generated ID column. A bigint is large enough to handle all scenarios :

Create ThatTable (
    ID bigint NOT NULL IDENTITY(1,1) PRIMARY KEY,
    ...
)

If the intention was to ignore spaces in column values there are better options:

  • The easiest solution would be to clean up the values when inserting them.
  • A CHECK constraint can be added to each column to ensure the columns can't have leading or trailing spaces.
  • An INSTEAD OF trigger can be used to trim them.
  • Computed, persisted columns can be added that trim the originals, eg Column1_Cleaned as TRIM(Column1) PERSISTED. Persisted columns can be used in indexes and primary keys

As for what it does:

  1. It generates deprecation warnings (MD5 is deprecated)
  2. It pads the MD5 hash with 0x20 bytes. A rather ... unusual way of padding data. I suspect whoever first wrote this wanted to pad the hash to 32 bytes but used some copy-pasta code without understanding the implications.

You can check the results by hashing any value. The following queries

select hashbytes('md5','banana')
----------------------------------
0x72B302BF297A228A75730123EFEF7C41
select cast(hashbytes('md5','banana') as char(32))
--------------------------------
r³¿)z"Šus#ïï|A                

A space in ASCII is the byte 0x20. Casting to binary replaces spaces with 0x20, not 0x00

select cast(cast(hashbytes('md5','banana') as char(32)) as varbinary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4120202020202020202020202020202020

If one wanted to pad a 16-byte value to 32 bytes, it would make more sense to use 0x00. The result is no better than the original though

select cast(hashbytes('md5','banana') as binary(32))
------------------------------------------------------------------
0x72B302BF297A228A75730123EFEF7C4100000000000000000000000000000000

To get a real 32-byte hash, SHA2_256 can be used :

select hashbytes('sha2_256','banana')
------------------------------------------------------------------
0xB493D48364AFE44D11C0165CF470A4164D1E2609911EF998BE868D46ADE3DE4E
Panagiotis Kanavos
  • 120,703
  • 13
  • 188
  • 236