0

To put it simply i have a very large database with hundreds of thousands of entries and hundreds of different columns.

Some of those columns need to be hashed in order to save space, etc.. However when i try to hash them like this:

select distinct
columnA + hashbytes('sha1', [Column_in_question]) 
from [dbo].[Tabled_in_question]

I end up with more rows than if i just did this:

select distinct
columnA + [Column_in_question]
from [dbo].[Tabled_in_question]

My best guess is that the select distinct is not case sensitive, whereas Hashbytes is. But i don't really know how i can test this or fix it.

Any ideas?

1 Answers1

0

you are right the difference is the case sensitivity

you can check it using

select distinct
convert(VARBINARY(10), [Column_in_question]),
columnA + hashbytes('sha1', [Column_in_question]) 
from [dbo].[Tabled_in_question]

the collation of db is most probably CI (case insensitive) but hashbytes use.. bytes, and as you can see converting text to varbinary, they are different

try this to change the collation and comparision rules

select distinct
columnA + [Column_in_question] collate LATIN1_GENERAL_BIN
from [dbo].[Tabled_in_question]
MtwStark
  • 3,866
  • 1
  • 18
  • 32
  • I had a look at your suggestions and while that does help a bit, there is still some discrepancy between the number of rows. 3971 when i used the collate versus 3974 when using the hashbytes. Still, it's much closer than the 3952 i got when not using collate, so i'm assuming that there is also some other issue with that column – Tiago Dias Jan 11 '17 at 10:58
  • what you mean ? – MtwStark Jan 11 '17 at 10:59
  • I meant that i still get more rows using the HASHBYTES than with the collation you suggested. So i'm assuming that there is some other issue, beyond the case sensitivity that is causing more rows to appear. – Tiago Dias Jan 11 '17 at 11:11
  • test `convert(VARBINARY(max), [Column_in_question])` this is the binary content of column.. if it is the same, hash should be the same – MtwStark Jan 11 '17 at 11:19
  • It is the same! Thanks! – Tiago Dias Jan 11 '17 at 11:23