0

I am working on comparing Hash Values on SAP Hana Data and Azure SQL DB Data. However, it seems Data with special/encoded characters is returning different hash values on SAP Hana DB and SQL DW, even though the Values are same.

Query on Azure SQL DB:

SELECT HASHBYTES('MD5', '不锈钢平垫M10')
-- OUTPUT: 0xF2E5A4AF781A8D5DE0007A24E172CA47

SELECT HASHBYTES('MD5', 'ABCDEFGHIJK')
-- 0x30A4E38230885E27D1BB3FD0713DFA7D

Query on SAP Hana:

SELECT HASH_MD5(TO_BINARY('不锈钢平垫M10')) FROM DUMMY
-- OUTPUT: 0xB5C6519A1F1506431540CFF5980490B4

SELECT HASH_MD5(TO_BINARY('ABCDEFGHIJK')) FROM DUMMY
-- 0x30A4E38230885E27D1BB3FD0713DFA7D

Same is the case with SHA256 algo as well

Require support in handling the encoded/special characters with Hashing

Saurabh Mehta
  • 91
  • 1
  • 2
  • 9
  • If you are going to use varchar, you should consider the code page and encoding model on each. For SQL, it would inherit from the default database collation for this encoding. SQL supports MBCS with various code pages and UTF-8 collations on varchar. For nvarchar, it is UTF-16. In short, try casting each of the first values to binary and see what you get before the MD5 hash – Conor Cunningham MSFT Jun 10 '22 at 00:23
  • Does this answer your question? [Data Comparison between SAP Hana and SQL Server](https://stackoverflow.com/questions/72313347/data-comparison-between-sap-hana-and-sql-server) – Lars Br. Jun 10 '22 at 23:34
  • @LarsBr., That is a different topic, the hash values weren't matching coz they were different length data types: SAP Hana had varchar(3) and SQL DB had varchar(6) – Saurabh Mehta Jun 11 '22 at 03:27
  • @ConorCunninghamMSFT, following are the binary values on hana and sql server: SQL Server `select convert(binary,'不锈钢平垫M10') 0x3F3F3F3F3F4D313000000000000000000000000000000000000000000000` SAP HANA `SELECT TO_BINARY('不锈钢平垫M10') FROM DUMMY 0xE4B88DE99488E992A2E5B9B3E59EABEFBCADEFBC91EFBC90` – Saurabh Mehta Jun 11 '22 at 03:42
  • @ConorCunninghamMSFT, another confusion is how come its giving correct hash values for String 'ABCDEFGHIJK' on both HANA and SQL – Saurabh Mehta Jun 11 '22 at 12:07
  • 1
    So the binary values are different - that explains why the hash values are different. The reason behind this, I believe, is that there is an encoding difference for strings between what you are doing in SAP Hana and in SQL. SQL supports different collations (https://learn.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16 ) and some of these use different code pages (encodings for strings). You need to work backwards to what is your default database collation in SQL to understand the encoding format. May not match Hana. – Conor Cunningham MSFT Jun 11 '22 at 22:20

1 Answers1

2

Conor Cunningham MSFT is spot on! The different hash values are the result of different binary values. Each DBMS stores the string differently.

As documented for SAP HANA, the (N)VARCHAR data type stores unicode data (CESU-8/UTF-16).

All the characters from the input string 不锈钢平垫M10 are from the Basic Multilingual Plane and therefore encoded in the same way as in UTF-8.

Checking the string encoding in UTF-8 yields:

cat input.txt | xxd
00000000: e4b8 8de9 9488 e992 a2e5 b9b3 e59e abef  ................
00000010: bcad efbc 91ef bc90                      ........

Which is the exact same byte sequence that SAP HANA uses to encode the string (spaces inserted for readability):

SELECT  TO_BINARY('不锈钢平垫M10') FROM DUMMY  
0xE4B8 8DE9 9488 E992 A2E5 B9B3 E59E ABEF BCAD EFBC 91EF BC90

To make the hash functions work the same, the same binary input data is required. HANA does not support any other encoding for unicode character data, but SQL Server allows for storing character data in unicode. So, one option is to convert the character data column in SQL Server to the matching unicode encoding. This blog post covers this well.

Lars Br.
  • 9,949
  • 2
  • 15
  • 29