2

Using the SQL Hive function md5() is generating a MD5 hash, with 32 characters. Is there a way in Hive to have the character limit reduced without compromising the integrity of the output? Unfortunately, I have a limit for certain columns that cannot be changed.

I am trying to mask some data fields and thought using a MD5 would be the best option, but am open to other methods. I know the rand() function can be used, but hoping to avoid it.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
Logan
  • 293
  • 3
  • 11

1 Answers1

2

Use CRC32 or hash256 and truncate value to the desired length, this will be good enough if you do not need crypto-strong algorithm. sha256 truncated will have less probability of collisions, I'd recommend it.

Example:

select crc32('Some test message') AS CRC32, 
       sha2('Some test message', 256) as sha256, 
       substr(sha2('Some test message', 256),1,10) as sha256truncated

Result:

crc32       sha256                                                              sha256truncated
111182007   43bf899ff002b5fa8e0510b22943ee5c15e467e26fe777d0740ba004ebf3a28d    43bf899ff0
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • Just to clarify, you recommend sha256 to avoid many collisions? I have a large dataset and want to avoid truncating if I can, but I guess that's unavoidable. – Logan Feb 23 '21 at 20:27
  • 1
    @Logan then keep it as long as possible. for different values every byte of sha function will be different, sha function truncated is still is sha function, of course the probability of collisions will increase. – leftjoin Feb 23 '21 at 20:32
  • 1
    @Logan If the dataset is big, collisions are unavoidable. Read also this: https://stackoverflow.com/q/19962424/2700344 – leftjoin Feb 23 '21 at 20:35
  • Thank you for the helpful article. My goal essentially is to take my random IDs and change it to a max of about 20 character value. I think this will definitely help. Let me know if I missed anything! – Logan Feb 23 '21 at 20:43
  • 1
    @Logan hash is much better than rand. and rand is not actually random. One more good article about collision probability of truncated hash https://news.ycombinator.com/item?id=14952547 – leftjoin Feb 23 '21 at 20:50
  • @Logan Also what is most important, random does not mean unique. If you have already some unique key and want to reduce it's length, sha256 is still better option then random values. though I do not get completely what are you doing. some data example would be great if you need more help – leftjoin Feb 23 '21 at 21:16
  • Thank you for your comment. I guess my goal is to create unique values for my ID column. I am trying to basically mask the original values so I can share data with others without revealing sensitive details. The ID column will have numerous repeats, and as such the unique output values should account for that. Did I make more sense? – Logan Feb 23 '21 at 21:19
  • 1
    @Logan sha256 will be exactly the same for same input values. it is deterministic and one-way function. If you need non-deterministic hashes, salt your value with random before hashing. But I guess you need just deterministic one-way function for obfuscation, preserving uniqueness and possibility to join or group by with same function without possibility to get original values, sha256 is best fit – leftjoin Feb 23 '21 at 21:23
  • 1
    @Logan You can also apply some logic for common known values or nulls and return NULLs for example for empty strings or spaces only if those should be treated as NULLs, etc. – leftjoin Feb 23 '21 at 21:32
  • Hopefully I am not bothering you too much, but after running the code sha2(column_name, 256), I am receiving the error that the argument type is mismatch and sha2 only takes string_group or binary_group as the first argument. Can I not simply state the column name from another table I need hashed? – Logan Feb 24 '21 at 19:14
  • 1
    @Logan Convert to string `sha2(string(col_name), 256)` – leftjoin Feb 24 '21 at 19:39