We are using Hash-Key function for one of the source tables to create a unique identifier key. But Hash-Key function has some limitations with respective to 32 bit integer. We tried using MD5 but we don't want to use Char based key for Char based data.
-
If you don´t have unique input, you can´t have unique hash outputs, doesn´t matter what you´re trying. – deviantfan May 07 '15 at 08:39
1 Answers
You might find this question I asked interesting for further reading. One of the answers links to this MySQL documentation page which suggests using a VARBINARY
field for strings of arbitrary byte values. You haven't tagged your question, so I will phrase the rest of this answer in terms of MySQL; hopefully your RDBMS of choice isn't too hard to translate to.
Many encryption and compression functions return strings for which the result might contain arbitrary byte values. If you want to store these results, use a column with a
VARBINARY
orBLOB
binary string data type. This will avoid potential problems with trailing space removal or character set conversion that would change data values, such as may occur if you use a nonbinary string data type (CHAR
,VARCHAR
,TEXT
).
A hash function output is basically a very long number. You frequently see them as strings because many code libraries will display them as some encoded format (hexadecimal or Base32). As your question says, putting these into a nonbinary string fields is a bad idea and a waste of space and lookup time. So get your application to convert the output of the hash to binary data (most frequently a byte[]
) and store them in a VARBINARY
column.
Another option is to leave it as a string and encode it Base32 (5 bits per byte) which wastes significantly less space than hexadecimal (4 bits per byte) - 25% less to be exact. The chief advantage of this is that the strings remain human readable and transmittable over common protocols without further encoding. This makes it easier to match your database to the web visible data, which can save a great deal of development and debugging time. Then set the column to use a _bin
collation type, which speeds up the comparison at the cost of losing case sensitivity.
Note that you can't use this trick with Base64 encoding (6 bits per byte) because the base64 output is itself case sensitive.