I have a table with a column that stores an image src, which is a hash value generated from microtime(). Now I have two choices: storing the hash value directly in the database, or storing the bigint microtime from which the image name is derived. Which would make my DB faster?
-
I doubt it will make any difference to speed, unless the hashcode datatype is much shorter than the time datatype, but an important scientific principle may apply here: don't lose the original data. You never know what you may need it for. – user207421 May 24 '12 at 07:47
-
I cannot understand how you can possibly infer that from what I said. I am in favour of exactly the opposite. The 'original data' in this case is the clearly the microtime from which the hash is generated. – user207421 May 25 '12 at 02:18
2 Answers
We have to analyze this from all sides to assess what speed penalties are incurred.
I will make a few assumptions:
- this data will be used as an identifier (primary key, unique key, composite key);
- this data is used for searches and joins;
- you are using a hashing algorithm such as SHA1 that yields a 40-character string of hex-encoded data (MD5 yields a 32-character string of hex-encoded data; everything said below can be adapted to MD5 if that's what you're using);
- you may be interested in converting the hex values of the hash into binary to reduce the storage required by half and to improve comparison speed;
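To make the assumptions concrete, here is a minimal Python sketch (illustrative only; the microtime-derived name is hypothetical, modeled on the question) showing the three candidate column sizes: a BIGINT microtime, the CHAR(40) hex digest, and the BINARY(20) raw digest that halves the storage:

```python
import hashlib
import time

# Hypothetical reconstruction of the question's setup: an image name
# derived by hashing a microsecond timestamp.
microtime = int(time.time() * 1_000_000)   # fits in a BIGINT (8 bytes)

# SHA1 as hex: what you'd store in a CHAR(40) column.
hex_hash = hashlib.sha1(str(microtime).encode()).hexdigest()

# The same digest as raw bytes: what you'd store in a BINARY(20) column
# (MySQL's UNHEX() does the equivalent conversion server-side).
raw_hash = bytes.fromhex(hex_hash)

print(len(hex_hash))   # 40 characters -> CHAR(40)
print(len(raw_hash))   # 20 bytes      -> BINARY(20), half the storage
```

Note the ratio holds for MD5 as well: 32 hex characters versus a 16-byte binary digest.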
Inserting and Updating on the application side:
As @Namphibian stated, this is composed of 2 operations for the BIGINT versus 3 operations for the CHAR.
But the speed difference, in my opinion, really isn't that big. You can run 10,000,000 continuous calculations (in a while loop) and benchmark them to find out the real difference between them.
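A rough benchmark along those lines can be sketched in Python (timings will vary by machine; the specific values used here are made up for illustration):

```python
import timeit

# Compare a BIGINT-style integer equality test against a CHAR(40)-style
# string equality test, 10,000,000 iterations each, as the answer suggests.
N = 10_000_000
big_a = big_b = 1337840827123456            # hypothetical microtime value
hex_a = hex_b = "a" * 40                    # stand-in for a 40-char hex hash

t_int = timeit.timeit(lambda: big_a == big_b, number=N)
t_str = timeit.timeit(lambda: hex_a == hex_b, number=N)
print(f"int compare: {t_int:.2f}s  str compare: {t_str:.2f}s")
```

In practice both loops finish in seconds, which supports the point that the application-side difference is small compared to database-side effects.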
Also, a speed difference in the application code affects users linearly, while speed differences in the DB affect users nonlinearly as traffic increases, because overlapping writes have to wait for each other and some reads have to wait for writes to finish.
Inserting and Updating on the DB side:
This is almost the same for a BIGINT as it is for a CHAR(40) or a BINARY(20), because the more serious time consumption is spent waiting for access to the disk rather than actually writing to it.
Selecting and Joining on the DB side:
This is always faster for a BIGINT compared to a CHAR(40) or a BINARY(20) for two reasons:
- BIGINT is stored in 8 bytes while CHAR(40) is stored in 40 bytes and BINARY(20) in 20 bytes;
- BIGINT's serially increasing nature makes it predictable and easy to compare and sort.
The second-best option is BINARY(20), because it saves some space and is easier to compare due to its reduced length.
Both BINARY(20) and CHAR(40) are the result of the hashing mechanism and are randomized, so comparing and sorting take longer on average because randomized data in an index (for a B-tree index) needs more tree traversals to fetch (I mean that in the context of multiple values, not for one single value).
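The "predictable versus randomized" point can be illustrated with a small Python sketch (the IDs are invented for the example): sequential BIGINT keys always sort in insertion order, so new rows append at the tail of the index, while hashed keys land at effectively random positions, which is what forces the extra B-tree traversals and page splits:

```python
import hashlib

# Hypothetical auto-increment-style keys: monotonically increasing.
ids = list(range(1000, 1020))

# The corresponding SHA1 hex hashes: effectively random ordering.
hashes = [hashlib.sha1(str(i).encode()).hexdigest() for i in ids]

print(ids == sorted(ids))          # True: inserts always hit the index tail
print(hashes == sorted(hashes))    # False: inserts scatter across the index
```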

An important scientific principle may apply here: don't lose the original data. You never know what you may need it for.

-
A timestamp regarding when some file was uploaded is not *original data*, it is metadata. *Original data* would be the original filename of the uploaded file. – Mihai Stancu Jun 25 '12 at 10:42
-
Also, I agree that the original data should be preserved, so storing the original image name alongside an identifier (one more optimized for indexing and retrieval) would be my option. – Mihai Stancu Jun 26 '12 at 09:20