Could anyone help provide a tutorial on hashing?

Question

Recently I've read some papers on hashing techniques. It seems that hashing is everywhere.

In computer science, the hash table is commonly used as an efficient look-up data structure.

In cryptography, hashing appears in techniques such as the MD5 and SHA hash functions.

In the database area, hashing is used to build the keys of tables.

In machine learning, hashing is used to create short hash codes for efficient processing and economical storage, as in locality-sensitive hashing, min-hash, sim-hash, the hashing trick, and so on.

What do these applications of hashing have in common, and where do they differ? Could you point me to some readings or references on these kinds of hashing, especially on the differences between them? I'm confused by all these hashing techniques.

cybermike · Accepted Answer · 2015-03-29T12:52:46.830

I think the essential point of hashing is the ability to take a group of content that is variable length, dynamic in nature, and asynchronous, and be able to apply an algorithm to each member of that content that results in a "stable", fixed-sized, and essentially unique identifier for each. That is the point of most of the examples you cited:

Hash Tables: transform a variable length key string or structure into a "stable" unique identifier with known lower and upper bounds (aka row numbers in an array, addresses of rows in an array, row numbers in a database).
Cryptography: transform a variable length plain-text into a stable, unique, and fixed-length identifier.
Machine learning (at least the Hashing Trick): transform words (and perhaps their context) into a stable and unique key in a universal, numerically organized ontology.
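
To make the three cases above concrete, here is a minimal Python sketch. The helper names (`stable_hash`, `bucket_index`, `hashed_features`, `fingerprint`) are made up for illustration, and using SHA-256 for all three is only a convenience; real hash tables and feature hashers normally use much faster non-cryptographic functions. Python's built-in `hash()` is avoided for strings because it is randomized per process, so it is not "stable" across runs.

```python
import hashlib


def stable_hash(text: str) -> int:
    """Hypothetical helper: deterministic 64-bit integer derived from SHA-256."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_index(key: str, num_buckets: int) -> int:
    """Hash table view: variable-length key -> row number in [0, num_buckets)."""
    return stable_hash(key) % num_buckets


def fingerprint(data: bytes) -> str:
    """Cryptographic view: any input -> fixed-length identifier (64 hex chars)."""
    return hashlib.sha256(data).hexdigest()


def hashed_features(words, dim: int):
    """Hashing-trick view: each word increments one slot of a fixed-size vector."""
    vec = [0] * dim
    for word in words:
        vec[stable_hash(word) % dim] += 1
    return vec


if __name__ == "__main__":
    print(bucket_index("alice@example.com", 16))
    print(fingerprint(b"some long document ..."))
    print(hashed_features("the cat sat on the mat".split(), 8))
```

In all three functions the input can be any length, while the output has a fixed size: a row number, a 256-bit digest, or a vector of `dim` counters.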

In all these cases you are making a small summary of the variable length content within each member of the group. Those small summaries make it much easier to deal with all the variable length content, and in the case of hash tables can significantly speed up processing. In the case of cryptography especially, they can provide significant benefits, such as password protection (when using proper keyed and repetitive hashing) or content integrity verification.
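
As a rough sketch of those two cryptographic uses, the snippet below uses only Python's standard library. Reading "keyed and repetitive hashing" as PBKDF2 (salted, iterated HMAC) is my interpretation, one option among several; the iteration count and the function names are assumptions for illustration, not a recommendation.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # assumed value for the sketch; tune for your hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, derived_key) using salted, iterated HMAC-SHA256 (PBKDF2)."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for each stored password
    derived = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)


def content_matches(data: bytes, expected_hex: str) -> bool:
    """Integrity check: does the content still hash to the published digest?"""
    actual_hex = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual_hex, expected_hex)


if __name__ == "__main__":
    salt, stored = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, stored))
    print(verify_password("wrong guess", salt, stored))
```

Only the salt and the derived key are stored, never the password itself; verification repeats the same computation and compares the results.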

You will note that hashes almost always result in the potential for collisions: that is, two completely different members of the group, with different content, for which the hash algorithm generates the same summary/hash value. A critical part of the design of the hash function is to determine the acceptable collision rate, and a critical part of the design of the hash implementation is to deal properly with a collision when it happens. For a hash table using only a small amount of RAM the collision rate may be high. Using 256-bit crypto-hashing functions, the probability of collision may be effectively zero.
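
A small sketch of both ends of that spectrum, in Python. It assumes the hash behaves like a uniform random function and uses the standard birthday approximation 1 - exp(-n(n-1)/(2m)) for n items in m slots; separate chaining is just one common way to handle collisions, and the function names are invented for the example.

```python
import hashlib
import math


def collision_probability(num_items: int, num_slots: int) -> float:
    """Birthday approximation of P(at least one collision), assuming uniform hashing."""
    exponent = -num_items * (num_items - 1) / (2 * num_slots)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values


def build_chained_table(keys, num_slots: int):
    """Tiny hash table with separate chaining: colliding keys share a bucket list."""
    table = [[] for _ in range(num_slots)]
    for key in keys:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % num_slots
        table[slot].append(key)
    return table


if __name__ == "__main__":
    # Small table: nine keys cannot fit into eight slots without sharing one.
    table = build_chained_table([f"user{i}" for i in range(9)], 8)
    print(max(len(bucket) for bucket in table))

    # 256-bit digests: even a trillion items give a vanishingly small probability.
    print(collision_probability(10**12, 2**256))
```

With chaining, a collision costs only a short scan of the shared bucket, which is why a small table can tolerate a high collision rate while a cryptographic hash cannot tolerate any practical one.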

Also, hashing is almost always "one way". Most hashing algorithms are deliberately "lossy" (which is why duplicates happen), and because of that one usually cannot reverse calculate the original variable length content from just the summary/hash value. There are brute force ways around that, but simple and fast reverse calculation is usually not possible.
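
To illustrate the brute-force caveat, here is a toy Python sketch. It assumes the attacker already knows the input is a 4-digit PIN, so there are only 10,000 candidates to try; the function names are hypothetical. Nothing is being "reversed" here: the code simply hashes every candidate forward until one matches.

```python
import hashlib
from itertools import product
from typing import Optional


def sha256_hex(text: str) -> str:
    """Forward direction: cheap to compute for any input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force_pin(target_hex: str, length: int = 4) -> Optional[str]:
    """Try every digit string of the given length; return the one that matches."""
    for digits in product("0123456789", repeat=length):
        candidate = "".join(digits)
        if sha256_hex(candidate) == target_hex:
            return candidate
    return None  # the original input was not in the searched space


if __name__ == "__main__":
    secret_digest = sha256_hex("4821")
    print(brute_force_pin(secret_digest))
```

This only works because the input space is tiny. For long, unpredictable inputs the same search is infeasible, which is the sense in which the hash is "one way".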

Note that we use "hashing algorithms" in our real lives as well. We use first names of coworkers in large companies as a convenience in talking, emailing, and chatting (a trivial hash) even though there will certainly be many coworkers with the same first name. And thus collisions happen ("Do you mean Mary in Accounting or Mary in Shipping?"). You may "hash" all the known products of facial tissue into the word "Kleenex" (at least in the U.S.), yet still prefer to buy and use a different brand.

Hi @cybermike, thank you very much for replying with such a detailed explanation! I'm a newbie in hashing, so I have only a very superficial understanding of it. I think hash functions are usually designed to map data items into buckets while trying to avoid collisions. But feature hashing in machine learning maps similar data items to similar or identical hash codes. Both aim at economical storage and faster processing. I think the difference mainly lies in the fact that different applications need different hash function designs. — mining, Mar 29 '15 at 12:42
I ask this question mainly because I want a full view of hashing, which would help me understand its characteristics in depth and choose among the techniques according to their different advantages and disadvantages. — mining, Mar 29 '15 at 12:46
In your mind, separate the high-level concept of hashing from the specific techniques and implementations used for specific problem sets. The concept of hashing allows humans and computers to take non-numeric data, where each member has its own unique value, and transform it into a numeric summary. The implementations of hashing are quite diverse, are applied to a lot of different problems and design patterns, and indeed re-use the word "hash", but are only conceptually similar. — cybermike, Mar 29 '15 at 12:48
Thanks, @cybermike! As you said, hashing in different applications is conceptually similar. I think it will take time to learn all its usages and differences. Once I have worked on some projects involving them, or have more experience with them, I will understand them more easily. — mining, Mar 29 '15 at 12:59

