I need to understand the algorithm used by Hive to hash partition data. For example, Spark uses Murmur Hashing. Any ideas or resources?
1 Answer
Partitions in Hive are folders: there is one folder for each partition key value (the key can be composite), and the values are not hashed. Hive does not support other partitioning types such as hash or range.
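To make the folder layout concrete, here is a minimal sketch of how a partition spec maps to a directory path. The function name, the warehouse path and the column names are all made up for illustration, and the sketch ignores the escaping Hive applies to special characters in partition values.

```python
def partition_path(table_dir, partition_spec):
    """Build a Hive-style partition directory path.

    partition_spec is an ordered list of (column, value) pairs, so a
    composite key simply becomes nested key=value folders.
    Assumption: values contain no characters that need escaping.
    """
    parts = [f"{column}={value}" for column, value in partition_spec]
    return "/".join([table_dir.rstrip("/")] + parts)


# Hypothetical table location and partition columns.
print(partition_path("/warehouse/sales", [("year", "2024"), ("country", "US")]))
```

The value appears in the folder name as-is, which is why no hash function is involved in plain partitioning.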
However, you can calculate a hash in SQL and use dynamic partitioning when writing the data.
For example, with reflect you can call a static Java method:
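-- target_table is a placeholder name; the partition column must come last in the SELECT list
insert into table target_table partition(mycolumn)
SELECT ...
reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', mycolumn)
FROM mytable;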
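For reference, this is roughly what the hash-derived partition value looks like, sketched in Python rather than HiveQL. `sha256_hex` mirrors what `DigestUtils.sha256Hex` returns for a string (the hex SHA-256 digest of its UTF-8 bytes). `bounded_partition` is my own hypothetical addition: partitioning on the full digest gives one folder per distinct value, so in practice you may prefer to reduce the hash to a fixed number of partitions.

```python
import hashlib


def sha256_hex(value):
    """Hex-encoded SHA-256 digest of the UTF-8 bytes of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def bounded_partition(value, num_partitions):
    """Hypothetical helper: map a value to one of num_partitions ids.

    Interprets the digest as a big integer and takes it modulo
    num_partitions, so the number of folders stays bounded.
    """
    return int(sha256_hex(value), 16) % num_partitions


digest = sha256_hex("abc")
print(digest)                       # 64 hex characters
print(bounded_partition("abc", 16)) # an id between 0 and 15
```

In Hive itself a similar bounded value could be computed with something like `pmod(hash(mycolumn), 16)`, though I have not benchmarked either approach.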
Hive also has the native functions int hash(a1[, a2...]), sha2(string/binary, int) and mask_hash(string|char|varchar).
Hive does use hashing for bucketing. Buckets are files inside the table or partition folder. See this question about hashing in buckets.
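As a rough illustration of bucket assignment, here is a sketch of the scheme as I understand it for Hive's older (pre-3.0) bucketing: a Java-style hash of the column value, masked to a non-negative int, modulo the number of buckets. Treat the details as assumptions. Newer Hive versions can use a Murmur3-based hash instead, and the function names below are mine, not Hive's.

```python
def to_int32(n):
    """Wrap a Python int to a signed 32-bit value, like Java int overflow."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def string_hash(value):
    """Java-style 31*h + byte hash over the UTF-8 bytes (bytes are signed)."""
    h = 0
    for b in value.encode("utf-8"):
        signed = b - 256 if b > 127 else b
        h = to_int32(31 * h + signed)
    return h


def bucket_for(hash_code, num_buckets):
    """Clear the sign bit, then take the remainder to pick a bucket file."""
    return (hash_code & 0x7FFFFFFF) % num_buckets


print(string_hash("abc"))                 # 96354, same as "abc".hashCode() in Java
print(bucket_for(string_hash("abc"), 4))  # 2
print(bucket_for(10, 4))                  # 2, assuming an int column hashes to its own value
```

The sign-bit mask matters because the hash can be negative, and a plain remainder on a negative number would not give a valid bucket index.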

leftjoin