1

Using Hive, I want to hash an entire row in the query.

I tried the following (don't mind the ${xxx}, query is built from a bash script):

    SELECT md5(*) FROM ${DATABASE_NAME_SUFFIXE}.${DATABASE_PREFIXE}_${TABLE_NAME} WHERE 
${TABLE_DATE_FIELD} <= '${LIMIT_DATE}' ORDER BY ${CREATION_DATE_FIELD} DESC LIMIT 1

This returns the following error:

Line 1:7 Wrong arguments 'md5': No matching method for class org.apache.hadoop.hive.ql.udf.UDFMd5 with (bigint, int, varchar(128), timestamp, timestamp, varchar(64), varchar(64), varchar(64), int, bigint, int, varchar(50), varchar(255), bigint, timestamp, timestamp, varchar(64), bigint, timestamp, timestamp, varchar(64), int, int, char(38), varchar(40), varchar(1)). Possible choices: FUNC(binary) FUNC(string)

If I understand correctly from the error and from the documentation of md5 function, I need to pass binary or string. How can I achieve that ?

Edit: also tried:

SELECT md5(SELECT * FROM ${DATABASE_NAME_SUFFIXE}.${DATABASE_PREFIXE}_${TABLE_NAME} WHERE ${TABLE_DATE_FIELD} <= '${LIMIT_DATE}' ORDER BY ${CREATION_DATE_FIELD} DESC LIMIT 1) 

which returns

cannot recognize input near 'SELECT' '*' 'FROM' in function specification

Itération 122442
  • 2,644
  • 2
  • 27
  • 73

1 Answers1

1

concat all the columns and then have md5() on the concatenated column.

select md5(concat(a,b)) as md5 from (select string("abc")a,int("2")b)e;
+--------------------------------+
|md5                             |
+--------------------------------+
|63872b5565b2179bd72ea9c339192543|
+--------------------------------+

We can also define variable having all column names then use in concat function!


Try with concat(*):

select md5(concat(*)) as md5 from (select string("abc")a,int("2")b)e;
+--------------------------------+
|md5                             |
+--------------------------------+
|63872b5565b2179bd72ea9c339192543|
+--------------------------------+
notNull
  • 30,258
  • 4
  • 35
  • 50
  • There is absolutely no way to do it without specifying all column names ? – Itération 122442 Mar 24 '20 at 16:02
  • @FlorianCastelain try with `concat(*)`! – notNull Mar 24 '20 at 16:07
  • Working on it right now. However in the subquery, it seems mandatory to indicate column names. It ads complexity to my query building logic. For example, this won't work: `SELECT md5(concat(*)) as hash from (SELECT * FROM BATCH WHERE last_modified_time_u <= '' ORDER BY creation_time_u DESC LIMIT 1)` – Itération 122442 Mar 25 '20 at 07:13
  • @FlorianCastelain, I was able to run similar query: **`select md5(concat(*)) from (select * from i where 1=1)e;`** you need to alias subquery.. **`SELECT md5(concat(*)) as hash from (SELECT * FROM BATCH WHERE last_modified_time_u <= '' ORDER BY creation_time_u DESC LIMIT 1)e`** – notNull Mar 25 '20 at 14:34