SELECT checksum(c)
FROM
(
SELECT to_hex( md5( to_utf8(
concat(
trim(coalesce("dcpheader_dcploaddayutc",'')),
trim(coalesce("dcpheader_dcploadmonthutc",''))
)
) ) ) -- end of to_hex, md5, to_utf ()
FROM datalake_landing.instantoffer_v2_member_new
) as z(c)
The above SQL query works in AWS Athena/Presto and produces a single checksum for a set of rows so that I can determine if any data has changed or to compare a set of rows in one table with a set of rows in another table for a row-set equality/inequality.
The sample output is 1 row/col i.e. f5 09 49 03 a6 26 fd 5e
Sadly the above SQL does not work in Spark/SQL 2.4 (latest version as of March 2019).
Running same statement in Spark SQL produces errors as the built-in functions do not exist, at least not under same name as in Presto.
https://spark.apache.org/docs/latest/api/sql/index.html#hex
I looked at list of Spark/SQL functions and not found a Checksum equivalent to Presto.
What I would like to know is if it is possible to change the SQL query to produce the same result in Spark/SQL 2.4? Ideally in a pure Spark/SQL or as a second option using PySpark/SQL?
This is my Spark/SQL version so far:
spark.sql("""SELECT hex( md5( \
encode( \
concat( \
trim(coalesce(dcpheader_dcploaddayutc,'')), \
trim(coalesce(dcpheader_dcploadmonthutc,''))
) \
, 'utf-8') ) ) \
FROM datalake_landing.instantoffer_v2_member_new """).show(10,False)
This row-set needs to be checksum/aggregated into one value, which Presto handles with a Checksum() function.
Below is a "poor man" substitute which uses row-by-row-checksum to inner join to compare two row-sets for equality - target and destination tables. I was hoping for easier/cleaner/faster alternative in Spark SQL which Presto checksum aggregate provides.
spark.sql(""" \
with t1 as ( \
SELECT hex( md5( encode( \
concat( \
trim(coalesce(dcpheader_generatedmessageid,'')), \
trim(coalesce(dcpheader_dcploadmonthutc,'')) \
), 'utf-8') \
)) as c \
FROM datalake_reporting.instantoffer_v2_member_6 order by c), \
t2 as ( \
SELECT hex( md5( encode( \
concat( \
trim(coalesce(dcpheader_generatedmessageid,'')), \
trim(coalesce(dcpheader_dcploadmonthutc,'')) \
), 'utf-8') \
) ) as c \
FROM datalake_landing.instantoffer_v2_member_new \
order by c) \
select count(*) from t1 inner join t2 using (c) \
""").show(100, False)
Thank you