SELECT checksum(c)
FROM (
   SELECT to_hex( md5( to_utf8(
     concat(
           trim(coalesce("dcpheader_dcploaddayutc", '')),
           trim(coalesce("dcpheader_dcploadmonthutc", ''))
           )
           ) ) ) -- end of to_hex, md5, to_utf8()
     FROM datalake_landing.instantoffer_v2_member_new
) AS z(c)

The SQL query above works in AWS Athena (Presto) and produces a single checksum for a set of rows, so that I can detect whether any data has changed, or compare a set of rows in one table with a set of rows in another table for row-set equality/inequality.
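To make the desired behaviour concrete: checksum() is an order-independent aggregate over per-row hashes. Presto's actual algorithm is implementation-defined, but the property can be illustrated in plain Python by XOR-ing per-row MD5 digests (an illustrative stand-in, not Presto's real function):

```python
import hashlib

def row_hash(*cols):
    # Mirrors to_hex(md5(to_utf8(concat(...)))): concatenate the trimmed,
    # NULL-coalesced columns, then MD5 the UTF-8 bytes.
    joined = "".join((c or "").strip() for c in cols)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def rowset_checksum(rows):
    # Order-independent aggregate: XOR the 16-byte digests together.
    # (Illustrative only; Presto's checksum() uses its own algorithm.)
    acc = bytes(16)
    for cols in rows:
        digest = bytes.fromhex(row_hash(*cols))
        acc = bytes(a ^ b for a, b in zip(acc, digest))
    return acc.hex()

rows = [("2019-03-01", "03"), ("2019-03-02", "03"), (None, "03")]
# The same rows in a different order produce the same checksum:
assert rowset_checksum(rows) == rowset_checksum(rows[::-1])
```

Because XOR is commutative, row order never affects the result, which is exactly what makes a single value usable for row-set comparison.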

The sample output is a single row/column, e.g. f5 09 49 03 a6 26 fd 5e

Sadly, the SQL above does not work in Spark SQL 2.4 (the latest version as of March 2019).

Running the same statement in Spark SQL produces errors because the built-in functions either do not exist or go by different names than in Presto.

https://spark.apache.org/docs/latest/api/sql/index.html#hex

I looked through the list of Spark SQL functions and found no equivalent of Presto's checksum() aggregate.

What I would like to know is whether the SQL query can be changed to produce the same kind of result in Spark SQL 2.4, ideally in pure Spark SQL, or failing that with PySpark.

This is my Spark/SQL version so far:

spark.sql("""
    SELECT hex( md5( encode(
      concat(
        trim(coalesce(dcpheader_dcploaddayutc, '')),
        trim(coalesce(dcpheader_dcploadmonthutc, ''))
        ), 'utf-8') ) )
    FROM datalake_landing.instantoffer_v2_member_new
""").show(10, False)

This row-set still needs to be checksummed/aggregated into one value, which Presto handles with its checksum() function.

Below is a "poor man's" substitute that computes a row-by-row checksum and inner-joins on it to compare two row-sets (source and destination tables) for equality. I was hoping Spark SQL had an easier/cleaner/faster alternative along the lines of Presto's checksum aggregate.

spark.sql("""
    WITH t1 AS (
      SELECT hex( md5( encode(
          concat(
                trim(coalesce(dcpheader_generatedmessageid, '')),
                trim(coalesce(dcpheader_dcploadmonthutc, ''))
                ), 'utf-8')
                )) AS c
      FROM datalake_reporting.instantoffer_v2_member_6
      ORDER BY c),
    t2 AS (
      SELECT hex( md5( encode(
          concat(
                trim(coalesce(dcpheader_generatedmessageid, '')),
                trim(coalesce(dcpheader_dcploadmonthutc, ''))
                ), 'utf-8')
                )) AS c
      FROM datalake_landing.instantoffer_v2_member_new
      ORDER BY c)
    SELECT count(*) FROM t1 INNER JOIN t2 USING (c)
""").show(100, False)
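One caveat with the join above: a matching count(*) only proves overlap, not equality, since either side could still hold hashes the other lacks. Row-set equality also requires both set differences to be empty, which Spark 2.4 can express with EXCEPT. A sketch (query string only, with the same set logic emulated in plain Python):

```python
# Hypothetical Spark 2.4 equality check over the t1/t2 hash CTEs above:
# both EXCEPT directions must be empty for the row-sets to be equal.
equality_check = """
SELECT
  (SELECT count(*) FROM (SELECT c FROM t1 EXCEPT SELECT c FROM t2)) +
  (SELECT count(*) FROM (SELECT c FROM t2 EXCEPT SELECT c FROM t1)) AS diffs
"""
# diffs = 0 means the two hash sets are identical as sets.

def rowsets_equal(h1, h2):
    # Plain-Python equivalent: compare the two hash collections as sets.
    return set(h1) == set(h2)

assert rowsets_equal(["a", "b"], ["b", "a"])
assert not rowsets_equal(["a", "b"], ["a"])
```

Also note the ORDER BY inside the CTEs adds a sort that the join never benefits from, so it can be dropped.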

Thank you

Acid Rider

1 Answer


I was able to run an md5 query in Athena (Presto) that produced the same digest as Spark, see below:

SELECT to_hex(md5(to_utf8('Spark')))
8CDE774D6F7333752ED72CACDDB05126

vs.

> SELECT md5('Spark');
 8cde774d6f7333752ed72cacddb05126
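The two outputs above match except for case: Presto's to_hex() returns uppercase hex while Spark's md5() returns lowercase, so normalise case before comparing hashes across the two systems. A quick check in plain Python:

```python
import hashlib

# Same MD5 digest rendered both ways, using the 'Spark' test vector above.
spark_style = hashlib.md5(b"Spark").hexdigest()  # lowercase, like Spark md5()
presto_style = spark_style.upper()               # uppercase, like Presto to_hex()

assert spark_style == "8cde774d6f7333752ed72cacddb05126"
assert presto_style == "8CDE774D6F7333752ED72CACDDB05126"
assert spark_style == presto_style.lower()  # compare after normalising case
```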

see for more details:

https://spark.apache.org/docs/latest/api/sql/index.html#hex
https://prestodb.io/docs/current/functions/binary.html

Hanan Shteingart