I am performing an incremental load on data coming from a Teradata database and storing it as a parquet file. Because the tables in Teradata contain billions of rows, I would like my PySpark script to compare row-level hash values instead of comparing every column.
Teradata Table: An example table from Teradata
Current Stored Parquet File: Data stored in parquet file
My PySpark script uses a JDBC read connection to make the call to Teradata:
tdDF = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://someip/DATABASE=somedb,MAYBENULL=ON") \
    .option("dbtable", "(SELECT * FROM somedb.table) tmp") \
    .load()
Spark script that reads in the stored parquet file:
myDF = spark.read.parquet("myParquet")
myDF.createOrReplaceTempView("myDF")
spark.sql("select * from myDF").show()
How can I:
- include a hash function in my call to Teradata that returns a hash of the entire row's values (this hash should be computed on Teradata),
- include a hash function in my PySpark code when reading in the parquet file that returns a hash of the entire row's values (this hash should be computed in Spark), and
- compare the two hashes to identify the delta rows from Teradata that need to be loaded (my rough attempt at the comparison is sketched below)?
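For that last point, my rough idea is a left anti join on the key plus the hash, so that any row whose (key, hash) pair is not already in the stored parquet is treated as new or changed. key_col is a placeholder for the table's real primary key, and I am not sure this is the right pattern at billions of rows:

# Rows from Teradata whose (key, hash) pair does not exist in the stored
# parquet are either brand new or have changed -> this is the delta to load.
# key_col is a placeholder for the real primary-key column(s).
deltaDF = tdDF.join(
    myHashedDF.select("key_col", "row_hash"),
    on=["key_col", "row_hash"],
    how="left_anti"
)

deltaDF.show(5)

# (Rows deleted on the Teradata side would need the reverse anti join,
#  i.e. myHashedDF left-anti-joined against tdDF.)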