
I am performing an incremental load on data coming from a Teradata database and storing it as a parquet file. Because the tables from Teradata contain billions of rows, I would like my PySpark script to compare hash values.

Teradata Table: An example table from Teradata

Current Stored Parquet File: Data stored in parquet file

My PySpark script uses a JDBC read connection to make the call to Teradata:

tdDF = spark.read \
  .format("jdbc") \
  .option("driver", "com.teradata.jdbc.TeraDriver") \
  .option("url", "jdbc:teradata://someip/DATABASE=somedb,MAYBENULL=ON") \
  .option("dbtable", "(SELECT * FROM somedb.table) tmp") \
  .load()

Spark script that reads in the parquet:

myDF = spark.read.parquet("myParquet")
myDF.createOrReplaceTempView("myDF")
spark.sql("select * from myDF").show()

How can I:

  • Include a hash function in my call to Teradata that returns the hash of the entire row's values (this hash should be performed on Teradata)
  • Include a hash function in my PySpark code when reading in the parquet file that returns the hash of the entire row's values (this hash should be performed in Spark)
  • Compare these two hashes to determine the delta from Teradata that needs to be loaded (a sketch of this comparison follows below)
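
For the comparison to be meaningful, both hashes have to use the same algorithm and the same row serialization; Teradata's built-in HASHROW returns a Teradata-specific 4-byte value that Spark cannot reproduce, so a hash pushed into the JDBC subquery only helps if a matching function (for example an MD5 UDF) is also installed on the Teradata side. A minimal sketch, assuming the example columns Name, Account and Product and computing both hashes in Spark after reading the raw rows:

from pyspark.sql import functions as F

# Columns that make up a row in the example tables; adjust to the real schema.
cols = ["Name", "Account", "Product"]

def with_row_hash(df):
    # Cast every column to string, replace NULLs with a sentinel so the
    # concatenation stays unambiguous, then hash the delimited string with SHA-256.
    safe_cols = [F.coalesce(F.col(c).cast("string"), F.lit("<NULL>")) for c in cols]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *safe_cols), 256))

tdHashed = with_row_hash(tdDF)   # rows currently in Teradata (JDBC read above)
pqHashed = with_row_hash(myDF)   # rows already stored in the parquet file

# Rows whose hash exists in Teradata but not in the parquet file are the delta to load.
deltaDF = tdHashed.join(pqHashed.select("row_hash"), on="row_hash", how="left_anti")
deltaDF.show()

deltaDF.drop("row_hash") can then be appended to the existing parquet output to complete the incremental load.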

1 Answer


You want to insert new rows, or, if rows with the identifying info already exist, update them. This is called 'upsert', or in Teradata, 'merge'.

It depends on which columns are allowed to change and which ones make a row 'new'.

In your example you have:

Teradata

    Name   Account  Product
    ------+--------+---------
    Sam    1234     Speakers
    Jane   1256     Earphones
    Janet  3214     Laptop
    Billy  5678     HardDisk

parquet

    Name   Account  Product
    ------+--------+---------
    Sam    1234     Speakers
    Jane   1256     Earphones

So if any Name, Account combination should be unique, the database table should have a unique key defined on those columns.

With that, the database won't allow inserting another row with the same unique key, but will allow you to update it.

So going by this example, with your example data, your SQL commands would look like:

UPDATE somedb.table SET product = 'Speakers' WHERE name = 'Sam' AND account = 1234 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Sam',1234,'Speakers');
UPDATE somedb.table SET product = 'Earphones' WHERE name = 'Jane' AND account = 1256 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Jane',1256,'Earphones');
UPDATE somedb.table SET product = 'Laptop' WHERE name = 'Janet' AND account = 3214 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Janet',3214,'Laptop');
UPDATE somedb.table SET product = 'HardDisk' WHERE name = 'Billy' AND account = 5678 ELSE INSERT INTO somedb.table(name, account, product) VALUES('Billy',5678,'HardDisk');

But this is a very simplistic approach that will likely perform very poorly.

Googling 'teradata bulk upload' turns up a number of relevant links; there are likely many others.
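
One common bulk pattern, sketched here only as an assumption (the staging table somedb.table_stage and the reuse of the deltaDF from the question are not from the original post): land the delta rows in a staging table with Spark's generic JDBC writer, then apply one set-based merge from staging into the target instead of issuing row-by-row upserts.

# A minimal sketch: write the delta rows to a staging table via Spark's JDBC
# writer; the staging table name is a placeholder.
deltaDF.write \
  .format("jdbc") \
  .option("driver", "com.teradata.jdbc.TeraDriver") \
  .option("url", "jdbc:teradata://someip/DATABASE=somedb") \
  .option("dbtable", "somedb.table_stage") \
  .mode("append") \
  .save()

From there, a single MERGE (Teradata's set-based upsert) from somedb.table_stage into somedb.table applies all the changes at once.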

jmullee