
I am writing an ELT process in Databricks which will download around 50 million records every day, and the table will hold only a year's worth of data. The data is in a Teradata database, and I have optimized the query as best I could. I ran EXPLAIN to see how long it should take; it mostly estimates up to 10 minutes, but when I run the query in Databricks to download the data into a Spark DataFrame and eventually into a Delta table, it takes almost all day.

I can't use the date column as the partition column because each ELT run downloads only one day's worth of data, so I am thinking of using some other column.

sales_query = f""" SQL Query """

from time import sleep

# Retry the JDBC read up to 10 times to ride out transient connection failures.
for x in range(0, 10):
  print(' Attempt #' + str(x+1))
  try:
    sales = (spark.read.format("jdbc")
        .option("url", database_connection)
        .option("driver", "com.teradata.jdbc.TeraDriver")
        .option("fetchsize", "1000000")
        .option("query", sales_query)
        # .option("partitionColumn", "sales_date")
        # .option("lowerBound", f'{start_date}')
        # .option("upperBound", f'{end_date}')
        # .option("numPartitions", 8)
        .load())
    break
  except Exception as e:  # use a dedicated name for the caught exception
    print('Query failed. Reason: ' + str(e) + '\n' + 'Retrying...')
    sleep(sleep_time)  # sleep_time is defined elsewhere in the notebook

The read above only takes about 20 seconds, which did not make sense to me, but I was told that at this point Spark only fetches the metadata (schema) instead of all the data (50 million records); the actual transfer is deferred until an action such as the write runs.
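A quick way to see that lazy behaviour (just a sketch):

sales.printSchema()   # fast: only the schema has been fetched from Teradata so far
# The actual 50-million-row transfer starts only when an action runs, e.g. the write below.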

sales.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(f"{project_path}/{level}/{delta_file_name}")

Is there any way I can speed up this data extraction? I am not very familiar with Databricks clusters, but here is some info about the cluster I am using in case it helps.

12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1-2 Workers: 64-128 GB Memory, 8-16 Cores
1 Driver: 64 GB Memory, 8 Cores
Runtime: 12.2.x-scala2.12

hkay

1 Answer


Have a look at this article. You can synthesise the partition column by hashing over most fields (provided the hash algorithm and approach are the same on both sides), roughly as sketched below.
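A rough sketch of that idea (the names sales_table and customer_id are placeholders, not from the question): push a MOD of Teradata's HASHBUCKET(HASHROW(...)) down as a synthetic integer column and let Spark range-partition on it. Note that Spark's JDBC source needs dbtable (a subquery alias) rather than the query option when partitionColumn is set.

num_partitions = 8

partitioned_query = f"""
(SELECT s.*,
        HASHBUCKET(HASHROW(s.customer_id)) MOD {num_partitions} AS part_key
 FROM   sales_table s
 WHERE  s.sales_date = DATE '{start_date}') AS t
"""

sales = (spark.read.format("jdbc")
    .option("url", database_connection)
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("fetchsize", "1000000")
    .option("dbtable", partitioned_query)       # subquery alias instead of the query option
    .option("partitionColumn", "part_key")      # synthetic integer column, values 0..7
    .option("lowerBound", "0")
    .option("upperBound", str(num_partitions))
    .option("numPartitions", str(num_partitions))
    .load()
    .drop("part_key"))

Each partition then opens its own JDBC connection and pulls roughly an eighth of the day's rows in parallel.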

Alternatively, use upper and lower bounds over a row-number limit. Using this answer here as a base, your predicates could add:

QUALIFY RowNum_ BETWEEN $start and $startplusstep;
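A minimal PySpark sketch of that predicate-per-partition approach (the daily row count, the ORDER BY column and the table name are assumptions): wrap the numbered query in a derived table and pass one BETWEEN range per partition through the predicates argument of spark.read.jdbc, which plays the same role as the QUALIFY above.

row_count = 50_000_000                      # assumed approximate daily volume
num_partitions = 8
step = row_count // num_partitions + 1

numbered_query = f"""
(SELECT s.*,
        ROW_NUMBER() OVER (ORDER BY s.sales_id) AS RowNum_
 FROM   sales_table s
 WHERE  s.sales_date = DATE '{start_date}') AS t
"""

# One predicate per Spark partition; each one becomes its own JDBC connection.
predicates = [f"RowNum_ BETWEEN {start} AND {start + step - 1}"
              for start in range(1, row_count + 1, step)]

sales = spark.read.jdbc(
    url=database_connection,
    table=numbered_query,
    predicates=predicates,
    properties={"driver": "com.teradata.jdbc.TeraDriver", "fetchsize": "1000000"})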

(Also, you probably don't want .mode("overwrite"); that won't keep other days' results, let alone a year's worth.)
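For the daily write, a minimal sketch (assuming the Delta table has a sales_date column) that overwrites only the current day's slice and keeps the rest of the year intact:

(sales.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", f"sales_date = '{start_date}'")   # rewrite just this day's rows
    .save(f"{project_path}/{level}/{delta_file_name}"))

Plain .mode("append") also keeps history, but replaceWhere makes re-running a day idempotent.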

Chris
  • Thanks Chris. This answer looks very promising. Let me try and I will update you here. – hkay Aug 06 '23 at 16:05
  • I was able to speed up the process using the partition method mentioned in the article and the other Stack Overflow solution. Thanks again Chris. It only took 1.5 hours this time – hkay Aug 07 '23 at 22:12
  • Glad to see such a drop in time, nice! – Chris Aug 08 '23 at 09:50