I am writing an ELT process in Databricks that will download around 50 million records every day, and the target table will only ever hold a year's worth of data. I optimized the query as best I could. The source data is in a Teradata database. I ran EXPLAIN to see how long the query should take, and it mostly estimates up to 10 minutes, but when I run the query in Databricks to pull the data into a Spark dataframe and eventually into a Delta table, it takes almost all day.
I can't use the date column as the JDBC partition column because each ELT run only downloads one day's worth of data, so I am thinking of using some other column instead, maybe something like the rough sketch just below.
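For example, this is the untested idea I had in mind. Here store_nbr is a made-up numeric column, the bounds and partition count are placeholders I would still have to tune, and sales_query is the same query string as in my actual code further down. (As far as I understand, Spark's JDBC reader does not allow the query option together with partitionColumn, hence the dbtable subquery.)

sales = (spark.read.format("jdbc")
         .option("url", database_connection)
         .option("driver", "com.teradata.jdbc.TeraDriver")
         # wrap the day's query as a derived table so the partitioning options apply
         .option("dbtable", f"({sales_query}) AS sales_src")
         # hypothetical integer column with a reasonably even spread of values
         .option("partitionColumn", "store_nbr")
         .option("lowerBound", "1")
         .option("upperBound", "5000")
         # each partition becomes its own JDBC connection reading a slice of the range
         .option("numPartitions", 16)
         .option("fetchsize", "1000000")
         .load())

For reference, here is my current code: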
from time import sleep

sales_query = f""" SQL Query """

# retry the JDBC read a few times in case of transient connection failures
for x in range(0, 10):
    connection_error = False
    print(' Attempt #' + str(x + 1))
    try:
        sales = (spark.read.format("jdbc")
                 .option("url", database_connection)
                 .option("driver", "com.teradata.jdbc.TeraDriver")
                 .option("fetchsize", "1000000")
                 .option("query", sales_query)
                 # .option("partitionColumn", "sales_date")
                 # .option("lowerBound", f'{start_date}')
                 # .option("upperBound", f'{end_date}')
                 # .option("numPartitions", 8)
                 .load())
        break
    except Exception as err:
        connection_error = True
        print('Query Failed. Reason ' + str(err) + '\n' + 'Retrying...')
        sleep(sleep_time)
The code above only takes about 20 seconds, which did not make sense to me at first, but I was told that at this point Spark only fetches the metadata (the schema), not the actual 50 million records.
sales.write.format("delta").mode("overwrite").option("overwriteSchema","true").save(f"{project_path}/{level}/{delta_file_name}")
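For what it's worth, my (possibly wrong) understanding is that the write above is the first action, so the entire 50-million-row pull actually happens at that point. A quick check like this should show whether the dataframe comes back as a single partition, i.e. one JDBC connection doing all the work:

# how many partitions (and therefore parallel JDBC connections / write tasks) does Spark use?
print(sales.rdd.getNumPartitions())  # with .option("query", ...) and no partitioning options this is usually 1

If it prints 1, then a single Teradata session is streaming all the rows, which would explain why the write takes so long regardless of the cluster size.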
Is there any way I can speed up this data extraction? I am not very familiar with Databricks clusters, but here is some info about the cluster I am using, in case it helps.
12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)
Workers: 1-2 (64-128 GB memory, 8-16 cores)
Driver: 1 (64 GB memory, 8 cores)
Runtime: 12.2.x-scala2.12