
I have a PySpark job that applies complex transformations. In this code, we read from one particular Hive external table multiple times; to be precise, we repeatedly read the same subset of the data, filtered on a partition column.

Now, if I save this data into a managed table or a Databricks Delta table and access that table in the code instead, will performance improve?

Also, since I will be accessing all of the data in the new table, do I need to partition it?

I have implemented the non-partitioned managed table and Delta table and saw a 20% improvement, but I am not sure what the impact would be with a partitioned table.
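
Roughly, the pattern in question looks like this (the database, table, and column names below are placeholders, not my actual code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Current approach: the same partition subset of the Hive external
    # table is read multiple times across the job.
    subset = spark.table("hive_db.external_table").filter("part_col = '2023-08'")

    # Proposed approach: materialize the subset once as a managed Delta
    # table, then read the Delta copy everywhere else in the job.
    subset.write.format("delta").mode("overwrite").saveAsTable("work_db.subset_delta")

    # Partitioned variant I am asking about:
    # subset.write.format("delta").mode("overwrite") \
    #       .partitionBy("part_col").saveAsTable("work_db.subset_delta")

    delta_subset = spark.table("work_db.subset_delta")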

Ananth
  • Try caching it? – wBob Aug 24 '23 at 20:47 (a minimal caching sketch follows these comments)
  • .cache() keeps the table partitions. examples of usage here: https://spark.apache.org/docs/3.2.0/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.spark.cache.html – smurphy Aug 25 '23 at 01:44
  • In general, caching may help when querying the same table or dataframe repeatedly, and Delta Lake is generally more performant than formats like plain Parquet due to features like Delta's "data skipping". But everything else really depends: a lot on your code, the partitioning strategy, file size & skew, and compute hardware (e.g. Delta acceleration is available on certain instance types, plus Databricks' Photon engine, possibly Graviton instances on AWS, etc.). Can you provide more info, and at least some code? – Zach King Aug 26 '23 at 03:58
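
Following up on the caching suggestion, a minimal sketch of what that would look like (names are illustrative); note that .cache() is lazy, so an action is needed before the data is actually materialized:

    # Cache the repeatedly used subset so later actions reuse the
    # in-memory/disk copy instead of re-reading the Hive table.
    subset = spark.table("hive_db.external_table").filter("part_col = '2023-08'")
    subset.cache()      # lazy: nothing is materialized yet
    subset.count()      # first action materializes the cache
    # ... reuse `subset` across the downstream transformations ...
    subset.unpersist()  # release the cached data when finished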

0 Answers