I am reading a 60 GB CSV file with PySpark, applying a few basic transformations, and loading the result into a Hive dynamic-partitioned table. The HDFS block size is 128 MB, so Spark creates 400+ input partitions. The transformations complete in a few minutes, but the load into the partitioned table takes nearly an hour. The Hive execution engine is Tez. As a test, I loaded the same data into an unpartitioned table and it took less than 4 minutes. How can I improve the performance in this scenario?
I'm using the Hive Warehouse Connector (HWC) to write from Spark to Hive.
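
For reference, here is roughly what the job looks like. This is a minimal sketch, not my exact code: the input path, table name, partition column, and transformation are placeholders, and the exact HWC write options may vary by HWC version.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark_llap import HiveWarehouseSession  # Hive Warehouse Connector bindings

spark = SparkSession.builder.appName("csv_to_hive_partitioned").getOrCreate()
hive = HiveWarehouseSession.session(spark).build()

# Read the ~60 GB CSV; with a 128 MB HDFS block size this yields 400+ partitions.
df = spark.read.csv("/data/input/big_file.csv", header=True, inferSchema=True)

# A few basic transformations (placeholder logic).
df = df.withColumn("load_date", F.to_date(F.col("event_ts")))

# Write into the Hive table through HWC with dynamic partitioning.
# "db.target_table" and the "load_date" partition column are placeholders.
df.write \
    .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector") \
    .option("table", "db.target_table") \
    .option("partition", "load_date") \
    .mode("append") \
    .save()
```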