So I have a DataFrame that writes around 300 GB of records to S3. All the data is partitioned into about 2K partitions using
.partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
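For reference, the full write statement is roughly like this (the format and the S3 path here are placeholders, not my real values):

dfMainOutputFinalWithoutNull.write
  .partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
  .format("csv")                   // placeholder format
  .save("s3://my-bucket/output/")  // placeholder path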
But the issue is that some partitions hold very large data (around 40 GB) while others hold only about 10 MB.
So if I repartition it again, like .repartition(100),
then it creates many files even for the 10 MB partitions, and that leads to a huge number of output files.
When I then load all the output files and run my Spark job, it becomes very slow because of the many small files.
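If I understand the writer correctly, after .repartition(100) each of the 100 shuffle tasks can contain rows from many of the roughly 2K partition-column combinations, and partitionBy writes one file per combination per task, so in the worst case that is about 100 × 2,000 = 200,000 output files.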
What I am looking for is a way to repartition only those partitions that have a huge number of records.
Let's say I know that the partition below (Japan, 1970, BAL)
will have a huge number of records; can we repartition only that partition?
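Something like this rough sketch is what I have in mind (the Japan/1970/BAL filter, the partition counts, and outputPathS3 are only placeholders, and I am assuming the usual import spark.implicits._ for the $ syntax):

val isHuge = $"DataPartition" === "Japan" &&
  $"PartitionYear" === "1970" &&
  $"PartitionStatement" === "BAL"

// give the known-skewed combination more partitions and keep the rest compact
val hugePart  = dfMainOutputFinalWithoutNull.filter(isHuge).repartition(100)
val smallPart = dfMainOutputFinalWithoutNull.filter(!isHuge).coalesce(10)

hugePart.union(smallPart)          // unionAll on older Spark versions
  .write
  .partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
  .save(outputPathS3)              // placeholder output path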
I did not find anything about this on the internet.
I even tried this:
dfMainOutputFinalWithoutNull
  .repartition($"DataPartition", $"PartitionYear", $"PartitionStatement")
  .write
  .partitionBy("DataPartition", "PartitionYear")
But this also did not result in good performance. I suspect that is because repartitioning on the same columns sends all the rows of one combination to a single task, so the 40 GB partition is still written as one huge file.
Please suggest something so that the number of output files does not become very huge and my job also runs faster.
This is not working for me either:
// read the raw input and repartition it before the downstream parsing
val rddFirst = sc.textFile(mainFileURL)
val rdd = rddFirst.repartition(190)
Because when I try to split the file name, I get an error.