
So I have a DataFrame that writes about 300 GB of records to S3. The data is partitioned into roughly 2,000 output partitions with:

.partitionBy("DataPartition", "PartitionYear", "PartitionStatement")

But the issue is that the partitions are heavily skewed: some hold huge amounts of data (40 GB) while others hold only about 10 MB.

So if I also add something like .repartition(100), it creates many files even for a 10 MB partition, and that leads to a huge number of output files. When I later try to load all of those output files and run my Spark job, the many small files make the job very slow.
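To be concrete, this is the combination that blows up the file count (the write path here is illustrative):

    // round-robin repartition(100) spreads every key combination across
    // all 100 tasks, so partitionBy can emit up to 100 files per combination:
    // ~2K combinations x 100 tasks ≈ 200K output files in the worst case
    dfMainOutputFinalWithoutNull
      .repartition(100)
      .write
      .partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
      .save("s3://bucket/output") // illustrative path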

What I am looking for is a way to repartition only the parts of the DataFrame that have a huge number of records.

Let's say I know in advance that one partition (Japan / 1970 / BAL) will have huge records. Can we repartition just that one partition?

I did not find anything about this on the internet.
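What I am imagining is something like the sketch below: split off the known-heavy slice, repartition only that slice, and write both parts under the same base path (the filter values, partition counts, and path here are illustrative):

    import spark.implicits._ // for the $"..." column syntax

    // split off the known-heavy key and shuffle only that slice
    val isHeavy = $"DataPartition" === "Japan" &&
      $"PartitionYear" === "1970" &&
      $"PartitionStatement" === "BAL"

    val heavy = dfMainOutputFinalWithoutNull.filter(isHeavy).repartition(100)
    val rest  = dfMainOutputFinalWithoutNull.filter(!isHeavy).coalesce(20)

    // append both slices under the same base path
    Seq(heavy, rest).foreach(
      _.write
        .mode("append")
        .partitionBy("DataPartition", "PartitionYear", "PartitionStatement")
        .save("s3://bucket/output") // illustrative path
    )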

I even tried this:

dfMainOutputFinalWithoutNull
      .repartition($"DataPartition", $"PartitionYear", $"PartitionStatement")
      .write
      .partitionBy("DataPartition", "PartitionYear")
      .save("s3://bucket/output") // placeholder for my real output path

But this did not turn out to perform well either.
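Is salting worth trying here? The idea (a sketch I have not tested; the 8-bucket salt and the path are arbitrary) is to add a random column to the repartition keys so that a heavy combination gets split across several tasks, at the cost of a few extra small files for the light ones:

    import org.apache.spark.sql.functions.rand

    // each key combination is spread over up to 8 tasks, so the 40 GB
    // partition becomes up to 8 files instead of one giant file
    val salted = dfMainOutputFinalWithoutNull
      .withColumn("salt", (rand() * 8).cast("int"))
      .repartition($"DataPartition", $"PartitionYear", $"PartitionStatement", $"salt")
      .drop("salt")

    salted.write
      .partitionBy("DataPartition", "PartitionYear")
      .save("s3://bucket/output") // illustrative path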

Please suggest something so that the number of output files does not get out of hand and my job also runs faster.

This does not work for me either:

    val rddFirst = sc.textFile(mainFileURL)
    val rdd = rddFirst.repartition(190)

because I derive one field from the file name, and after this repartition, splitting out the file name fails with an error.
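Is the right workaround to capture the file name as a column before repartitioning? Something like this (untested; `spark` is the session and the column name is arbitrary):

    import org.apache.spark.sql.functions.input_file_name

    // grab the source file name before the shuffle; after a repartition,
    // input_file_name() no longer returns anything useful
    val df = spark.read
      .text(mainFileURL)
      .withColumn("source_file", input_file_name())
      .repartition(190)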

  • You should repartition equally across all executors. The way you are repartitioning is good before you perform a series of aggregations. So before the write you should repartition to as many partitions as you have executors; let's say you have 16 executors running in parallel, then you should do .repartition(16) and you should be good – Ramesh Maharjan Mar 21 '18 at 11:31
  • @RameshMaharjan But with .repartition(16), 2K partitions × 16 files would be 32K total files. If I then load all of these next time in Spark, won't that create a problem? – Sudarshan kumar Mar 21 '18 at 11:36
  • reduce the repartition value then – Ramesh Maharjan Mar 21 '18 at 12:58
  • @RameshMaharjan Will `rddFirst.repartition(190)` work? I am getting one field from the file name as well – Sudarshan kumar Mar 22 '18 at 07:48
  • I don't know if that will get you the fastest time. You have to test it with different configurations: maybe increase the executors, or increase and decrease the memory, and so on. I repeat again: **you should repartition equally across all executors** – Ramesh Maharjan Mar 22 '18 at 08:12

0 Answers