
I need to remove the empty partitions from a DataFrame.

We have two DataFrames, both created using sqlContext. The DataFrames are constructed and combined as shown below.

import org.apache.spark.sql.{SQLContext}

val sqlContext = new SQLContext(sc)

// Loading Dataframe 1
val csv1 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1) 

// Loading Dataframe 2
val csv2 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv2)

// Combining dataframes 
val combinedDF = csv1DF.
                join(csv2DF, csv1DF("column_1") === csv2DF("column_2"))

Now the number of partitions for combinedDF is 200. From what I have read, the default number of partitions after a join is 200 (the spark.sql.shuffle.partitions setting).

In some cases the DataFrame/CSV is not big, and we end up with many empty partitions, which causes issues in later parts of the code.
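For reference, here is a minimal sketch of how the empty partitions can be observed; combinedDF is the joined DataFrame from above, and the per-partition counting is only for illustration:

// Number of partitions after the join (200 by default)
println(combinedDF.rdd.partitions.size)

// Count how many rows each partition holds and how many partitions are empty
val partitionSizes = combinedDF.rdd
  .mapPartitions(iter => Iterator(iter.size))
  .collect()
println(s"Empty partitions: ${partitionSizes.count(_ == 0)} of ${partitionSizes.length}")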

So how can I remove these empty partitions?

sag
    You can `repartition` your dataframe. – abalcerek Jul 21 '15 at 10:47
  • What size do I need to provide for repartition? – sag Jul 21 '15 at 11:13
  • @user52045 For repartition I have to provide the new number of partitions, but it's hard to find the perfect size at runtime. For me, I think just removing the empty partitions should be fine. – sag Jul 22 '15 at 03:59

1 Answer


The repartition method can be used to create a DataFrame (or RDD) without any empty partitions.
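For example, a minimal sketch (the target of 10 partitions is just an illustrative value, not a recommendation):

// Shuffle the joined data into a smaller number of partitions;
// with enough rows, none of the resulting partitions will be empty.
val compactedDF = combinedDF.repartition(10)
println(compactedDF.rdd.partitions.size)  // 10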

This thread discusses the optimal number of partitions for a given cluster. Here is a good rule of thumb for estimating the optimal number of partitions:

number_of_partitions = number_of_cores * 4

If you have a cluster of 8 r3.xlarge AWS nodes, you should use 128 partitions (8 nodes * 4 CPUs per node * 4 partitions per CPU).
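As a sketch, the same rule of thumb can be applied in code. Using sc.defaultParallelism as the core count is an assumption; it typically reflects the number of cores available to the application, but it can be configured differently:

// Roughly 4 partitions per available core (rule of thumb above)
val targetPartitions = sc.defaultParallelism * 4

val repartitionedDF = combinedDF.repartition(targetPartitions)
println(repartitionedDF.rdd.partitions.size)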

Powers