I need to remove the empty partitions from a DataFrame.
I have two DataFrames, both created with sqlContext. They are loaded and joined as shown below:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Loading DataFrame 1
val csv1 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv1DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv1)
// Loading DataFrame 2
val csv2 = "s3n://xxxxx:xxxxxx@xxxx/xxx.csv"
val csv2DF = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load(csv2)
// Combining the two DataFrames on the join key
val combinedDF = csv1DF.join(csv2DF, csv1DF("column_1") === csv2DF("column_2"))
The number of partitions for combinedDF is now 200. From here I found that 200 is the default number of partitions Spark uses when performing joins.
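If I understand correctly, that 200 comes from the spark.sql.shuffle.partitions setting, whose default is 200. A minimal sketch of checking and lowering it before the join (the value 8 here is only an illustration, not a recommendation):

// spark.sql.shuffle.partitions controls how many partitions a join/aggregation produces.
// Lowering it before the join means fewer (possibly empty) partitions in combinedDF.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
val combinedDF = csv1DF.join(csv2DF, csv1DF("column_1") === csv2DF("column_2"))
println(combinedDF.rdd.partitions.length)  // 8 instead of 200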
In some cases the DataFrame/CSV is not big, so the join produces many empty partitions, which causes issues in later parts of the code.
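For reference, this is roughly how the empty partitions can be observed (a sketch that just counts the rows in each partition of combinedDF):

// Count rows per partition; many entries come back as 0 for a small input,
// confirming that most of the 200 partitions hold no data.
val rowsPerPartition = combinedDF.rdd
  .mapPartitions(iter => Iterator(iter.size), preservesPartitioning = true)
  .collect()
println(rowsPerPartition.count(_ == 0) + " empty partitions out of " + rowsPerPartition.length)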
So how can I remove these empty partitions?