
I'm trying to save a DataFrame to CSV, partitioned by id, using Spark 1.6 and Scala. The function partitionBy("id") doesn't give me the expected result.

My code is here :

validDf.write
       .partitionBy("id")
       .format("com.databricks.spark.csv")
       .option("header", "true")
       .option("delimiter", ";")
       .mode("overwrite")       
       .save("path_hdfs_csv")

My DataFrame looks like:
-----------------------------------------
| ID        |  NAME       |  STATUS     |
-----------------------------------------
|     1     |     N1      |     S1      |
|     2     |     N2      |     S2      |
|     3     |     N3      |     S1      |
|     4     |     N4      |     S3      |
|     5     |     N5      |     S2      |
-----------------------------------------

This code creates 3 default CSV partitions (part_0, part_1, part_2) that are not based on the column ID.

What I expect is a subdirectory (partition) for each id. Any help?

user4157124

1 Answer


spark-csv on Spark 1.6 (and any Spark version below 2.0) does not support partitionBy.
Your code would work on Spark >= 2.0.0.

For your Spark version, you need to build each CSV line yourself first and save the result as text (partitioning does work for the Spark text source):

import org.apache.spark.sql.functions.{col, concat_ws}

// concatenate all columns into a single comma-separated string column
val concat_col = concat_ws(",", df.columns.map(c => col(c)): _*)
// dataframe with 2 columns: the partition key and the csv line
val final_df = df.select(col("ID"), concat_col)
// save to hdfs, one subdirectory per ID
final_df.write.partitionBy("ID").text("path_hdfs_csv")
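The concat_ws(",") step above simply joins each row's column values with commas. A minimal sketch of the same per-row transformation in plain Scala (CsvLineDemo and toCsvLine are hypothetical names for illustration; no Spark required):

```scala
object CsvLineDemo {
  // joins column values with commas, like concat_ws(",", cols: _*) does per row
  def toCsvLine(values: Seq[String]): String = values.mkString(",")

  def main(args: Array[String]): Unit = {
    // one row of the example DataFrame: ID=1, NAME=N1, STATUS=S1
    println(toCsvLine(Seq("1", "N1", "S1"))) // prints 1,N1,S1
  }
}
```

Note that the written text files will contain the ID both in the directory name (ID=1/) and inside each line, since the concatenation includes every column.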
Dvir Samuel