
I'm trying to save a DataFrame as a CSV file partitioned by a column.

import org.apache.spark.sql.types._

val schema = new StructType(
  Array(
    StructField("ID", IntegerType, true),
    StructField("State", StringType, true),
    StructField("Age", IntegerType, true)
  )
)

val df = sqlContext.read.format("com.databricks.spark.csv")
  .options(Map("path" -> filePath))
  .schema(schema)
  .load()

df.write.partitionBy("State").format("com.databricks.spark.csv").save(outputPath)

But the output is not saved with any partition info; partitionBy seems to have been silently ignored, with no errors. The same thing works if I use the Parquet format:

df.write.partitionBy("State").parquet(outputPath)

What am I missing here?

– Vijay Krishna

1 Answer


partitionBy support has to be implemented by each data source individually, and as of now (v1.3) it is not supported in spark-csv. See: https://github.com/databricks/spark-csv/issues/123
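Until that is supported, one possible workaround is to emulate partitionBy in application code by writing one output directory per distinct value of the partition column. A minimal sketch (writePartitionedCsv is a hypothetical helper, not part of spark-csv, and it assumes the partition column has a small number of distinct values, since each value triggers its own filter-and-write pass):

import org.apache.spark.sql.DataFrame

// Hypothetical helper: emulate partitionBy for sources that don't support it.
def writePartitionedCsv(df: DataFrame, partCol: String, outputPath: String): Unit = {
  // partitionBy drops the partition column from the data files, so keep only the others.
  val dataCols = df.columns.filterNot(_ == partCol).map(df(_))

  // One filter-and-write pass per distinct partition value.
  df.select(partCol).distinct().collect().map(_.get(0)).foreach { value =>
    df.filter(df(partCol) === value)
      .select(dataCols: _*)
      .write
      .format("com.databricks.spark.csv")
      .save(s"$outputPath/$partCol=$value")
  }
}

writePartitionedCsv(df, "State", outputPath)

This mimics the column=value directory layout that partitionBy produces, but it scans the data once per distinct value, so it is only practical for low-cardinality columns.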

– zero323
  • There are no comments on if/when this will be available. In the interim, any thoughts on an efficient way of doing this in application code? – Cheeko Feb 09 '16 at 04:20
  • It looks like csv parsing will be a part of core Spark SQL in 2.x (see the sketch after these comments)... – zero323 Feb 17 '16 at 10:07
  • Can you provide a link to release notes or a blog post that talks about this? Would like more info. Thanks! – Cheeko Feb 17 '16 at 18:07
  • I'm afraid I cannot, but see the link I've provided in the comments to http://stackoverflow.com/a/35372282/1560062 – zero323 Feb 17 '16 at 23:07
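For reference, here is a minimal sketch of what this would look like once CSV is a built-in source in Spark 2.x, where partitionBy works with the csv writer out of the box (assuming a SparkSession named spark and the same schema as above):

// Spark 2.x: csv is a built-in format, no spark-csv package needed.
val df = spark.read
  .schema(schema)
  .csv(filePath)

df.write
  .partitionBy("State")
  .csv(outputPath)

This produces the column=value directory layout, e.g. outputPath/State=CA/part-*.csv.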