
I think I must be missing something obvious here, but I am having trouble writing Spark 2.2 DataFrames to parquet files and reading them back. I'm trying to do some simple data processing where I load a CSV file, nuke out a couple of column values, and then save the result. However, Spark keeps writing empty parquet files instead of my data.
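
For context, the "nuke out a couple of column values" step is just a couple of withColumn overwrites, roughly like the sketch below (the column names here are made up). The repro that follows leaves that step out entirely and still ends up with an empty parquet file.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical column names: the real job just blanks out a couple of fields like this
def scrub(df: DataFrame): DataFrame =
  df.withColumn("ssn", lit(null).cast("string"))
    .withColumn("email", lit(null).cast("string"))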

/* SparkApp running in local[*] mode */
val data = spark.read
    .option("header", "false")
    .option("inferSchema", false)
    .option("mode", "DROPMALFORMED")
    .option("delimiter", "|")
    .schema(My.schema)
    .csv("/path/to/my/data.csv")

data.count
//res: Long=100000

val parquetPath = "/path/to/data.parquet"
data.write.mode("overwrite").parquet(parquetPath)

val newData = spark.read.parquet(parquetPath)
/* newData will have the same (correct) schema as data, but: */

newData.count
//res: Long=0

I can see from the resulting file sizes that the parquet output contains no row data, so I presume I am doing something wrong with DataFrameWriter, but I'm not sure what. Do I have to collect or otherwise materialize the data before writing, or something like that?
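
To be concrete, this is the sort of thing I mean by materializing: caching the DataFrame and forcing an action before the write, then listing the part files that actually landed on disk (this assumes parquetPath points at the local filesystem).

import java.io.File

// Force the DataFrame to be computed before writing it out
data.cache()
data.count() // action that materializes the cached data

data.write.mode("overwrite").parquet(parquetPath)

// List the part files Spark produced under the output directory
new File(parquetPath)
  .listFiles()
  .filter(_.getName.startsWith("part-"))
  .foreach(f => println(s"${f.getName}: ${f.length} bytes"))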

Update

This looks to be a similar issue to this older question.

Derek Kaknes
