I think I must be missing something obvious here, but I am having trouble writing Spark 2.2 DataFrames to Parquet files and reading them back. I'm trying to do some simple data processing: load a CSV file, nuke out a couple of column values, and save the result. However, Spark keeps writing empty Parquet files instead of my data.
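For reference, the "nuking" step is just a couple of withColumn calls along these lines (the column names here are placeholders, not my real schema):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// placeholder column names; in my job I just overwrite two string columns with nulls
val cleaned = data
  .withColumn("someCol", lit(null).cast(StringType))
  .withColumn("otherCol", lit(null).cast(StringType))

The empty output happens even without that step, though, so the minimal repro below just reads the CSV and writes it straight back out: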
/* SparkApp running in local[*] mode */
val data = spark.read
.option("header", "false")
.option("inferSchema", false)
.option("mode", "DROPMALFORMED")
.option("delimiter", "|")
.schema(My.schema)
.csv("/path/to/my/data.csv")
data.count
//res: Long=100000
val parquetPath = "/path/to/data.parquet"
data.write.mode("overwrite").parquet(parquetPath)
val newData = spark.read.parquet(parquetPath)
/* newData will have the same (correct) schema as data, but: */
newData.count
//res: Long=0
I can see from the resulting file sizes that the Parquet output contains no row data, so I presume I am doing something wrong with the DataFrameWriter, but I'm not sure what. Do I have to collect or otherwise materialize the data before writing, or something?
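To be clear, by "materialize" I mean something along these lines (just a guess on my part; I haven't verified that it actually changes the behaviour):

// guess: force the DataFrame to be evaluated and cached before the write
data.persist()
data.count()   // triggers evaluation
data.write.mode("overwrite").parquet(parquetPath)
data.unpersist()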
Update
Looks to be a similar issue to this older question.