I think I must be missing something obvious here, but I am having trouble writing Spark 2.2 DataFrames to Parquet files and reading them back. I'm trying to do some simple data processing: load a CSV file, nuke out a couple of column values, and save the result. However, Spark keeps writing empty Parquet files instead of my data.
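For reference, the "nuking" step is just a couple of withColumn calls along these lines (the column names here are placeholders, not my real schema):

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// placeholder column names; in my job I just overwrite two string columns with nulls
val cleaned = data
  .withColumn("someCol", lit(null).cast(StringType))
  .withColumn("otherCol", lit(null).cast(StringType))

The empty output happens even without that step, though, so the minimal repro below just reads the CSV and writes it straight back out: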
/* SparkApp running in local[*] mode */
val data = spark.read
.option("header", "false")
.option("inferSchema", false)
.option("mode", "DROPMALFORMED")
.option("delimiter", "|")
.schema(My.schema)
.csv("/path/to/my/data.csv")
data.count
//res: Long=100000
val parquetPath = "/path/to/data.parquet"
data.write.mode("overwrite").parquet(parquetPath)
val newData = spark.read.parquet(parquetPath)
/* newData will have the same (correct) schema as data, but: */
newData.count
//res: Long=0
I can see from the resulting file sizes that the Parquet output contains no row data, so I presume I am doing something wrong with the DataFrameWriter, but I'm not sure what. Do I have to collect or otherwise materialize the data before writing, or something?
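To be clear, by "materialize" I mean something along these lines (just a guess on my part; I haven't verified that it actually changes the behaviour):

// guess: force the DataFrame to be evaluated and cached before the write
data.persist()
data.count()   // triggers evaluation
data.write.mode("overwrite").parquet(parquetPath)
data.unpersist()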
Update
Looks to be a similar issue to this older question.