I'm having a strange issue that I think might be a bug in Spark and/or pandas, but I'm not sure whether it might be user error on my part instead. It is similar to this bug, which relates to this resolved issue, but it's not quite the same.
Long story short, I have a PySpark dataframe with four columns, the fourth of which is a very long string (actually a list of key/value pairs which I will later unpack, but it's more efficient to store them as a string for this part of the process). When I do df.printSchema() I see this:
root
|-- attribute: string (nullable = true)
|-- id: long (nullable = true)
|-- label: long (nullable = true)
|-- featureString: string (nullable = true)
My goal is to write this to a table which (by default on my cluster) is stored in s3 as parquet. Then I will be reading each individual parquet file into Python later on a separate server with pd.read_parquet.
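For reference, the read on the pandas side looks roughly like this (the bucket and file names below are just placeholders, and I have s3fs installed so pandas can resolve s3:// paths):
import pandas as pd

# Placeholder path: one of the part files Spark wrote for the table.
path = 's3://my-bucket/warehouse/db_name.db/table_name1/part-00000-xxxx.snappy.parquet'

df = pd.read_parquet(path)  # pyarrow engine in my setup (hence the Arrow error later)
print(df.dtypes)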
So, when I run:
df.select('attribute', 'id', 'label', 'featureString')\
    .write.saveAsTable('db_name.table_name1', mode='overwrite')
Then I can do pd.read_parquet() on the individual files in s3 and it works fine. However, I actually want to have each file be all the rows for a given value of the attribute column, so I do:
df.select('attribute', 'id', 'label', 'featureString')\
    .repartition('attribute')\
    .write.saveAsTable('db_name.table_name2', mode='overwrite')
But then some (but not all) of the files fail when I try to read them with pd.read_parquet, raising ArrowIOError: Invalid parquet file. Corrupt footer. This is the exact error from the issue I linked above.
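In case it matters, here is roughly how I'm looping over the part files (the bucket and prefix are placeholders); some of them read fine and others raise the corrupt-footer error:
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
prefix = 'my-bucket/warehouse/db_name.db/table_name2'  # placeholder

for path in fs.ls(prefix):
    if not path.endswith('.parquet'):
        continue  # skip _SUCCESS and other non-data files
    try:
        pd.read_parquet('s3://' + path)
        print('ok      ', path)
    except Exception as e:
        print('FAILED  ', path, repr(e))  # ArrowIOError: ... Corrupt footer.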
It also seems like it's the bigger partitions (roughly 4 GB or so) that can't be read back in, which is also similar to that issue (it only happened with large files). However, that issue was about reading back files that had been written with pd.to_parquet(), whereas I'm writing with the PySpark write.saveAsTable() command.
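This is roughly how I'm checking the file sizes (again, bucket and prefix are placeholders); the files that fail to read seem to be the ones up around 4 GB:
import boto3

s3 = boto3.client('s3')
resp = s3.list_objects_v2(
    Bucket='my-bucket',                         # placeholder
    Prefix='warehouse/db_name.db/table_name2/'  # placeholder
)

# Fine without pagination here since there are only a handful of part files.
for obj in sorted(resp.get('Contents', []), key=lambda o: o['Size'], reverse=True):
    print('{:7.2f} GB  {}'.format(obj['Size'] / 1e9, obj['Key']))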
Anyway, I'm flummoxed by this. Any help would be much appreciated.
PS: I'm using Spark 2.3 and pandas 0.23 with Python 3.6.