I am trying to write a DataFrame like this to Parquet:

| foo | bar               |
|-----|-------------------|
| 1   | {"a": 1, "b": 10} |
| 2   | {"a": 2, "b": 20} |
| 3   | {"a": 3, "b": 30} |
I am doing it with Pandas and Fastparquet:
```python
import pandas as pd
import fastparquet

df = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
})
fastparquet.write('/my/parquet/location/toy-fastparquet.parq', df)
```
I would like to load Parquet in (py)Spark, and query the data with Spark SQL, like:
```python
df = spark.read.parquet("/my/parquet/location/")
df.registerTempTable('my_toy_table')
result = spark.sql("SELECT * FROM my_toy_table WHERE bar.b > 15")
```
My issue is that, even though fastparquet can read its Parquet file correctly (the `bar` field is correctly deserialized as a Struct), in Spark `bar` is read as a column of type String that just contains a JSON representation of the original structure:

```python
In [2]: df.head()
Out[2]: Row(foo=1, bar='{"a": 1, "b": 10}')
```
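For reference, this is a quick way to confirm what Spark inferred for the schema (a minimal check, assuming the same read as above):

```python
# Inspect the schema Spark inferred from the fastparquet output
df = spark.read.parquet("/my/parquet/location/")
df.printSchema()
# `bar` shows up as a plain string column here, not as a struct<a, b>
```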
I tried writing Parquet from PyArrow, but no luck there: `ArrowNotImplementedError: Level generation for Struct not supported yet`. I have also tried passing `file_scheme='hive'` to Fastparquet, but I got the same results. Changing Fastparquet serialization to BSON (`object_encoding='bson'`) produced an unreadable binary field.
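For completeness, this is roughly what those attempts looked like (a sketch of what I ran; the PyArrow output file name is just illustrative):

```python
import fastparquet
import pyarrow as pa
import pyarrow.parquet as pq

# PyArrow attempt: raises
# ArrowNotImplementedError: Level generation for Struct not supported yet
table = pa.Table.from_pandas(df)
pq.write_table(table, '/my/parquet/location/toy-pyarrow.parq')

# Fastparquet variants that did not help either:
fastparquet.write('/my/parquet/location/toy-fastparquet.parq', df,
                  file_scheme='hive')          # same result in Spark
fastparquet.write('/my/parquet/location/toy-fastparquet.parq', df,
                  object_encoding='bson')      # unreadable binary field in Spark
```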
[EDIT] I see the following approaches:
- [answered] Write Parquet from Spark
- [open] Find a Python library that implements Parquet's specification for nested types, and that is compatible with the way Spark reads them
- [answered] Read Fastparquet files in Spark with specific JSON de-serialization (I suppose this has an impact on performance); see the sketch after this list
- Do not use nested structures altogether
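For the JSON de-serialization option, this is a minimal sketch of what I have in mind, assuming an active `spark` session and the toy schema above (two integer fields `a` and `b`):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, LongType

# Schema of the struct that ended up as a JSON string in the Parquet file
bar_schema = StructType([
    StructField("a", LongType()),
    StructField("b", LongType()),
])

df = spark.read.parquet("/my/parquet/location/")
# Parse the JSON string back into a proper struct column
df = df.withColumn("bar", F.from_json("bar", bar_schema))
df.registerTempTable('my_toy_table')
result = spark.sql("SELECT * FROM my_toy_table WHERE bar.b > 15")
```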