
I am trying to write a DataFrame like this to Parquet:

| foo | bar               |
|-----|-------------------|
|  1  | {"a": 1, "b": 10} |
|  2  | {"a": 2, "b": 20} |
|  3  | {"a": 3, "b": 30} |

I am doing it with Pandas and Fastparquet:

import pandas as pd
import fastparquet

df = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
})

fastparquet.write('/my/parquet/location/toy-fastparquet.parq', df)

I would like to load Parquet in (py)Spark, and query the data with Spark SQL, like:

df = spark.read.parquet("/my/parquet/location/")
df.registerTempTable('my_toy_table')
result = spark.sql("SELECT * FROM my_toy_table WHERE bar.b > 15")

My issue is that, even though fastparquet can read its Parquet file back correctly (the bar field is correctly deserialized as a struct), in Spark bar is read as a column of type String that just contains a JSON representation of the original structure:

In [2]: df.head()                                                                                                                                                                                           
Out[2]: Row(foo=1, bar='{"a": 1, "b": 10}')
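The inferred schema confirms this (output roughly as follows):

df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)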

I tried writing Parquet from PyArrow, but no luck there: ArrowNotImplementedError: Level generation for Struct not supported yet. I have also tried passing file_scheme='hive' to Fastparquet, but I got the same results. Changing Fastparquet serialization to BSON (object_encoding='bson') produced an unreadable binary field.
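For reference, the PyArrow attempt looked roughly like this (a sketch; the toy-pyarrow.parq file name is just for illustration):

import pyarrow as pa
import pyarrow.parquet as pq

# write the same pandas DataFrame with PyArrow; this is the call that raised
# ArrowNotImplementedError for the dict/struct column at the time
table = pa.Table.from_pandas(df)
pq.write_table(table, '/my/parquet/location/toy-pyarrow.parq')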

[EDIT] I see the following approaches:

  • [answered] Write Parquet from Spark
  • [open] Find a Python library that implements Parquet's specification for nested types, and that is compatible with the way Spark reads them
  • [answered] Read Fastparquet files in Spark with specific JSON de-serialization (I suppose this has an impact on performance)
  • Do not use nested structures altogether (a sketch of this is shown right after this list)
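For that last option, a minimal sketch of flattening the nested field with pandas before writing (the bar_a / bar_b column names and the toy-flat.parq file name are only an illustration):

import pandas as pd
import fastparquet

df = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
})

# expand the dict column into flat columns bar_a, bar_b
flat = pd.concat(
    [df[["foo"]], pd.json_normalize(df["bar"].tolist()).add_prefix("bar_")],
    axis=1)
fastparquet.write('/my/parquet/location/toy-flat.parq', flat)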
  • That's indeed a limitation of Arrow at the moment, see https://issues.apache.org/jira/browse/ARROW-1644 – joris Feb 14 '20 at 13:39
  • Thank you @joris, my DF does not contain a mix of list and struct, just a struct field (I made the description clearer). However, it seems that this case isn't supported either at the moment. – Dario Chi Feb 14 '20 at 14:37
  • Have you tried passing the `schema` while loading the data? – Cesar A. Mostacero Feb 14 '20 at 16:19
  • @cesar-a-mostacero I tried, but it didn't work because I was missing the JSON decoding that Alexandros explained in the answer below – Dario Chi Feb 18 '20 at 08:56

1 Answer


You have at least 3 options here:

Option 1:

You don't need to use any extra libraries like fastparquet since Spark provides that functionality already:

import pandas as pd

pdf = pd.DataFrame({
    "foo": [1, 2, 3],
    "bar": [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
})

df = spark.createDataFrame(pdf)
df.write.mode("overwrite").parquet("/tmp/parquet1")

If you try to load your data with df = spark.read.parquet("/tmp/parquet1"), the schema will be:

StructType([
    StructField("foo", LongType(), True),
    StructField("bar", MapType(StringType(), LongType(), True), True)])

As you can see, in this case Spark retains the correct schema.
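With the map schema above, the filter from the question can be expressed directly, e.g. (output roughly as follows):

spark.read.parquet("/tmp/parquet1").where("bar['b'] > 15").show()
# +---+-----------------+
# |foo|              bar|
# +---+-----------------+
# |  2|[a -> 2, b -> 20]|
# |  3|[a -> 3, b -> 30]|
# +---+-----------------+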

Option 2:

If for any reason you still need to use fastparquet, then bar will be treated as a string. You can load bar as a string and then parse it with the from_json function. In your case we will handle the JSON as a Map(string, long). This is for convenience, since the data seems to be a sequence of key/value pairs, which a map represents perfectly:

from pyspark.sql.types import StringType, MapType, LongType
from pyspark.sql.functions import from_json

df = spark.read.parquet("/tmp/parquet1")

# bar is stored as a JSON string; parse it into a Map(string, long)
df.withColumn("bar", from_json("bar", MapType(StringType(), LongType()))).show()

# +---+-----------------+
# |foo|              bar|
# +---+-----------------+
# |  1|[a -> 1, b -> 10]|
# |  2|[a -> 2, b -> 20]|
# |  3|[a -> 3, b -> 30]|
# +---+-----------------+
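After the conversion the column is a proper map, which printSchema confirms (roughly):

df.withColumn("bar", from_json("bar", MapType(StringType(), LongType()))).printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: long (valueContainsNull = true)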

Option 3:

If your schema does not change and you know that each value of bar always has the same combination of fields (a, b), you can also convert bar into a struct:

from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("a", LongType(), True),
    StructField("b", LongType(), True)
])

df = df.withColumn("bar", from_json("bar", schema))

df.printSchema()

# root
#  |-- foo: long (nullable = true)
#  |-- bar: struct (nullable = true)
#  |    |-- a: long (nullable = true)
#  |    |-- b: long (nullable = true)

Example:

Then you can run your query with (the output shown here corresponds to the Map version from Option 2):

df.registerTempTable('my_toy_table')

spark.sql("SELECT * FROM my_toy_table WHERE bar.b > 20").show()
# or spark.sql("SELECT * FROM my_toy_table WHERE bar['b'] > 20")

# +---+-----------------+
# |foo|              bar|
# +---+-----------------+
# |  3|[a -> 3, b -> 30]|
# +---+-----------------+
  • I'd probably go with option 2 because it fits with what OP is trying to accomplish using the pyspark function from_json – murtihash Feb 16 '20 at 23:58
  • Thank you @alexandros-biratsis for this great answer! Option 1 would be perfect, but unfortunately the Parquet's producer doesn't run Spark. So it looks like Fastparquet does not implement the native Map specification of Parquet (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#nested-types); instead they serialize to JSON and I have to de-serialize in Spark... I wonder if another library exists that does that. If not, I agree that Option 2 is the way to go – Dario Chi Feb 17 '20 at 10:40
  • Hi Dario, unfortunately I don't really know if there is such a library supporting struct writings. – abiratsis Feb 17 '20 at 11:13