Write nested parquet format from Python

Question

I have a flat parquet file where one varchar columns store JSON data as a string and I want to transform this data to a nested structure, i.e. the JSON data becomes nested parquet. I know the schema of the JSON in advance if this is of any help.

Here is what I have "accomplished" so far:

Building the sample data

# load packages

import pandas as pd
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Create dummy data

# dummy data with JSON as string
person_data = {'Name':  ['Bob'],
        'Age': [25],
        'languages': "{'mother_language': 'English', 'other_languages': ['German', 'French']}"     
        }

# from dict to panda df
person_df = pd.DataFrame.from_dict(person_data)

# from panda df to pyarrow table
person_pat = pa.Table.from_pandas(person_df)

# save as parquet file
pq.write_table(person_pat, 'output/example.parquet')

Script proposal

# load dummy data
sample = pa.parquet.read_table('output/example.parquet')

# transform to dict
sample_dict = sample.to_pydict()
# print with indent for checking
print(json.dumps(sample_dict, sort_keys=True, indent=4))
# load json from string and replace string
sample_dict['languages'] = json.loads(str(sample_dict['languages']))
print(json.dumps(sample_dict, sort_keys=True, indent=4))
#type(sample_dict['languages'])

# how to keep the nested structure when going from dict —> panda df —> pyarrow table?
# save dict as nested parquet...

So, I here are my specific questions:

Is this approach the way to go or can it be optimised in any way? All the transformations between dict, df and pa table does not feel efficient, so happy to get educated here.
How can I preserve the nested structure when doing the dict —> df transformation? Or is this not needed at all?
What is the best way to write the nested parquet file? I have read Nested data in Parquet with Python and here fast parquet is mentioned for reading but with lacking writing capability - is there any working solution in the meantime?

Can you use PySpark for that? I think it should be much easier using it. If you want I can write a solution using PySpark and you can decide whether it's a good idea using it — Oscar Lopez M., Jul 10 '20 at 08:04
Writing nested data in your case doesn't seem to be supported, have u checked https://issues.apache.org/jira/browse/ARROW-1644? I'd suggest using Pyspark — Swetha Shanmugam, Jul 14 '20 at 14:59
Hi @OscarLopezM., sorry, I was out for a while. The solution using PySpark would be very much appreciated. Thanks a lot already! — Stephan Claus, Jul 15 '20 at 14:37

score 4 · Accepted Answer · answered Jul 15 '20 at 21:08

PySpark can do it in a simple way as I show below. The main benefit of using PySpark is the scalability of the infrastructure as data grows, but using plain Python that can be problematic as if you don't use a framework like Dask, you will need bigger machines to run it.

from pyspark.sql import HiveContext
hc = HiveContext(sc)

# This is a way to create a PySpark dataframe from your sample, but there are others 
nested_df = hc.read.json(sc.parallelize(["""
{'Name':  ['Bob'],
        'Age': [25],
        'languages': "{'mother_language': 'English', 'other_languages': ['German', 'French']}"     
        }
"""]))

# You have nested Spark dataframe here. This shows the content of the spark dataframe. 20 is the max number of rows to show on the console and False means don't cut the columns that don't fit on the screen (show all columns content)
nested_df.show(20,False)

# Writes to a location as parquet
nested_df.write.parquet('/path/parquet')

# Reads the file from the previous location
spark.read.parquet('/path/parquet').show(20, False)

The output of this code is

+----+-----+-----------------------------------------------------------------------+
|Age |Name |languages                                                              |
+----+-----+-----------------------------------------------------------------------+
|[25]|[Bob]|{'mother_language': 'English', 'other_languages': ['German', 'French']}|
+----+-----+-----------------------------------------------------------------------+

+----+-----+-----------------------------------------------------------------------+
|Age |Name |languages                                                              |
+----+-----+-----------------------------------------------------------------------+
|[25]|[Bob]|{'mother_language': 'English', 'other_languages': ['German', 'French']}|
+----+-----+-----------------------------------------------------------------------+

To answer your questions

I think this is more efficient in the sense that it doesn't matter how much data you have if you can use more executors in Spark
You can see that when the parquet file is loaded all dict and lists are preserved
It depends on the definition of "best", but I think that's a good option ;)

Hi Oscar, thanks for the answer. When running this in Jupyter I get an error: NameError: name 'sc' is not defined. Any recommendation how to solve this? — Stephan Claus, Jul 16 '20 at 11:28
Hi Stephan, that's no problem. The `sc` refers to the sparkContext object. It depends on the version of Spark you are using, you may have to use it as such or on a different way. Actually that part is only to simulate the input as I assume you have a json file instead of a string like that. Can you please try `hc = HiveContext(spark.sparkContext)` and see if that works? Please notice that I am replacing `sc` by `spark.sparkContext`. — Oscar Lopez M., Jul 16 '20 at 11:56
Actually if you want to read from a json file you should use `spark.read.json(path_to_your_file)` and it will load the dataframe for you directly, instead of pretending to have a row. — Oscar Lopez M., Jul 16 '20 at 11:59
Thanks for the swift reply. I replaced as you told and now getting a NameError: name 'spark' is not defined. I have running spark running on http://localhost:4040/environment/ but not sure if I need to import anything else. — Stephan Claus, Jul 16 '20 at 12:01
Another ways to read that single line can be seen https://stackoverflow.com/questions/49399245/read-json-file-as-pyspark-dataframe-using-pyspark. Please let me know if you have more questions on that. Thank you — Oscar Lopez M., Jul 16 '20 at 12:01
Are you using docker? Normally jupyter (if I am correct) works by default on the port 8888, so I don't think you are running it properly. Can you please try http://localhost:8888/ and see — Oscar Lopez M., Jul 16 '20 at 12:03
That http://localhost:4040/environment/ is the SparkUI we app that shows the configuration of the Spark environment you have. You can use it to see how many executors and features they have, how is the driver, among other multiple things — Oscar Lopez M., Jul 16 '20 at 12:10
Hi Oscar, thanks for your support, I made it work by adding the following lines to the script: sc = SparkContext.getOrCreate() spark = SparkSession(sc) hc = HiveContext(sc) — Stephan Claus, Jul 17 '20 at 12:13

Write nested parquet format from Python

1 Answers1