
I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. `array<map<string,string>>`). An example of the df would be:

import pandas as pd

df = pd.DataFrame.from_records([ 
    (1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100), 
    (5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540)], 
    columns=['uid', 'job_history', 'latency'] 
) 

When using engine='fastparquet', Dask reads all other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:

ArrowNotImplementedError: lists with structs are not supported.

A lot of googling has made it clear that reading a column with a nested array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:

  • Somehow tell dask/fastparquet to parse the column using the standard json library. The schema is simple, and that would do the job if possible (there's a sketch of this after the list)
  • See if I can possibly re-run the Spark job that generated the output and save it as something else, though this is hardly an acceptable solution since my company uses parquet everywhere
  • Turn the keys of the map into columns, breaking the data up across several columns with dtype list, and note that the data across these columns are related/map to each other by index (e.g. the elements at idx 0 across these keys/columns all came from the same source). This would work, but, frankly, it breaks my heart :(
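
For concreteness, here's a minimal sketch of what I mean by the first option (combined with the second): the to_json/json.loads round trip is just my assumption of how it would look, and spark_df, the output path, and the column names are placeholders based on the example above.

# Sketch only: on the Spark side, serialize the nested column to a JSON
# string before writing, so the Parquet file contains plain text.
# spark_df is the hypothetical Spark DataFrame from the original job.
from pyspark.sql import functions as F

spark_df.withColumn('job_history', F.to_json('job_history')) \
    .write.parquet('jobs_json.parquet')  # hypothetical output path

# On the Dask side the column is now an ordinary string, so either engine
# can read it, and json.loads restores the lists of dicts.
import json
import dask.dataframe as dd

ddf = dd.read_parquet('jobs_json.parquet')
ddf['job_history'] = ddf['job_history'].map(
    json.loads, meta=('job_history', 'object'))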

I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquet files frequently, and I'd hate to have to let go of using Dask because of this.

Jon.H

2 Answers


It would be fairer to say that pandas does not support non-simple types very well (currently). It may be the case that pyarrow will, without conversion to pandas, and that at some future point pandas will use these arrow structures directly.

Indeed, the most direct method that I can think of for you to use is to rewrite the columns as B/JSON-encoded text, and then load with fastparquet, specifying to load using B/JSON. You should get lists of dicts in the column, but the performance will be slow.
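
A minimal sketch of that round trip, under the assumption that fastparquet's object_encoding write option handles the JSON encoding and decoding (the frame mirrors the one in the question):

# Assumption: with object_encoding='json', fastparquet stores the column
# as JSON-encoded text and decodes it back to Python objects on read.
import pandas as pd
from fastparquet import ParquetFile, write

df = pd.DataFrame({
    'uid': [1, 5],
    'job_history': [
        [{'job_id': 1, 'started': '2019-07-04'}],
        [{'job_id': 3, 'started': '2015-06-04'}],
    ],
})

# Encode the complex column as JSON text rather than a nested Parquet type.
write('jobs.parquet', df, object_encoding={'job_history': 'json'})

# Reading it back yields lists of dicts, at the cost of JSON-parsing speed.
restored = ParquetFile('jobs.parquet').to_pandas()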

Note that the old project oamap and its successor awkward provide a way to iterate and aggregate over nested list/map/struct trees using python syntax, but compiled with Numba, such that you never need to instantiate the intermediate python objects. They were not designed for parquet, but have parquet compatibility, so they might just be useful to you.
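
As a hedged illustration with awkward (assuming a version that provides ak.from_parquet; the path and field names are carried over from the question):

# awkward reads nested Parquet columns into its own jagged-array
# structures instead of instantiating Python dicts.
import awkward as ak

arr = ak.from_parquet('jobs.parquet')  # hypothetical path
# Nested fields can be sliced without building intermediate objects:
job_ids = arr['job_history', 'job_id']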

mdurant

I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using Pandas; however, when I read using pyspark and then convert to pandas, the data at least loads:

import pyspark

# Start (or reuse) a local Spark session and let Spark's own reader
# handle the nested types.
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)

# Convert to pandas; the nested values arrive as pyspark Row objects.
pdf = df.toPandas()

and the offending field is now rendered as a pyspark Row object, which has some structured parsing, but you would probably have to write custom pandas functions to extract data from it:

>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]

An individual record can be rendered as a dictionary: simply call .asDict(recursive=True) on the Row object you would like.
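
For example (the field path is the hypothetical one from the output above):

# Convert the first nested Row, and everything inside it, to plain dicts.
first_view = pdf["user"][0]["sessions"][0]["views"][0]
record = first_view.asDict(recursive=True)
# record is now a regular dict, e.g. {'is_search': True, 'price': None, ...}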

Unfortunately, it takes ~5 seconds to start the SparkSession context, and every spark action also takes much longer than pandas operations (for small to medium datasets), so I would greatly prefer a more python-native option.

steeles
  • I also tried reading in with Spark and casting to pd.DataFrame, but in my case it didn't work out. Calling `toPandas()` would get me the same results as trying to read with fastparquet; I would end up with a column of `None`s for any column with a complex type. – Jon.H Nov 06 '19 at 19:11
  • Did you find a solution yet? I am dealing with the same thing. Is there any other way to read the parquet files without converting them to a pandas dataframe? @Jon.H – Nilan Saha Aug 24 '20 at 17:42
  • @NilanSaha what we ended up doing is transforming the column into a JSON string, then, on read, calling `json.loads()` to convert it back. It's not at all pretty, but it does work. You can follow the progress on this problem by watching the relevant ticket here: https://issues.apache.org/jira/browse/ARROW-1644 . It looks like there's active development. Hopefully we'll have a solution soon. – Jon.H Aug 24 '20 at 19:47
  • I am sorry, I did not get that. Can you elaborate? Did you end up using PySpark, and do you happen to have a POC? @Jon.H – Nilan Saha Aug 24 '20 at 20:12