
I have a file that has one JSON object per line. Here is a sample:

{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

I want to create a parquet file with columns such as:

product.id, product.price, product.specs.voltage, product.specs.color, user

I know that Parquet has a nested encoding using the Dremel algorithm, but I haven't been able to use it in Python (not sure why).

I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is json data -> dask -> parquet -> pandas, although if anyone has a simple example of creating and reading these nested encodings in parquet using Python I think that would be good enough :D
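
For the dotted flat columns above, one workaround is to flatten each record before it ever reaches Parquet. Here is a minimal sketch, not the nested Dremel encoding: it assumes pandas >= 1.0 (for pd.json_normalize), a pyarrow or fastparquet engine behind to_parquet, and a hypothetical data.jsonl input file.

import json
import pandas as pd

# read newline-delimited JSON and flatten nested keys into dotted column names
with open("data.jsonl") as f:
    records = [json.loads(line) for line in f]

df = pd.json_normalize(records)
print(df.columns.tolist())
# e.g. ['product.id', 'product.price', 'product.specs.voltage',
#       'product.specs.color', 'user']  (column order may vary)

df.to_parquet("flat.parquet")
df_back = pd.read_parquet("flat.parquet")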

EDIT

So, after digging in the PRs I found this: https://github.com/dask/fastparquet/pull/177

which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?

Daniel Severo
  • fastparquet can probably *read* a parquet file structured as above, but not write one. This is because a pandas dataframe (the target structure) would rarely look like that. You could flatten the schema yourself into a pandas dataframe, and any repeated values (lists, dicts) you could encode using JSON (object_encoding={'specs': 'JSON'}) on write; see the sketch after these comments. – mdurant Jul 27 '17 at 13:07
  • (NB: writing of MAP and LIST parquet types is doable for fastparquet, but seemed to me like more effort than demand can justify) – mdurant Jul 27 '17 at 13:18
  • Did something like that. I'll post an example as an answer here soon. Thanks! – Daniel Severo Jul 29 '17 at 17:36
  • @DanielSevero Out of curiosity, did you ever find a solution? – Pylander Jan 09 '18 at 00:22
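
A minimal sketch of the flatten-and-JSON-encode approach mdurant describes above. The column layout and the lowercase 'json' value are assumptions based on fastparquet's object_encoding parameter, not code from the thread:

import pandas as pd
from fastparquet import write, ParquetFile

# one flattened row; 'product.specs' stays a Python dict in the frame
df = pd.DataFrame({
    "product.id": ["abcdef"],
    "product.price": [19.99],
    "product.specs": [{"voltage": "110v", "color": "white"}],
    "user": ["Daniel Severo"],
})

# store the dict column as a JSON-encoded string inside the parquet file
write("flat.parquet", df,
      object_encoding={"product.id": "utf8", "product.specs": "json", "user": "utf8"})

# the JSON-encoded column should come back as Python dicts
print(ParquetFile("flat.parquet").to_pandas())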

3 Answers


Implementing the conversions on both the read and write path for arbitrary Parquet nested data is quite complicated to get right: it means implementing the shredding and reassembly algorithm with the associated conversions to some Python data structures. We have this on the roadmap in Arrow / parquet-cpp (see https://github.com/apache/parquet-cpp/tree/master/src/parquet/arrow), but it has not been completed yet (only simple structs and lists/arrays are supported now). It is important to have this functionality because other systems that use Parquet, like Impala, Hive, Presto, Drill, and Spark, have native support for nested types in their SQL dialects, so we need to be able to read and write these structures faithfully from Python.

This can be analogously implemented in fastparquet as well, but it's going to be a lot of work (and test cases to write) no matter how you slice it.

I will likely take on the work (in parquet-cpp) personally later this year if no one beats me to it, but I would love to have some help.
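
For reference, the native Arrow nested types the answer mentions can be built by hand in today's pyarrow. A minimal sketch (this API is more recent than the answer):

import pyarrow as pa

# a struct<voltage: string, color: string> array built from Python dicts
specs = pa.array(
    [{"voltage": "110v", "color": "white"}],
    type=pa.struct([("voltage", pa.string()), ("color", pa.string())]),
)
print(specs.type)  # struct<voltage: string, color: string>
print(specs[0])    # {'voltage': '110v', 'color': 'white'}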

Wes McKinney
  • Awesome! I found a workaround for now (probably not the smartest way). I'm gonna create a .ipynb with a working example of my solution. I'm sure more people have this issue. Do you have any examples of how to use the current nested functionalities with `pyarrow`? – Daniel Severo Jul 29 '17 at 17:33
  • @wes-mckinney : if someone wanted to write a structures parquet data-set like this, then what do you suppose the input data looks like? Can arrow handle such nested things, or are we talking python objects (dicts)? – mdurant Jul 29 '17 at 20:34
  • Arrow has native lists (arrays), structs, maps, etc., so you could convert to Arrow nested data before writing (we'll need some functions to simplify converting between built-in Python data structures and Arrow data) – Wes McKinney Jul 30 '17 at 13:59
  • @DanielSevero there is limited support for reading lists and structs (e.g. Arrow structs back to Python dicts was implemented here https://github.com/apache/arrow/commit/e268ce87e5d8f93ff6ecc871d218a58605b9692f) -- I don't think the write path is complete. Someone will need to spend a few days implementing complete nested data support and adding documentation. – Wes McKinney Jul 30 '17 at 14:13
  • @WesMcKinney Anything new on this, Wes? Thanks – Czechnology Feb 14 '18 at 09:37
  • Still waiting on some development help with this. I expect it to be completed this year (i.e. in 2018) but not sure when – Wes McKinney Feb 15 '18 at 15:26
  • @WesMcKinney Is this still on the project roadmap? Thx very much – Pylander Jan 25 '19 at 01:18
  • @WesMcKinney Was this ever completed? – bradchattergoon Dec 05 '19 at 23:36
  • According to the links below, this was not implemented as of 2020-02-23, but it was planned for 2020: https://lists.apache.org/thread.html/r831c41a3f9b353cfc187dc2092515fe707d4ebf13cbb727524f1c600%40%3Cdev.arrow.apache.org%3E https://issues.apache.org/jira/browse/ARROW-1644?src=confmacro – keiv.fly Feb 23 '20 at 16:42

This is not exactly the right answer, but it can help.

We could try to convert your dictionary to a pandas DataFrame and then write it to a .parquet file:

import pandas as pd
from fastparquet import write, ParquetFile

d = {
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}

df_test = pd.DataFrame(d)
write('file_test.parquet', df_test)

This raises an error:

ValueError: Can't infer object conversion type: 0                                   abcdef
1                                    19.99
2    {'voltage': '110v', 'color': 'white'}
Name: product, dtype: object

So an easy solution is to convert the product column to lists:

df_test['product'] = df_test['product'].apply(lambda x: [x])

# this should now work
write('file_test.parquet', df_test)

# and now compare the file with the initial DataFrame
ParquetFile('file_test.parquet').to_pandas().explode('product')
   index                                product           user
0     id                                 abcdef  Daniel Severo
1  price                                  19.99  Daniel Severo
2  specs  {'voltage': '110v', 'color': 'white'}  Daniel Severo
igorkf

I believe this feature has finally been added in arrow/pyarrow 2.0.0:

https://issues.apache.org/jira/browse/ARROW-1644

https://arrow.apache.org/docs/python/json.html
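
A minimal round-trip sketch with pyarrow >= 2.0.0 (the data.jsonl file name is an assumption; pyarrow.json.read_json infers nested struct columns from newline-delimited JSON):

import pyarrow.json as pj
import pyarrow.parquet as pq

# newline-delimited JSON; nested objects become struct columns
table = pj.read_json("data.jsonl")
print(table.schema)
# product: struct<id: string, price: double, specs: struct<...>>, user: string

# nested structs now survive a Parquet round trip
pq.write_table(table, "nested.parquet")
print(pq.read_table("nested.parquet").to_pydict())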

Pylander