I have a file with one JSON object per line. Here is a sample record (pretty-printed here for readability):
{
    "product": {
        "id": "abcdef",
        "price": 19.99,
        "specs": {
            "voltage": "110v",
            "color": "white"
        }
    },
    "user": "Daniel Severo"
}
I want to create a parquet file with columns such as:
product.id, product.price, product.specs.voltage, product.specs.color, user
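To make the target layout concrete, here is a minimal sketch of the flattening I have in mind, using only the standard library. The helper name `flatten` is my own (hypothetical), not a dask or fastparquet function, and it sidesteps Parquet's nested encoding by turning nested keys into dotted column names.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`.

    Hypothetical helper for illustration; not part of dask or fastparquet.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One JSON object per line, as in my input file.
lines = [
    '{"product": {"id": "abcdef", "price": 19.99, '
    '"specs": {"voltage": "110v", "color": "white"}}, '
    '"user": "Daniel Severo"}',
]

rows = [flatten(json.loads(line)) for line in lines]
print(list(rows[0]))
```

The resulting list of flat dicts could then be handed to something like `pandas.DataFrame(rows)`, but that only gives me flat columns, not the nested encoding I'm asking about.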
I know that Parquet supports nested encoding via the Dremel algorithm, but I haven't been able to use it from Python (not sure why).
I'm a heavy pandas and dask user, so the pipeline I'm trying to construct is JSON data -> dask -> Parquet -> pandas, although a simple example of creating and reading these nested encodings in Parquet using Python would be good enough :D
EDIT
So, after digging through the PRs I found this: https://github.com/dask/fastparquet/pull/177
which is basically what I want to do. However, I still can't make it work all the way through. How exactly do I tell dask/fastparquet that my product column is nested?
- dask version: 0.15.1
- fastparquet version: 0.1.1