
Let's say I have a deeply nested arrow table like:

pyarrow.Table
arr: struct<arr: struct<a: list<item: int64 not null> not null, b: list<item: int64 not null> not null> not null>
  child 0, arr: struct<a: list<item: int64 not null> not null, b: list<item: int64 not null> not null> not null
      child 0, a: list<item: int64 not null> not null
          child 0, item: int64 not null
      child 1, b: list<item: int64 not null> not null
          child 0, item: int64 not null
----
arr: [
  -- is_valid: all not null
  -- child 0 type: struct<a: list<item: int64 not null> not null, b: list<item: int64 not null> not null>
    -- is_valid: all not null
    -- child 0 type: list<item: int64 not null>
[[1,2,3],[1,2,3],[1,2,3],[1,2,3]]
    -- child 1 type: list<item: int64 not null>
[[3,4,5],[3,4,5],[3,4,5],[3,4,5]]]

I can write this to a parquet dataset with pyarrow.dataset.write_dataset.

With the now-deprecated legacy pyarrow.parquet API, I could choose to read a selection of one or more of the leaf nodes like this:

import pyarrow.parquet as pq

pf = pq.ParquetDataset("temp.parq/")
pf.read(columns=["arr.arr.a.list.item"])

How do I achieve this with the pyarrow.dataset API? As far as I can tell, I can only select top-level fields, in this case ["arr"], which would get me both leaf nodes, not just one.

Whilst .to_table(columns=) is promising, it loses the original record structure of the data, so that if I needed to pick more than one leaf out of many, they would all be independent:

Compare:

> ds.to_table().to_pydict()
{'arr': [{'arr': {'a': [1, 2, 3], 'b': [3, 4, 5]}}, ...

> ds.to_table(columns={'leaf': pyarrow.dataset.field('arr', 'arr', 'a')}).to_pydict()
{'leaf': [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]}

leaf: list<item: int64 not null> not null
    child 0, item: int64 not null

but it should be

{'arr': [{'arr': {'a': [1, 2, 3]}}, ..

arr: struct<arr: struct<a: list<item: int64 not null> not null, b: list<item: int64 not null> not null> not null>
  child 0, arr: struct<a: list<item: int64 not null> not null, b: list<item: int64 not null> not null> not null
    child 0, a: list<item: int64 not null> not null
      child 0, item: int64 not null
mdurant

2 Answers


Your schema is the following:

import pyarrow as pa

schema = pa.schema(
    [
        pa.field(
            "arr",
            pa.struct(
                [
                    pa.field(
                        "arr",
                        pa.struct(
                            [
                                pa.field(
                                    "a",
                                    pa.list_(pa.int64())),
                                pa.field(
                                    "b",
                                    pa.list_(pa.int64()))
                            ]
                        ))
                ]
            ))
    ])

and you want to remove b, so the sub-schema is this one:

sub_schema = pa.schema(
    [
        pa.field(
            "arr",
            pa.struct(
                [
                    pa.field("arr",
                             pa.struct(
                                 [
                                     pa.field("a", pa.list_(pa.int64())),
                                 ]
                             ))
                ]
            ))
    ])

I can't think of a way to achieve this directly with dataset: to_table with a field expression flattens the extracted field, as you observed.

Also, changing the schema of the dataset doesn't seem to work: ds.replace_schema(sub_schema) throws an ArrowTypeError.

But you can load the dataset into a table and cast it, which works:

table = ds.to_table()
table = table.cast(sub_schema)  # cast returns a new table

Another option is to provide the sub_schema when loading the dataset:

pyarrow.dataset.dataset('./my_ds/', schema=sub_schema).to_table()
0x26res
  • Thanks. Also useful, but unfortunately not the equivalent of what deprecated pyarrow.parquet can do (i.e., not load the unnecessary columns at all). – mdurant Sep 14 '22 at 14:19
  • You could provide the sub_schema when loading (see my edit). – 0x26res Sep 14 '22 at 14:49
  • Great! How do you select the set of leaf columns from the schema to make the sub-schema? You have done it by hand. – mdurant Sep 14 '22 at 14:55
  • I think you'd have to write a function that returns a modified version of the original schema, but it's not trivial. – 0x26res Sep 14 '22 at 15:14
  • :( a real shame. – mdurant Sep 14 '22 at 15:30
  • This answer contains the way forward, so I will accept it. Perhaps we can edit it to be a bit clearer. Also, for my specific use case, awkward-array can easily manipulate schemas and produce the paired arrow schema for me. – mdurant Sep 15 '22 at 13:53

The dataset field expression supports multiple names (or a tuple of names) to reference a nested field.

import pyarrow.dataset as ds
 
dataset.to_table(columns={'leaf': ds.field('arr', 'a', ...)})
A. Coady