I'm trying to create a pyarrow.StructArray with missing values.

It works fine when I use pyarrow.array, passing tuples that represent my records:

>>> pyarrow.array(
    [
        None,
        (1, "foo"),
    ],
    type=pyarrow.struct(
        [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
    )
)
-- is_valid:
  [
    false,
    true
  ]
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: string
  [
    "",
    "foo"
  ]

But I want to use StructArray.from_arrays, and as far as I can tell there is no way to provide a mask for missing values:

>>> pyarrow.StructArray.from_arrays(
    [
        [None, 1],
        [None, "foo"]
    ],
    fields=[pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
-- is_valid: all not null
-- child 0 type: int64
  [
    null,
    1
  ]
-- child 1 type: string
  [
    null,
    "foo"
  ]

Is there a way to create a StructArray from arrays while specifying a mask of missing values? Or is there a way to apply the mask afterwards?

0x26res
  • The C++ API allows this (it takes a null bitmap), but the Python API doesn't expose it at the moment (it always passes an empty buffer). Can you create a JIRA ticket? – Pace May 07 '21 at 03:13
  • Thanks, I've created a jira and I'll stick to `pa.array` for now. https://issues.apache.org/jira/browse/ARROW-12677 – 0x26res May 07 '21 at 07:48

1 Answer

It would indeed be nice to be able to pass a mask to StructArray.from_arrays (see https://issues.apache.org/jira/browse/ARROW-12677; thanks for opening the issue).

But for now, a possible workaround might be to use the lower-level StructArray.from_buffers:

import pyarrow

struct_type = pyarrow.struct(
    [pyarrow.field('col1', pyarrow.int64()), pyarrow.field("col2", pyarrow.string())]
)
# Child arrays; their inferred types (int64, string) match the struct fields
col1 = pyarrow.array([None, 1])
col2 = pyarrow.array([None, "foo"])

Next, create a boolean pyarrow array from the inverted mask and take its data buffer to use as the validity bitmap:

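import numpy as np

mask = np.array([True, False])  # True marks a missing struct
validity_mask = pyarrow.array(~mask)  # Arrow validity uses True for "valid"
validity_bitmask = validity_mask.buffers()[1]  # packed data bits of the boolean array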

We can then pass validity_bitmask as the first buffer in from_buffers to mark the missing values in the StructArray:

>>> pyarrow.StructArray.from_buffers(struct_type, len(col1), [validity_bitmask], children=[col1, col2])
<pyarrow.lib.StructArray object at 0x7f8b560fa2e0>
-- is_valid:
  [
    false,
    true
  ]
-- child 0 type: int64
  [
    null,
    1
  ]
-- child 1 type: string
  [
    null,
    "foo"
  ]
joris