
I ran into the following issue while writing some tests to demonstrate the usefulness of a pure pyarrow UDF in pyspark, as compared to always going through pandas.

import awkward
import numpy
import pandas
import pyarrow

counts = numpy.random.randint(0,20,size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)

def awk_arrow(col):
    jagged = awkward.fromarrow(col)
    jagged2 = jagged**2
    return awkward.toarrow(jagged2)

def pds_arrow(col):
    pds = col.to_pandas()
    pds2 = pds**2
    return pyarrow.Array.from_pandas(pds2)

out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)

out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)

type(out3)
type(out4)

yields

<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>

and

out3 == out4

yields (at the end of the stack trace):

AttributeError: no column named 'reshape' 

Looking at the arrays:

print(out3);print();print(out4);

[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]

[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]

You can see that the contents and shapes of the arrays are the same, but the arrays can't be compared to each other at face value, which is very counterintuitive. Is there a good reason for dense jagged structures with no nulls to be represented as a BitMaskedArray?

1 Answer


All data in Arrow are nullable (at every level), and they use bit masks (as opposed to byte masks) to specify which elements are valid. The specification allows columns of entirely valid data to not write the bitmask, but not every writer takes advantage of that freedom. Quite often, you see unnecessary bitmasks.
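To make the bit mask versus byte mask distinction concrete, here is a minimal sketch using only numpy. The `valid` flags are made up for illustration, and the packing mimics Arrow's validity bitmap convention (one bit per element, least-significant bit first, 1 meaning valid); it does not call pyarrow itself.

```python
import numpy

# Hypothetical validity flags for a 10-element column: True means "valid".
valid = numpy.array([True, True, False, True, True, True, True, True, True, False])

# A byte mask spends one whole byte per element.
bytemask = valid.astype(numpy.uint8)

# A bit mask spends one bit per element, least-significant bit first.
# numpy pads the final byte with zero bits.
bitmask = numpy.packbits(valid, bitorder="little")

print(bytemask.nbytes, bitmask.nbytes)   # 10 2

# Unpacking and trimming the padding recovers the original flags.
recovered = numpy.unpackbits(bitmask, bitorder="little")[: len(valid)].astype(bool)
```

Note that the padding bits in the last byte are zero even though they do not correspond to any element, which is why a check for "all valid" cannot simply compare every byte to 0xFF.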

When it encounters a bitmask, such as here, awkward inserts a BitMaskedArray.

It could be changed to check whether the mask is unnecessary and skip that step, but that adds an operation that scales with the size of the dataset. The cost is likely insignificant in most cases, since bitmasks are 8 times faster to check than bytemasks. It's also a little complicated: the last byte may be incomplete if the length of the dataset is not a multiple of 8. One would need to check those bits individually, but the rest of the mask could be checked in bulk (maybe even cast as int64 to check 64 flags at a time).
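A rough sketch of such a check, assuming the mask is available as a numpy uint8 array with least-significant-bit-first ordering and 1 meaning valid. The function name `bitmask_all_valid` is hypothetical and is not part of awkward. It compares the complete bytes in bulk and handles the trailing partial byte with a small integer mask instead of bit-by-bit checks; the int64 trick is left out for simplicity.

```python
import numpy

def bitmask_all_valid(bitmask, length):
    """Return True if the first `length` bits (LSB-first) are all set."""
    bitmask = numpy.asarray(bitmask, dtype=numpy.uint8)
    nfull, nrest = divmod(length, 8)

    # Bulk check: every complete byte must have all eight bits set.
    if not numpy.all(bitmask[:nfull] == 0xFF):
        return False

    if nrest:
        # Only the low `nrest` bits of the last byte describe real elements;
        # the remaining bits are padding and must be ignored.
        needed = (1 << nrest) - 1
        return (int(bitmask[nfull]) & needed) == needed
    return True

# Ten elements, none of them null: bytes are [0xFF, 0b00000011].
dense = numpy.packbits(numpy.ones(10, dtype=bool), bitorder="little")
print(bitmask_all_valid(dense, 10))    # True

# Same length, but element 9 (in the partial last byte) is null.
flags = numpy.ones(10, dtype=bool)
flags[9] = False
sparse = numpy.packbits(flags, bitorder="little")
print(bitmask_all_valid(sparse, 10))   # False
```

When a check like this returns True, the mask carries no information and the wrapping in a masked array could be skipped.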

Jim Pivarski
  • Got it. Still, regardless of the mask, I think I should be able to compare the raw JaggedArray and the BitMaskedArray without having to ask for `content.content.content` on the BitMaskedArray. I think this is a rather deep edge case, since I don't think many people will read Arrow data written from pandas and Arrow data written from awkward in the same job. But I'm sure someone will eventually do it. – Lindsey Gray Dec 05 '19 at 19:36
  • That's why `awkward1.Array` will hide the distinction between equivalent types like this. – Jim Pivarski Dec 06 '19 at 20:34