Content, starts and stops of ChunkedArray - built from lazyarray

Question

I have some code that works fine on JaggedArrays by extracting content, starts, and stops, but I would like to run the same code on ChunkedArrays obtained from uproot lazyarrays. Unfortunately, I get the following error:

~/.local/lib/python3.7/site-packages/awkward/array/base.py in __getattr__(self, where)
    254                     raise AttributeError("while trying to get column {0}, an exception occurred:\n{1}: {2}".format(repr(where), type(err), str(err)))
    255             else:
--> 256                 raise AttributeError("no column named {0}".format(repr(where)))
    257 
    258     def __dir__(self):

AttributeError: no column named 'starts'

Is there any way to make this work?

score 1 · Answer 1 · answered Dec 11 '19 at 01:50

It's gradually becoming clear that properties like starts, stops, and content should not be public—they should be treated as internals. They're part of what makes a JaggedArray work, but not properties of any jagged data (any arrays whose type is [0, N) -> [0, inf) -> X for some array length N and inner type X).

You would like to not care that your jagged data is a ChunkedArray, and for many operations, like flatten() and counts, you don't have to: they work equally well on a ChunkedArray as on a JaggedArray. But starts, stops, and content won't ever be like that: they don't have meaning beyond the specific implementation of a JaggedArray.

Consider, for instance, this ChunkedArray:

>>> array = awkward.ChunkedArray([awkward.fromiter([[1, 2, 3], [], [4, 5]]),
...                               awkward.fromiter([[100], [200, 300]])])
>>> array
<ChunkedArray [[1 2 3] [] [4 5] [100] [200 300]] at 0x796a969db6a0>

We can get at the starts and stops of each chunk, but it's probably not what you want:

>>> [x.starts for x in array.chunks]
[array([0, 3, 3]), array([0, 1])]
>>> [x.stops for x in array.chunks]
[array([3, 3, 5]), array([1, 3])]

(In your case, you have lazy arrays, which are ChunkedArray of VirtualArray, so you would have to do x.array.starts instead of x.starts, to unpack the VirtualArray. This is another property that should probably be internal, for the same reasons.)

Notice that the starts of the second chunk start over at 0? That's because the indexes are relative to the current chunk (so that chunks can be processed in parallel). If you were using starts as part of a data analysis, that would be an important correction. (You could add stops[-1] if len(stops) > 0 else 0 of the previous chunk to the current chunk to make the numbers global, accumulating that shift from one chunk to the next.)
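
The parenthetical correction above can be sketched with plain Python lists standing in for the NumPy arrays, so that it runs without awkward installed. The per-chunk values are copied from the output above; `globalize` is a hypothetical helper name, not part of awkward, and it assumes that each chunk's content is contiguous and in order, so that the last stop of a chunk equals the length of its content.

```python
# Per-chunk starts and stops, copied from the output above.
chunk_starts = [[0, 3, 3], [0, 1]]
chunk_stops = [[3, 3, 5], [1, 3]]


def globalize(chunk_starts, chunk_stops):
    """Shift chunk-relative indexes so they address the concatenated content.

    Hypothetical helper: assumes the last stop of each chunk equals the
    length of that chunk's content.
    """
    starts, stops = [], []
    shift = 0  # total content length of all chunks seen so far
    for s, e in zip(chunk_starts, chunk_stops):
        starts.extend(x + shift for x in s)
        stops.extend(x + shift for x in e)
        # A chunk with no entries contributes nothing to the shift.
        shift += e[-1] if len(e) > 0 else 0
    return starts, stops


global_starts, global_stops = globalize(chunk_starts, chunk_stops)
print(global_starts)
print(global_stops)
```

The shift is a running total rather than just the previous chunk's own last stop, which matters as soon as there are more than two chunks.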

A perhaps better alternative is to construct meaningful offsets from counts. The ChunkedArray has a functioning counts:

>>> array.counts
array([3, 0, 2, 1, 2])

The counts are the lengths of the subarrays in the jagged data. They are the differences between consecutive offsets, and the offsets are the starts and stops overlapped into a single array:

>>> offsets = numpy.empty(len(array) + 1, dtype=int)
>>> offsets[0] = 0
>>> numpy.cumsum(array.counts, out=offsets[1:])
array([3, 3, 5, 6, 8])
>>> offsets
array([0, 3, 3, 5, 6, 8])
>>> starts, stops = offsets[:-1], offsets[1:]
>>> starts
array([0, 3, 3, 5, 6])
>>> stops
array([3, 3, 5, 6, 8])
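
As a quick consistency check, the relationship between counts and offsets can be verified in both directions. This is only a sketch: plain Python lists are used in place of NumPy arrays, and the counts are copied from the output above.

```python
from itertools import accumulate

counts = [3, 0, 2, 1, 2]

# Offsets are the running sum of counts, with a leading 0.
offsets = [0] + list(accumulate(counts))
starts, stops = offsets[:-1], offsets[1:]

# Counts are recovered as the differences stop - start.
recovered = [stop - start for start, stop in zip(starts, stops)]

print(offsets)
print(recovered == counts)
```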

You can use these as starts and stops, but only if you have some kind of "content" for which the starts and stops are completely contiguous. That's not guaranteed for a JaggedArray's content, but flatten() will do that for you:

>>> content = array.flatten()
>>> content
<ChunkedArray [1 2 3 ... 100 200 300] at 0x796aa12b1940>

Now, for instance, the subarray at index 3 is

>>> content[starts[3]:stops[3]]
<ChunkedArray [100] at 0x796a96a3b978>

This is precisely why Awkward 0's ChunkedArray, JaggedArray, etc. are going to become internal "for experts only" classes in Awkward 1. The user interface in Awkward 1 will have a single awkward.Array class with a type to know if it's jagged or not. Whether it's made out of chunks or something else will be an implementation detail.

This makes perfect sense. To be more precise, at the moment, I need `starts`, `stops`, and `content` because I need to apply operations that are not `numpy ufunc`. For this reason, I extract the content of the array, apply my operation, and then build back the resulting array using: ``` awkward.JaggedArray(content=out, starts=starts, stops=stops) ``` Which solution do you suggest for this specific situation? — Nicolò Foppiani, Dec 11 '19 at 01:58
For that specific situation, you can use the `starts`, `stops`, and `content` that I derived above. The result of that will be a `JaggedArray` whose `content` is a `ChunkedArray`, which _should_ work. I hope it doesn't give you any trouble. If you need to make the `ChunkedArray` non-chunked (to pass it through a non-ufunc, for instance), concatenate the chunks: `np.concatenate(content.chunks)` or `np.concatenate([x.array for x in content.chunks])`. — Jim Pivarski, Dec 11 '19 at 02:41
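
The workflow discussed in these comments (flatten, apply an operation that is not a ufunc, rebuild the jagged structure) can be sketched without awkward, using plain Python lists. `my_operation` is a hypothetical stand-in for the asker's function and is assumed to return one output per input element, in the same order. With awkward 0, the last step would instead be the `awkward.JaggedArray(content=out, starts=starts, stops=stops)` call quoted in the comment.

```python
from itertools import accumulate, chain

# The jagged data from the answer, split into two chunks.
chunks = [[[1, 2, 3], [], [4, 5]], [[100], [200, 300]]]

# Flatten across chunks: the length of each subarray and one contiguous content.
subarrays = list(chain.from_iterable(chunks))
counts = [len(sub) for sub in subarrays]
content = list(chain.from_iterable(subarrays))

# Global offsets derived from counts, as in the answer.
offsets = [0] + list(accumulate(counts))
starts, stops = offsets[:-1], offsets[1:]


def my_operation(flat):
    """Hypothetical non-ufunc operation; must preserve length and order."""
    return [x * x + 1 for x in flat]


out = my_operation(content)
if len(out) != len(content):
    raise ValueError("operation changed the content length; offsets no longer apply")

# Rebuild the jagged structure by slicing the output with the same offsets.
rebuilt = [out[start:stop] for start, stop in zip(starts, stops)]
print(rebuilt)
```

The length check matters because the starts and stops only describe the output if the operation keeps every element in place.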
