It's gradually becoming clear that properties like starts, stops, and content should not be public; they should be treated as internals. They're part of what makes a JaggedArray work, but not properties of any jagged data (any arrays whose type is [0, N) -> [0, inf) -> X for some array length N and inner type X).
You would like to not care that your jagged data is a ChunkedArray, and for many operations, like flatten() and counts, you don't care: they work equally well on a ChunkedArray as on a JaggedArray. But starts, stops, and content won't ever be like that: they don't have meaning beyond the specific implementation of a JaggedArray.
Consider, for instance, this ChunkedArray:
>>> import awkward
>>> array = awkward.ChunkedArray([awkward.fromiter([[1, 2, 3], [], [4, 5]]),
...                               awkward.fromiter([[100], [200, 300]])])
>>> array
<ChunkedArray [[1 2 3] [] [4 5] [100] [200 300]] at 0x796a969db6a0>
We can get at the starts and stops of each chunk, but it's probably not what you want:
>>> [x.starts for x in array.chunks]
[array([0, 3, 3]), array([0, 1])]
>>> [x.stops for x in array.chunks]
[array([3, 3, 5]), array([1, 3])]
(In your case, you have lazy arrays, which are a ChunkedArray of VirtualArray, so you would have to do x.array.starts instead of x.starts to unpack the VirtualArray. This is another property that should probably be internal, for the same reasons.)
Notice that the starts of the second chunk start over at 0? That's because the indexes are relative to the current chunk (so that chunks can be processed in parallel). If you were using starts as part of a data analysis, that would be an important correction. (You could make the numbers global by adding to each chunk the stops[-1] if len(stops) > 0 else 0 of every chunk before it, accumulated as you go.)
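The correction described in parentheses can be sketched as follows. This is only an illustration in plain NumPy, not part of the awkward API: the per-chunk arrays are copied by hand from the output above, and the names chunk_starts, chunk_stops, and shift are made up for the example. It also assumes each chunk's content is contiguous and begins at 0, so that a chunk's last stop equals its content length.

```python
import numpy

# Per-chunk starts and stops, copied by hand from the output above.
# Each chunk's indexes are relative to that chunk's own content.
chunk_starts = [numpy.array([0, 3, 3]), numpy.array([0, 1])]
chunk_stops = [numpy.array([3, 3, 5]), numpy.array([1, 3])]

global_starts, global_stops = [], []
shift = 0  # total content length of all chunks seen so far
for starts, stops in zip(chunk_starts, chunk_stops):
    global_starts.append(starts + shift)
    global_stops.append(stops + shift)
    # Assumption: the chunk's content is contiguous and starts at 0,
    # so its last stop is the amount to shift every later chunk by.
    shift += stops[-1] if len(stops) > 0 else 0

global_starts = numpy.concatenate(global_starts)
global_stops = numpy.concatenate(global_stops)
print(global_starts)  # [0 3 3 5 6]
print(global_stops)   # [3 3 5 6 8]
```

The shift has to accumulate over all preceding chunks, not just the one immediately before, which is why it is kept as a running total.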
A perhaps better alternative is to construct meaningful offsets from counts. The ChunkedArray has a functioning counts:
>>> array.counts
array([3, 0, 2, 1, 2])
The counts is the length of each subarray in the jagged data. It is the difference between consecutive offsets, and offsets is just starts and stops overlapped in a single array, so a cumulative sum of counts recovers them:
>>> import numpy
>>> offsets = numpy.empty(len(array) + 1, dtype=int)
>>> offsets[0] = 0
>>> numpy.cumsum(array.counts, out=offsets[1:])
array([3, 3, 5, 6, 8])
>>> offsets
array([0, 3, 3, 5, 6, 8])
>>> starts, stops = offsets[:-1], offsets[1:]
>>> starts
array([0, 3, 3, 5, 6])
>>> stops
array([3, 3, 5, 6, 8])
You can use these as starts and stops, but only if you have some kind of "content" for which the starts and stops are completely contiguous. That's not guaranteed for a JaggedArray's content, but flatten() will do that for you:
>>> content = array.flatten()
>>> content
<ChunkedArray [1 2 3 ... 100 200 300] at 0x796aa12b1940>
Now, for instance, the subarray at index 3 is:
>>> content[starts[3]:stops[3]]
<ChunkedArray [100] at 0x796a96a3b978>
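To check the whole construction end to end, here is a minimal sketch in plain NumPy. The counts and content arrays are hand-copied stand-ins for array.counts and array.flatten() from the example above (content is an ordinary NumPy array here, not a ChunkedArray), so nothing in it depends on awkward itself.

```python
import numpy

# Stand-ins for array.counts and array.flatten() from the example above.
counts = numpy.array([3, 0, 2, 1, 2])
content = numpy.array([1, 2, 3, 4, 5, 100, 200, 300])

# offsets[0] is 0; the rest is the running total of counts.
offsets = numpy.zeros(len(counts) + 1, dtype=int)
numpy.cumsum(counts, out=offsets[1:])
starts, stops = offsets[:-1], offsets[1:]

# Slicing the flattened content by (start, stop) pairs rebuilds every subarray.
subarrays = [content[i:j].tolist() for i, j in zip(starts, stops)]
print(subarrays)  # [[1, 2, 3], [], [4, 5], [100], [200, 300]]
```

Because these starts and stops come from counts rather than from any one chunk, they stay valid no matter how the data happens to be split into chunks.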
This is precisely why Awkward 0's ChunkedArray, JaggedArray, etc. are going to become internal "for experts only" classes in Awkward 1. The user interface in Awkward 1 will have a single awkward.Array class with a type that tells you whether it's jagged or not. Whether it's made out of chunks or something else will be an implementation detail.