I have a tree with one branch that stores a string. When I open the file with uproot.open() and read the tree with the arrays() method, I get the following:

>>> array_train['backtracked_end_process']
<ObjectArray [b'FastScintillation' b'FastScintillation' b'FastScintillation' ... b'FastScintillation' b'FastScintillation' b'FastScintillation'] at 0x7f48936e6c90>

I would like to use this branch to create masks with expressions like array_train['backtracked_end_process'] != b'FastScintillation', but unfortunately a comparison of this kind (here with ==) produces an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-97-a28f3706c5b5> in <module>
----> 1 array_train['backtracked_end_process'] == b'FastScintillation'

~/.local/lib/python3.7/site-packages/numpy/lib/mixins.py in func(self, other)
     23         if _disables_array_ufunc(other):
     24             return NotImplemented
---> 25         return ufunc(self, other)
     26     func.__name__ = '__{}__'.format(name)
     27     return func

~/.local/lib/python3.7/site-packages/awkward/array/objects.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    216                 contents.append(x)
    217 
--> 218         result = getattr(ufunc, method)(*contents, **kwargs)
    219 
    220         if self._util_iscomparison(ufunc):

~/.local/lib/python3.7/site-packages/awkward/array/jagged.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    987                 data = self._util_toarray(inputs[i], inputs[i].dtype)
    988                 if starts.shape != data.shape:
--> 989                     raise ValueError("cannot broadcast JaggedArray of shape {0} with array of shape {1}".format(starts.shape, data.shape))
    990 
    991                 if parents is None:

ValueError: cannot broadcast JaggedArray of shape (24035,) with array of shape ()

Does anyone have any suggestion on how to proceed? Being able to transform it to a numpy.chararray would already solve the problem, but I don't know how to do that.

Jim Pivarski

1 Answer

String handling is a weak point in uproot. It uses a custom ObjectArray (not even the StringArray in awkward-array), which generates bytes objects on demand. What you'd like is an array-of-strings class with == overloaded to mean "compare each variable-length string, broadcasting a single string to an array if necessary." Unfortunately, neither the uproot ObjectArray of strings nor the StringArray class in awkward-array does that yet.

So here's how you can do it, admittedly through an implicit Python for loop.

>>> import uproot, numpy
>>> f = uproot.open("http://scikit-hep.org/uproot/examples/sample-6.10.05-zlib.root")
>>> t = f["sample"]

>>> t["str"].array()
<ObjectArray [b'hey-0' b'hey-1' b'hey-2' ... b'hey-27' b'hey-28' b'hey-29'] at 0x7fe835b54588>

>>> numpy.array(list(t["str"].array()))
array([b'hey-0', b'hey-1', b'hey-2', b'hey-3', b'hey-4', b'hey-5',
       b'hey-6', b'hey-7', b'hey-8', b'hey-9', b'hey-10', b'hey-11',
       b'hey-12', b'hey-13', b'hey-14', b'hey-15', b'hey-16', b'hey-17',
       b'hey-18', b'hey-19', b'hey-20', b'hey-21', b'hey-22', b'hey-23',
       b'hey-24', b'hey-25', b'hey-26', b'hey-27', b'hey-28', b'hey-29'],
      dtype='|S6')

>>> numpy.array(list(t["str"].array())) == b"hey-0"
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])

The loop is implicit in the list constructor, which iterates over the ObjectArray and turns each element into a bytes string. A Python list is not good for array-at-a-time operations, so we then construct a NumPy array, which is (at the cost of padding every string to the length of the longest one, hence the dtype '|S6' above).
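
To make the padding cost concrete, here is a minimal sketch that uses only NumPy. The list `values` is a made-up stand-in for list(t["str"].array()), so it runs without the ROOT file; nothing here is uproot API.

```python
import numpy

# Hypothetical stand-in for list(t["str"].array()): a plain list of bytes objects.
values = [b"hey-0", b"hey-1", b"hey-10"]

# NumPy picks a fixed-width bytes dtype wide enough for the longest element,
# so shorter strings are padded up to that width.
arr = numpy.array(values)
print(arr.dtype)     # fixed-width bytes dtype
print(arr.itemsize)  # bytes reserved for every element

# With a regular NumPy array, == broadcasts the single string over all elements.
mask = arr == b"hey-0"
print(mask)
```

If one string in the branch is much longer than the rest, every element pays for that width, which is worth keeping in mind for large trees.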

Alternative, probably better:

While writing this, I remembered that uproot's ObjectArray is implemented using an awkward JaggedArray, so the transformation above can be performed with JaggedArray's regular method, which is probably much faster (no intermediate Python bytes objects, no Python for loop).

>>> t["str"].array().regular()
array([b'hey-0', b'hey-1', b'hey-2', b'hey-3', b'hey-4', b'hey-5',
       b'hey-6', b'hey-7', b'hey-8', b'hey-9', b'hey-10', b'hey-11',
       b'hey-12', b'hey-13', b'hey-14', b'hey-15', b'hey-16', b'hey-17',
       b'hey-18', b'hey-19', b'hey-20', b'hey-21', b'hey-22', b'hey-23',
       b'hey-24', b'hey-25', b'hey-26', b'hey-27', b'hey-28', b'hey-29'],
      dtype=object)

>>> t["str"].array().regular() == b"hey-0"
array([ True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False])

(The functionality described above wasn't created intentionally, but it works because the right pieces compose in a fortuitous way.)
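
For the masking use case in the question, a sketch along these lines should work once the strings are in a NumPy array. The arrays `process` and `energy` are invented stand-ins (assumptions, not branches from any real file): `process` mimics the dtype=object array that regular() returned above, and `energy` is a hypothetical second branch with one value per entry.

```python
import numpy

# Stand-in for array_train['backtracked_end_process'].regular() (assumption).
process = numpy.array(
    [b"FastScintillation", b"Decay", b"FastScintillation", b"Capture"],
    dtype=object,
)

# Hypothetical second branch with one number per entry.
energy = numpy.array([1.5, 2.5, 3.5, 4.5])

# Elementwise comparison against a single bytes string gives a boolean mask.
mask = process != b"FastScintillation"

# The mask selects the matching entries of any array with the same length.
selected = energy[mask]
print(mask)
print(selected)
```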

Jim Pivarski
  • The second method works well and it is fast. Do you know if something similar can be used also in case `t["str"].array()` is a `JaggedArray`? Should I transform it in a `numpy` array, perform the comparison, and then build back the `JaggedArray`? – Nicolò Foppiani Nov 18 '19 at 23:52
  • If it's a jagged array of strings, `regular` will regularize both dimensions. – Jim Pivarski Nov 18 '19 at 23:59