
I have a dataset of 2D audio data. These audio fragments differ in length, hence I'm using Awkward Array. Through a Boolean mask, I want to only return the parts containing speech.

Table mask attempt

import numpy as np
import awkward as aw

awk = aw.fromiter([{"ch0": np.array([0, 1, 2]), "ch1": np.array([3, 4, 5])},
                   {"ch0": np.array([6, .7]), "ch1": np.array([8, 9])}])
# [{'ch0': [0.0, 1.0, 2.0], 'ch1': [3, 4, 5]},
#  {'ch0': [6.0, 0.7], 'ch1': [8, 9]}]

awk_mask = aw.fromiter([{"op": np.array([False, True, False]), "cl": np.array([True, True, False])},
                        {"op": np.array([True, True]), "cl": np.array([True, False])}])
# [{'cl': [True, True, False], 'op': [False, True, False]},
#  {'cl': [True, False], 'op': [True, True]}]

awk[awk_mask]
# TypeError: cannot interpret dtype [('cl', 'O'), ('op', 'O')] as a fancy index or mask

It seems that a Table cannot be used for fancy indexing.
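One possible workaround is to mask each field separately and pair every data field with its mask field by name. The sketch below uses plain NumPy and a dict of lists as a hypothetical stand-in for the Table, so it does not depend on any awkward version; the `pairing` dict, the `mask_fields` helper, and the sample values are assumptions made for illustration. In awkward 0.x the same idea would presumably look like `awk["ch0"][awk_mask["op"]]`, since each Table column is a JaggedArray, but that line is untested here.

```python
import numpy as np

# Hypothetical stand-in for the Table: field name -> list of variable-length arrays.
audio = {
    "ch0": [np.array([0, 1, 2]), np.array([6, 7])],
    "ch1": [np.array([3, 4, 5]), np.array([8, 9])],
}
masks = {
    "op": [np.array([False, True, False]), np.array([True, True])],
    "cl": [np.array([True, True, False]), np.array([True, False])],
}

# Assumption: which mask field applies to which data field is stated explicitly.
pairing = {"ch0": "op", "ch1": "cl"}


def mask_fields(data, masks, pairing):
    """Apply the paired Boolean mask to every row of every field."""
    return {
        field: [row[m] for row, m in zip(data[field], masks[pairing[field]])]
        for field in data
    }


speech = mask_fields(audio, masks, pairing)
# speech["ch0"] holds [1] and [6, 7]; speech["ch1"] holds [3, 4] and [8]
```

The field names survive the selection, but the Python-level loop over rows is not vectorized, so this is a readability sketch rather than an efficient solution.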

Array mask attempts

Numpy equivalent

nparr = np.arange(0,6).reshape((2, -1))
# array([[0, 1, 2],
#        [3, 4, 5]])

npmask = np.array([True, False, True])
nparr[:, npmask]
# array([[0, 2],
#        [3, 5]])

Table version attempt; failed

awk[:, npmask]
# NotImplementedError: multidimensional index through a Table (TODO: needed for [0, n) -> [0, m) -> "table" -> ...)

It seems that multidimensional selection through a Table is not implemented yet.

JaggedArray - Numpy mask version; works

jarr = aw.fromiter(nparr)
# <JaggedArray [[0 1 2] [3 4 5]] at 0x..>

jarr[:, npmask]
# array([[0, 2],
#        [3, 5]])

JaggedArray - JaggedArray mask version; works

jmask = aw.fromiter(npmask)
# array([ True, False,  True])

jarr[:, jmask]
# array([[0, 2],
#        [3, 5]])

Questions

  • How can I do efficient Boolean mask selection on a Table, or on arrays with named dimensions (as in xarray)?
  • Will multidimensional selection through a Table be implemented in awkward-array, or only in awkward-1.0?

Library versions

import pandas as pd  # only needed to report the installed version

print("numpy version  : ", np.__version__)  # numpy version  :  1.17.3
print("pandas version : ", pd.__version__)  # pandas version :  0.25.3
print("awkward version : ", aw.__version__)  # awkward version :  0.12.14
NumesSanguis

1 Answer


This does not use named array dimensions, but masked selection is possible when both the data and the mask are plain JaggedArrays:

jarr_2d = aw.fromiter([[np.array([0, 1, 2]), np.array([3, 4, 5])],
                       [np.array([6, 7]), np.array([8, 9])]])
# <JaggedArray [[[0 1 2] [3 4 5]] [[6 7] [8 9]]] at 0x7fc9c7c4e750>

jarr_2d_mask = aw.fromiter([[np.array([False, True, False]), np.array([True, True, False])],
                            [np.array([True, True]), np.array([True, False])]])
# <JaggedArray [[[False True False] [True True False]] [[True True] [True False]]] at 0x7fc9c7c1e590>


jarr_2d[jarr_2d_mask]
# <JaggedArray [[[1] [3 4]] [[6 7] [8]]] at 0x7fc9c7c5b690>

I'm not sure whether this is efficient, especially compared to fancy indexing with plain NumPy arrays.
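On the efficiency point, it may help to see what a jagged mask amounts to when the data is stored as one flat content array plus per-row lengths, which is conceptually how a JaggedArray is laid out. The sketch below is plain NumPy under that assumption and is not the library's implementation; the names `content`, `counts`, and `flat_mask` are hypothetical. Awkward 0.x also has constructors such as `JaggedArray.fromcounts` that accept this layout instead of going through `fromiter`; the sketch does not call them, so check the documentation for your version.

```python
import numpy as np

# Four variable-length rows stored as one flat array plus per-row lengths.
content = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
counts = np.array([3, 3, 2, 2])

# Flat Boolean mask, aligned element by element with `content`.
flat_mask = np.array([False, True, False, True, True, False,
                      True, True, True, False])

# A single vectorized selection over all rows at once.
selected = content[flat_mask]

# New row lengths: number of True values inside each row's [start, stop) range.
stops = np.cumsum(counts)
starts = stops - counts
running = np.concatenate(([0], np.cumsum(flat_mask)))
new_counts = running[stops] - running[starts]

# Splitting back into rows is only needed for display.
rows = np.split(selected, np.cumsum(new_counts)[:-1])
# rows holds [1], [3, 4], [6, 7], [8]
```

The selection itself is one Boolean index on a contiguous array, so the cost is comparable to masking an ordinary NumPy array; the overhead in the answer's code comes mostly from building the arrays element by element.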

NumesSanguis
  • `fromiter` should be considered inefficient in Awkward 0.x (but not in Awkward 1.x). – Jim Pivarski Dec 05 '19 at 18:31
  • What would be the efficient way of doing this in Awkward 0.x @JimPivarski ? – NumesSanguis Dec 06 '19 at 00:28
  • Also, this doesn't solve the initial question, as it doesn't work on `Table` arrays. – NumesSanguis Dec 06 '19 at 05:44
  • (I used a comment, rather than an answer, because I didn't have time to fully answer the question but wanted to give some information.) – Jim Pivarski Dec 06 '19 at 20:18
  • (Since I posted quite a long question, it's common for people not to read everything in detail. Therefore, I wanted to point out with my comment that this is not a final answer. @JimPivarski Thanks for providing even a little bit of extra information, and indeed `fromiter` is slow for big arrays. Please let me know if there is a more efficient way, if you have time to answer.) – NumesSanguis Dec 10 '19 at 07:21