Efficient method to replace values in awkward array according to a dictionary?

Question

I have a dictionary with integer keys and float values. I also have a 2D awkward array with integer entries (I'm using awkward1). I want to replace these integers with the corresponding float according to the dictionary, keeping the awkward array format.

Assuming the keys run from 0 to 999, my solution so far is something like this:

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,1000):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

Is there a faster way to do this?

Update

Minimal reproducible example of my working code:

import awkward as ak # Awkward 1

myArray = ak.from_iter([[0, 1], [2, 1, 0]]) # Creating example array
myDict = {0: 19.5, 1: 34.1, 2: 10.9}

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,3):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

myArray:

<Array [[0, 1], [2, 1, 0]] type='2 * var * int64'>

resultArray:

<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

Please make a [mcve] with appropriate imports, sample input, and desired vs. actual output. — Mark Tolonen, Jan 15 '21 at 01:16
I don't have time for a full answer, but you might want to consider using `np.searchsorted` as described here: https://github.com/scikit-hep/awkward-1.0/discussions/633 Replace your dict with two NumPy arrays, one with sorted keys and the other with values in the same order. You'll need to flatten/unflatten the Awkward Array, or otherwise extract the one-dimensional arrays from it. An `ak.searchsorted` to do this automatically would be nice, but a convenience function like that does not exist. — Jim Pivarski, Jan 15 '21 at 01:49

score 1 · Accepted Answer · answered Jan 15 '21 at 18:30

When I mentioned in a comment that np.searchsorted is where you should be looking, I hadn't noticed that myDict includes every consecutive integer as a key. Having a dense lookup table like this would allow faster algorithms, which also happen to be simpler in Awkward Array.

So, assuming that there's a key in myDict for each integer from 0 up to some value, you can equally well represent the lookup table as

>>> lookup = ak.Array([myDict[i] for i in range(len(myDict))])
>>> lookup
<Array [19.5, 34.1, 10.9] type='3 * float64'>

The problem of picking values at 0, 1, and 2 becomes just an array-slice. (This array-slice is an O(n) algorithm for array length n, unlike np.searchsorted, which does a binary search per element and so costs O(n log k) for k lookup keys. That's the cost of having sparse lookup keys.)

The problem, however, is that myArray is nested and lookup is not. We can give lookup the same depth as myArray by slicing it up:

>>> import numpy as np
>>> multilookup = lookup[np.newaxis][np.zeros(len(myArray), np.int64)]
>>> multilookup
<Array [[19.5, 34.1, 10.9, ... 34.1, 10.9]] type='2 * 3 * float64'>
>>> multilookup.tolist()
[[19.5, 34.1, 10.9], [19.5, 34.1, 10.9]]

And then multilookup[myArray] is exactly what you want:

>>> multilookup[myArray]
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

The lookup had to be duplicated because each list within myArray uses global indexes in the whole lookup. If the memory involved in creating multilookup is prohibitive, you could instead break myArray down to match it:

>>> flattened, num = ak.flatten(myArray), ak.num(myArray)
>>> flattened
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> num
<Array [2, 3] type='2 * int64'>
>>> lookup[flattened]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(lookup[flattened], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

If your keys are not dense from 0 up to some integer, then you'll have to use np.searchsorted:

>>> keys = ak.Array(sorted(myDict.keys()))  # np.searchsorted needs sorted keys
>>> values = ak.Array([myDict[key] for key in keys])
>>> keys
<Array [0, 1, 2] type='3 * int64'>
>>> values
<Array [19.5, 34.1, 10.9] type='3 * float64'>

In this case, the keys are trivial because they are dense. When using np.searchsorted, you have to explicitly cast the flat Awkward Arrays as NumPy (for now; we're looking to fix that).

>>> lookup_index = np.searchsorted(np.asarray(keys), np.asarray(flattened), side="left")
>>> lookup_index
array([0, 1, 2, 1, 0])

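Then we pass it through the trivial keys (which doesn't change it, in this case) before passing it to the values. This extra step is only valid because the keys here are dense; with arbitrary sparse keys, lookup_index already holds positions in values, so values[lookup_index] is the expression to use.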

>>> keys[lookup_index]
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> values[keys[lookup_index]]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(values[keys[lookup_index]], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

The thing I was waffling about in yesterday's comment was that you have to do this on the flattened form of myArray (flattened) and reintroduce the structure afterward with ak.unflatten, as above. Perhaps we should wrap np.searchsorted as ak.searchsorted so that it recognizes a fully structured Awkward Array in the second argument, at least. (The first argument has to be unstructured.)

Thanks for the detailed explanation. The case where the integer keys are not dense is useful for me as well, and I believe that `values[lookup_index]` is what is needed for this case, since `keys[lookup_index]` will not index the `values` array properly since the keys are arbitrary. — Chami Sangeeth Amarasinghe, Jan 15 '21 at 22:25
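
The point in this comment can be checked on a small flat example. The sketch below uses plain NumPy and made-up sparse keys (3, 10, 42) that are not from the original question; it shows that values[lookup_index] gives the intended result, while routing the index through keys first does not index values properly.

```python
import numpy as np

# Illustrative sparse mapping; these keys are an assumption for the example.
sparse_dict = {3: 19.5, 10: 34.1, 42: 10.9}

keys = np.array(sorted(sparse_dict))                        # sorted keys
values = np.array([sparse_dict[k] for k in keys.tolist()])  # same order as keys

# Stand-in for the flattened Awkward Array of integer entries.
flattened = np.array([3, 10, 42, 10, 3])

# Position of each entry within the sorted keys.
lookup_index = np.searchsorted(keys, flattened, side="left")

# Correct for sparse keys: the index already refers to positions in values.
replaced = values[lookup_index]
print(replaced.tolist())  # [19.5, 34.1, 10.9, 34.1, 19.5]

# Passing the index through keys turns positions back into key values,
# which are out of range for the 3-element values array here.
try:
    values[keys[lookup_index]]
    wrong_indexing_failed = False
except IndexError:
    wrong_indexing_failed = True
```

Note that np.searchsorted returns an insertion position even for entries that are not in keys, so a check such as comparing keys[lookup_index] against flattened is a reasonable guard if missing keys are possible.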
