
R user here, and I'm attempting my first project in Python to take advantage of Numba. I've read that Numba works very well with Numpy but not well with Pandas, so I am attempting to avoid Pandas. My current question actually has nothing to do with Numba, but I wanted to mention it as my reason for avoiding Pandas.

I have two Numpy structured arrays, one with many duplicates. I am attempting to join them using the "numpy.lib.recfunctions.join_by" function, but the documentation explicitly states that duplicates cause problems. Can anybody recommend any workarounds for all my duplicates?

Here is an example similar to my situation:

import numpy as np
import numpy.lib.recfunctions as rfn

a = np.zeros(4, dtype={'names':('name', 'age'),
                       'formats':('U10','f8')})
a['name'] = ['Alex', 'Billy', 'Charlie', 'Dave']
a['age'] = [25, 25, 75, 75]

b = np.zeros(2, dtype={'names':('age', 'senior'),
                       'formats':('f8', 'i4')})
b['age'] = [25, 75]
b['senior'] = [0, 1]

c = rfn.join_by('age', a, b, jointype='leftouter', usemask=False)

print(c)
[(25., 'Alex',      0) (75., 'Billy',      1) (75., 'Charlie', 999999)
(75., 'Dave', 999999)]

This (1) changes the "age" of Billy from 25 to 75 and (2) gives a "senior" value of 999999 for Charlie & Dave.

Does anybody have a workaround for the duplicates restriction of this function? Thanks in advance.

Frank
  • This looks like a job for pandas, not numpy. – DYZ Mar 21 '19 at 03:57
  • Both Pandas and Numpy use fast compiled code underneath. You can safely use Pandas because of this; Numba will not provide a significant speedup over Pandas. But then again, to measure is to know. – Jurgen Strydom Mar 21 '19 at 04:00
    I hear that this is easier with Pandas, but like I mentioned, I'd really like to find a workaround in Numpy to take advantage of Numba (which supposedly doesn't work well with Pandas) – Frank Mar 21 '19 at 04:01
  • Does my below answer help? – Jurgen Strydom Mar 21 '19 at 05:00
  • Possible duplicate of https://stackoverflow.com/questions/53257916/numpy-how-to-left-join-arrays-with-duplicates . In any case, the answer there can be modified slightly to answer this. – Anuj Kumar Feb 07 '20 at 23:16

2 Answers


Why not do a comparison instead of joining? That works much better in your example.

I realize this won't work for arbitrary joins where you have a set of keys that have to map to values. There I recommend you loop over the keys and build up the array from scratch: start with an empty array filled with NaNs, and use np.where to find and replace the values in the array.
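A minimal sketch of that loop-over-keys idea, using made-up arrays (the variable names here are illustrative, not from the question):

```python
import numpy as np

keys = np.array([25.0, 75.0])               # unique join keys from the lookup table
values = np.array([0, 1])                   # 'senior' value for each key
ages = np.array([25.0, 25.0, 75.0, 75.0])   # possibly-duplicated keys to join onto

# Start from an array of NaNs, then fill in matches one key at a time.
senior = np.full(ages.shape, np.nan)
for k, v in zip(keys, values):
    senior = np.where(ages == k, v, senior)

print(senior)  # [0. 0. 1. 1.]
```

Any age that matches no key stays NaN, which also gives you a cleaner "no match" marker than join_by's 999999 fill value.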

Using this starting code:

import numpy as np
import numpy.lib.recfunctions as rfn

a = np.zeros(4, dtype={'names':('name', 'age'),
                       'formats':('U10','f8')})
a['name'] = ['Alex', 'Billy', 'Charlie', 'Dave']
a['age'] = [25, 25, 75, 75]

you can do:

d = rfn.append_fields(a, names='senior', data=(a['age'] >= 65).astype(int))
print(d)

which results in:

[('Alex', 25.0, 0) ('Billy', 25.0, 0) ('Charlie', 75.0, 1) ('Dave', 75.0, 1)]

The main reason for using Numba is to speed up Python code; Numpy and Pandas already have these speedups under the hood.

Jurgen Strydom

Under the covers, the recfunctions usually construct a new dtype and a 'blank' result array, then copy values over by field name. I haven't studied join_by, but I can imagine your join looking like this:

In [11]: a.dtype                                                                          
Out[11]: dtype([('name', '<U10'), ('age', '<f8')])
In [12]: b.dtype                                                                          
Out[12]: dtype([('age', '<f8'), ('senior', '<i4')])
In [13]: b.dtype[1]                                                                       
Out[13]: dtype('int32')
In [14]: b.dtype.descr                                                                    
Out[14]: [('age', '<f8'), ('senior', '<i4')]

In [16]: dt = np.dtype(a.dtype.descr+[b.dtype.descr[1]])                                  
In [17]: dt                                                                               
Out[17]: dtype([('name', '<U10'), ('age', '<f8'), ('senior', '<i4')])

In [18]: e = np.zeros(a.shape, dt)                                                        
In [19]: for name in a.dtype.names: 
    ...:     e[name] = a[name] 
    ...:                                                                                  

In [21]: e                                                                                
Out[21]: 
array([('Alex', 25., 0), ('Billy', 25., 0), ('Charlie', 75., 0),
       ('Dave', 75., 0)],
      dtype=[('name', '<U10'), ('age', '<f8'), ('senior', '<i4')])

With a bit of trial and error I found this way of pairing the b ages with the a (now e) ones:

In [23]: e['age'][:,None]==b['age']                                                       
Out[23]: 
array([[ True, False],
       [ True, False],
       [False,  True],
       [False,  True]])
In [25]: np.where(Out[23])                                                                
Out[25]: (array([0, 1, 2, 3]), array([0, 0, 1, 1]))

Now just copy the corresponding 'senior' values from b to e:

In [27]: e['senior'][Out[25][0]] = b['senior'][Out[25][1]]                                
In [28]: e                                                                                
Out[28]: 
array([('Alex', 25., 0), ('Billy', 25., 0), ('Charlie', 75., 1),
       ('Dave', 75., 1)],
      dtype=[('name', '<U10'), ('age', '<f8'), ('senior', '<i4')])

The underlying logic does not depend on these being structured arrays. We could just as well have individual 1d arrays of names, ages, senior_category_age, etc.
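To illustrate that point, here is the same pairing done with plain 1-d arrays, condensed into one sketch (variable names are mine, not from the answer):

```python
import numpy as np

names = np.array(['Alex', 'Billy', 'Charlie', 'Dave'])
ages = np.array([25., 25., 75., 75.])
b_age = np.array([25., 75.])
b_senior = np.array([0, 1])

# Broadcast-compare every a-age against every b-age; nonzero gives the
# (row, column) index pairs where they match.
rows, cols = np.nonzero(ages[:, None] == b_age)

# Copy the matching 'senior' values across, leaving 0 where nothing matched.
senior = np.zeros(ages.shape, dtype=int)
senior[rows] = b_senior[cols]
print(senior)  # [0 0 1 1]
```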

The recfunctions don't get a lot of use - as is evident from the separate packaging, and from the limited number of SO questions about them. However, recent changes in multifield indexing will, I think, increase their use, at least for the newly added functions.

https://docs.scipy.org/doc/numpy/user/basics.rec.html#accessing-multiple-fields

hpaulj