R user here, attempting my first project in Python to take advantage of Numba. I've read that Numba works very well with NumPy but not well with pandas, so I am trying to avoid pandas. My question actually has nothing to do with Numba, but I mention it as my reason for avoiding pandas.
I have two NumPy structured arrays, one of which has many duplicate values in the join key. I am trying to join them with "numpy.lib.recfunctions.join_by", but the documentation explicitly states that neither input should have duplicates along the key, because duplicates make the output unreliable. Can anybody recommend a workaround for all my duplicates?
Here is an example similar to my situation:
import numpy as np
import numpy.lib.recfunctions as rfn
a = np.zeros(4, dtype={'names': ('name', 'age'),
                       'formats': ('U10', 'f8')})
a['name'] = ['Alex', 'Billy', 'Charlie', 'Dave']
a['age'] = [25, 25, 75, 75]
b = np.zeros(2, dtype={'names': ('age', 'senior'),
                       'formats': ('f8', 'i4')})
b['age'] = [25, 75]
b['senior'] = [0, 1]
c = rfn.join_by('age', a, b, jointype='leftouter', usemask=False)
print(c)
Output:
[(25., 'Alex', 0) (75., 'Billy', 1) (75., 'Charlie', 999999)
 (75., 'Dave', 999999)]
The result is wrong in two ways: (1) Billy's "age" changes from 25 to 75, and (2) Charlie and Dave get a "senior" value of 999999 (the default fill value for masked integers) instead of the expected 1.
Does anybody have a workaround for the duplicates restriction of this function? Thanks in advance.
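One direction that might work is a manual lookup with np.searchsorted instead of join_by. The following is only a minimal sketch, assuming the key is unique in the right-hand array (as "age" is in b above) and that the right-hand array is not empty. The helper name left_join_unique and its fill_value parameter are hypothetical, not part of NumPy.

```python
import numpy as np

def left_join_unique(key, left, right, fill_value=-1):
    """Left outer join of two structured arrays on a single field.

    Hypothetical helper (not a NumPy function). Assumes `right[key]`
    has no duplicates and `right` is non-empty; `left[key]` may
    contain any number of duplicates.
    """
    # Sort the right-hand keys so they can be binary-searched.
    order = np.argsort(right[key])
    sorted_keys = right[key][order]

    # For each left key, find where it would sit among the sorted right keys.
    pos = np.searchsorted(sorted_keys, left[key])
    pos = np.minimum(pos, len(sorted_keys) - 1)  # keep indices in bounds
    matched = sorted_keys[pos] == left[key]      # True where a real match exists
    idx = order[pos]                             # row of `right` for each left row

    # Output dtype: all fields of `left`, then the non-key fields of `right`.
    extra = [n for n in right.dtype.names if n != key]
    dtype = [(n, left.dtype[n]) for n in left.dtype.names]
    dtype += [(n, right.dtype[n]) for n in extra]

    out = np.empty(len(left), dtype=dtype)
    for n in left.dtype.names:
        out[n] = left[n]
    for n in extra:
        # Unmatched rows receive fill_value instead of a looked-up value.
        out[n] = np.where(matched, right[n][idx], fill_value)
    return out

# Same data as in the example above.
a = np.zeros(4, dtype={'names': ('name', 'age'),
                       'formats': ('U10', 'f8')})
a['name'] = ['Alex', 'Billy', 'Charlie', 'Dave']
a['age'] = [25, 25, 75, 75]
b = np.zeros(2, dtype={'names': ('age', 'senior'),
                       'formats': ('f8', 'i4')})
b['age'] = [25, 75]
b['senior'] = [0, 1]

c = left_join_unique('age', a, b)
print(c['senior'].tolist())  # [0, 0, 1, 1]
```

Note that the field order differs from join_by (here the fields of a come first, with the key left in place), and that this sketch handles only a single key field. Since it uses only plain NumPy indexing, it avoids pandas entirely.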