I'd like to left outer join two recarrays. The first is a list of entities with a unique key. The second is a list of values, and there can be 0 or more values per entity. My environment requires that I use Python 2.7 and I'm not able to use Pandas.
This question has been asked here before, but it did not receive a good answer.
import numpy as np
import numpy.lib.recfunctions
from pprint import pprint

dtypes = [('point_index', int), ('name', 'S50')]
recs = [(0, 'Bob'),
        (1, 'Bob'),
        (2, 'Sue'),
        (3, 'Sue'),
        (4, 'Jim')]
x = np.rec.fromrecords(recs, dtype=dtypes)

dtypes = [('point_index', int), ('type', 'S500'), ('value', float)]
recs = [(0, 'a', 0.1),
        (0, 'b', 0.2),
        (1, 'a', 0.3),
        (2, 'b', 0.4),
        (2, 'b', 0.5),
        (4, 'a', 0.6),
        (4, 'a', 0.7),
        (4, 'a', 0.8)]
y = np.rec.fromrecords(recs, dtype=dtypes)

j = np.lib.recfunctions.join_by('point_index', x, y, jointype='leftouter',
                                usemask=False, asrecarray=True)
pprint(j.tolist())
I want:
# [(0,'Bob','a',0.1),
# (0,'Bob','b',0.2),
# (1,'Bob','a',0.3),
# (2,'Sue','b',0.4),
# (2,'Sue','b',0.5),
# (4,'Jim','a',0.6),
# (4,'Jim','a',0.7),
# (4,'Jim','a',0.8)]
But I get:
[(0, 'Bob', 'a', 0.1),
(0, 'Bob', 'b', 0.2),
(1, 'Sue', 'a', 0.3),
(2, 'Jim', 'b', 0.4),
(2, 'N/A', 'b', 0.5),
(3, 'Sue', 'N/A', 1e+20),
(4, 'N/A', 'a', 0.6),
(4, 'N/A', 'a', 0.7),
(4, 'N/A', 'a', 0.8)]
I know why. The docs for join_by say:
Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.
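To make that condition concrete, here is a small check that reports which key values repeat. The helper name duplicated_keys is my own and not part of numpy. On the key columns from the data above, x has no repeats while y repeats 0, 2 and 4, which is exactly the case join_by does not handle.

```python
import numpy as np

def duplicated_keys(keys):
    """Return the sorted key values that occur more than once in `keys`.

    Hypothetical helper for illustration; not a numpy function.
    """
    keys = np.sort(np.asarray(keys))
    # After sorting, a duplicate is any value equal to its predecessor.
    repeats = keys[1:][keys[1:] == keys[:-1]]
    return np.unique(repeats)

# Key columns copied from x and y above.
x_keys = np.array([0, 1, 2, 3, 4])
y_keys = np.array([0, 0, 1, 2, 2, 4, 4, 4])

print(duplicated_keys(x_keys).tolist())  # []
print(duplicated_keys(y_keys).tolist())  # [0, 2, 4]
```

This only diagnoses the problem; it does not perform the join.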
So this requirement really limits the usefulness of the function. The kind of left outer join I describe seems like a very common operation. Does anybody know how to achieve it using numpy?
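For reference, the kind of thing I have in mind is roughly the sketch below. It is a minimal, unoptimized attempt and makes several assumptions: the function name left_outer_join and its fill_str / fill_num parameters are my own, keys on the left side must be unique, and non-string fields on the right are assumed to be floats (an integer field would need a different filler). It uses np.searchsorted to find the matching left row for each right row, then appends left rows that had no match.

```python
import numpy as np

def left_outer_join(key, left, right, fill_str='N/A', fill_num=1e20):
    """Sketch of a left outer join; keys in `left` are assumed unique."""
    # Sort the left keys so searchsorted can locate each right-hand key.
    order = np.argsort(left[key])
    lkeys = left[key][order]
    pos = np.searchsorted(lkeys, right[key])
    pos[pos == len(lkeys)] = 0            # clamp so the lookup below is valid
    matched = lkeys[pos] == right[key]    # right rows whose key exists in left

    right = right[matched]
    left_idx = order[pos[matched]]        # left row index for each kept right row
    unmatched = np.setdiff1d(np.arange(len(left)), left_idx)

    right_names = [n for n in right.dtype.names if n != key]
    out_dtype = ([(n, left.dtype[n]) for n in left.dtype.names] +
                 [(n, right.dtype[n]) for n in right_names])
    n = len(left_idx)
    out = np.empty(n + len(unmatched), dtype=out_dtype)

    for name in left.dtype.names:
        out[name][:n] = left[name][left_idx]
        out[name][n:] = left[name][unmatched]
    for name in right_names:
        out[name][:n] = right[name]
        # Filler for left rows without a match (assumes string or float fields).
        is_text = right.dtype[name].kind in 'SU'
        out[name][n:] = fill_str if is_text else fill_num

    # Stable sort keeps the original order of right rows within each key.
    return out[np.argsort(out[key], kind='mergesort')]

x = np.array([(0, 'Bob'), (1, 'Bob'), (2, 'Sue'), (3, 'Sue'), (4, 'Jim')],
             dtype=[('point_index', int), ('name', 'S50')])
y = np.array([(0, 'a', 0.1), (0, 'b', 0.2), (1, 'a', 0.3), (2, 'b', 0.4),
              (2, 'b', 0.5), (4, 'a', 0.6), (4, 'a', 0.7), (4, 'a', 0.8)],
             dtype=[('point_index', int), ('type', 'S500'), ('value', float)])

j = left_outer_join('point_index', x, y)
print(j['point_index'].tolist())  # [0, 0, 1, 2, 2, 3, 4, 4, 4]
```

Note that this keeps point_index 3 with filler values, as a left outer join normally would; dropping the unmatched rows instead gives the eight rows listed under "I want". I would still prefer something built into numpy if it exists.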