
I'd like to left outer join two recarrays. The first is a list of entities with a unique key. The second is a list of values, and there can be 0 or more values per entity. My environment requires that I use Python 2.7 and I'm not able to use Pandas.

This question has been asked before here but there was not a good answer.

    import numpy as np
    import numpy.lib.recfunctions
    from pprint import pprint

    dtypes = [('point_index',int),('name','S50')] 
    recs = [(0,'Bob'),
            (1,'Bob'),
            (2,'Sue'),
            (3,'Sue'),
            (4,'Jim')]
    x = np.rec.fromrecords(recs,dtype=dtypes)

    dtypes = [('point_index',int),('type','S500'),('value',float)] 
    recs = [(0,'a',0.1),
            (0,'b',0.2),
            (1,'a',0.3),
            (2,'b',0.4),
            (2,'b',0.5),
            (4,'a',0.6),
            (4,'a',0.7),
            (4,'a',0.8)]
    y = np.rec.fromrecords(recs,dtype=dtypes)

    j = np.lib.recfunctions.join_by('point_index',x,y,jointype='leftouter',usemask=False,asrecarray=True)

    pprint(j.tolist())

I want

    [(0, 'Bob', 'a', 0.1),
     (0, 'Bob', 'b', 0.2),
     (1, 'Bob', 'a', 0.3),
     (2, 'Sue', 'b', 0.4),
     (2, 'Sue', 'b', 0.5),
     (4, 'Jim', 'a', 0.6),
     (4, 'Jim', 'a', 0.7),
     (4, 'Jim', 'a', 0.8)]

But I get

    [(0, 'Bob', 'a', 0.1),
     (0, 'Bob', 'b', 0.2),
     (1, 'Sue', 'a', 0.3),
     (2, 'Jim', 'b', 0.4),
     (2, 'N/A', 'b', 0.5),
     (3, 'Sue', 'N/A', 1e+20),
     (4, 'N/A', 'a', 0.6),
     (4, 'N/A', 'a', 0.7),
     (4, 'N/A', 'a', 0.8)]

I know why; this is from the docs:

> Neither r1 nor r2 should have any duplicates along key: the presence of duplicates will make the output quite unreliable. Note that duplicates are not looked for by the algorithm.

So, it seems like this requirement really limits the usefulness of this function. It seems like the type of left outer join I describe is a really common operation, does anybody know how to achieve it using numpy?

Ben Carlson
  • `join_by` is a long Python function, with lots of concatenates, sorts, etc. Since it isn't compiled, it isn't going to be any faster than code that you could write yourself. My first thought is to collect information by key in a defaultdict, and build a structured array from that. – hpaulj Aug 02 '15 at 18:25
  • Are the `pont_index` values in `x` always sorted and contiguous, ie. always [0,1,2...]? – hpaulj Aug 02 '15 at 18:44
  • hpaulj - yes to both! – Ben Carlson Aug 02 '15 at 18:48
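Following hpaulj's first comment, a minimal sketch of the defaultdict approach: bucket the value rows by key, then emit one joined record per value (the variable names beyond `x` and `y` are my own; entities with no values, like point 3, simply produce no output rows, matching the desired result above):

```python
import numpy as np
from collections import defaultdict

dtypes_x = [('point_index', int), ('name', 'S50')]
x = np.rec.fromrecords([(0, 'Bob'), (1, 'Bob'), (2, 'Sue'),
                        (3, 'Sue'), (4, 'Jim')], dtype=dtypes_x)

dtypes_y = [('point_index', int), ('type', 'S500'), ('value', float)]
y = np.rec.fromrecords([(0, 'a', 0.1), (0, 'b', 0.2), (1, 'a', 0.3),
                        (2, 'b', 0.4), (2, 'b', 0.5), (4, 'a', 0.6),
                        (4, 'a', 0.7), (4, 'a', 0.8)], dtype=dtypes_y)

# Bucket the value rows of y by their key.
by_key = defaultdict(list)
for row in y:
    by_key[row['point_index']].append((row['type'], row['value']))

# Walk the entities in order and emit one joined tuple per value.
joined = [(pi, name, t, v)
          for pi, name in zip(x['point_index'], x['name'])
          for t, v in by_key[pi]]

out_dtype = [('point_index', int), ('name', 'S50'),
             ('type', 'S500'), ('value', float)]
j = np.rec.fromrecords(joined, dtype=out_dtype)
```

This avoids `join_by` entirely, so duplicate keys in `y` are handled naturally.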

1 Answer


If the point_index values of x are in numeric order, you could match them with y by simple indexing.

One way is to construct a new array, z, with the added names field. Here I'm using structured arrays (rec would also work, but I don't need its extra functionality):

In [419]: dtypes1 = [('point_index',int),('name','S50')]    
In [420]: dtypes
Out[420]: [('point_index', int), ('type', 'S500'), ('value', float)]
In [421]: dtypes2=dtypes1 + dtypes[1:]
In [422]: z=np.zeros(y.shape[0],dtype=dtypes2)

Fill z with matching fields from y:

In [423]: for n in y.dtype.names:
    z[n] = y[n]

Since the number of fields is usually much smaller than the number of rows, this field-by-field copy is not expensive.

Select names by simple indexing:

In [424]: z['name']=x['name'][y['point_index']]

In [425]: z
Out[425]: 
array([(0, b'Bob', b'a', 0.1), (0, b'Bob', b'b', 0.2),
       (1, b'Bob', b'a', 0.3), (2, b'Sue', b'b', 0.4),
       (2, b'Sue', b'b', 0.5), (4, b'Jim', b'a', 0.6),
       (4, b'Jim', b'a', 0.7), (4, b'Jim', b'a', 0.8)], 
      dtype=[('point_index', '<i4'), ('name', 'S50'), ('type', 'S500'), ('value', '<f8')])

There are more general ways of matching x['point_index'] and y['point_index']. Just treat them like two numeric arrays that need matching (possibly with unique and a sort). Or even use a list comprehension and a linear search.
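One such general match, sketched with `np.searchsorted` under the assumption that x's keys are sorted but not necessarily contiguous (the keys below are hypothetical, chosen with gaps to show the non-contiguous case):

```python
import numpy as np

# Hypothetical sorted-but-gapped entity keys and their names.
x_keys = np.array([0, 1, 2, 5, 9])
x_names = np.array(['Bob', 'Bob', 'Sue', 'Sue', 'Jim'], dtype='S50')

# Value-side keys: 0 or more rows per entity, in key order.
y_keys = np.array([0, 0, 1, 2, 9, 9])

# For each y key, find its position in x_keys.
idx = np.searchsorted(x_keys, y_keys)
assert (x_keys[idx] == y_keys).all()  # every y key must exist in x

# The 'name' column for each y row, ready to append as before.
names = x_names[idx]
```

When the keys are contiguous `[0, 1, 2, ...]`, `idx` equals `y_keys` and this reduces to the simple indexing above.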

Or use the append_fields function from the answer you linked:

In [441]: import numpy.lib.recfunctions as nrec

In [442]: names=x['name'][y['point_index']]

In [443]: nrec.append_fields(y, 'name', names,
                         asrecarray=False, usemask=False)
Out[443]: 
array([(0, b'a', 0.1, b'Bob'), (0, b'b', 0.2, b'Bob'),
       (1, b'a', 0.3, b'Bob'), (2, b'b', 0.4, b'Sue'),
       (2, b'b', 0.5, b'Sue'), (4, b'a', 0.6, b'Jim'),
       (4, b'a', 0.7, b'Jim'), (4, b'a', 0.8, b'Jim')], 
      dtype=[('point_index', '<i4'), ('type', 'S500'), ('value', '<f8'), ('name', 'S50')])

append_fields does roughly what I wrote earlier: it creates an output array with a new dtype, then fills in the values from the base array and the new data. It uses recursive_fill_fields to copy the data, which for a simple dtype does the same copy-by-name:

    nrec.recursive_fill_fields(y, z)
hpaulj