Numpy arrays with compound keys; find subset in both

Question

I have two 2D numpy arrays shaped:

(19133L, 12L)
(248L, 6L)

In each case, the first 3 fields form an identifier.

I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?

I would then like to sort it by the first, second and third values in turn, so that (3 3 4) comes before (3 3 5), etc. Is there a multi-field sort function?

Edit:

I have tried pandas:

df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))

df1.set_index([0,1,2])
df2.set_index([0,1,2])

out = merge(df1,df2,how="inner") 
print(out.shape)

But this results in a shape of (0, 13).

Just to clarify: wouldn't your reduced matrix still be of size `(248L, 12L)`, as you're only getting a subset of rows, not also a subset of columns? – Yannick Meeus, Mar 16 '15 at 14:44
Is there any order in the smaller array? Otherwise this could get rather costly. Maybe you should instead create a set holding the key 3-tuples. – tobias_k, Mar 16 '15 at 14:53
No, just use `DataFrame.set_index()`, which allows multiple keys and is efficient. Don't perform contortions in native numpy; pandas is the answer. – smci, Mar 16 '15 at 15:00

smci · Accepted Answer · 2015-03-18T08:48:10.357

Use pandas.

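DataFrame.set_index() accepts multiple keys, so set the index to the first three columns. Pass drop=False to keep those columns in the frame, and either assign the result or pass inplace=True: by default set_index() returns a new dataframe and leaves the original unchanged.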

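Then use merge(..., how='inner') to intersect your dataframes. As noted in the comments below, the merge also needed on=[0,1,2] so that it joins on the identifier columns only.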
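A minimal sketch of the inner merge described above, using small made-up arrays in place of the real data (the contents of `arr1` and `arr2` and the name `key` are illustrative assumptions). It merges directly on the three identifier columns via `on=` rather than setting an index first, and takes only the key columns from the second frame so that the result keeps the width of the first.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the first three columns form the identifier.
arr1 = np.array([
    [1, 1, 1, 10, 11, 12],
    [1, 2, 3, 20, 21, 22],
    [3, 3, 5, 30, 31, 32],
    [3, 3, 4, 40, 41, 42],
    [9, 9, 9, 50, 51, 52],
])
arr2 = np.array([
    [3, 3, 4, 7],
    [1, 2, 3, 8],
    [3, 3, 5, 9],
])

df1 = pd.DataFrame(arr1)
df2 = pd.DataFrame(arr2)

key = [0, 1, 2]  # labels of the identifier columns

# Join on the identifier columns only. Without `on`, merge would join on
# every column label the two frames have in common, not just the identifier.
# Selecting df2[key] keeps df2's other fields out of the result.
out = df1.merge(df2[key].drop_duplicates(), on=key, how="inner")

print(out.shape)
```

With the real arrays, the same call would be expected to leave only the rows of the larger array whose identifier also appears in the smaller one.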

In general, numpy runs out of steam very quickly for arbitrary dataframe manipulations; your default should be to try pandas first. It is often more performant for this kind of keyed operation as well.
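For the multi-field sort that the question also asks about, here is a small sketch showing one way to do it in pandas, with the NumPy equivalent alongside for comparison. The array contents and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; columns 0, 1 and 2 are the compound key.
arr = np.array([
    [3, 3, 5, 30],
    [1, 2, 3, 20],
    [3, 3, 4, 40],
])

# pandas: list the key columns in priority order.
df_sorted = pd.DataFrame(arr).sort_values(by=[0, 1, 2]).reset_index(drop=True)

# NumPy: np.lexsort uses the LAST key as the primary sort key,
# so the columns are passed in reverse priority order.
order = np.lexsort((arr[:, 2], arr[:, 1], arr[:, 0]))
arr_sorted = arr[order]

print(df_sorted)
print(arr_sorted)
```

Both variants order the rows by the first column, then the second, then the third, so (3 3 4) lands before (3 3 5).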

edited Mar 18 '15 at 08:48

answered Mar 16 '15 at 14:52


@user2290362: please post reproducible code using a random seed to create data. – smci Mar 16 '15 at 22:51

I have fixed that, it needed on=[0,1,2] in the merge – user2290362 Mar 17 '15 at 09:31
