I have two 2D numpy arrays shaped:

(19133L, 12L)
(248L, 6L)

In each case, the first 3 fields form an identifier.

I want to reduce the larger matrix so that it only contains rows with identifiers that also exist in the second matrix. So the shape should be (248L, 12L). How can I do this?

I would then like to sort the rows by the first, second and third values, in that order, so that (3 3 4) comes before (3 3 5), etc. Is there a multi-field sort function?

Edit:

I have tried pandas:

df1 = DataFrame(arr1.astype(str))
df2 = DataFrame(arr2.astype(str))

df1.set_index([0,1,2])
df2.set_index([0,1,2])

out = merge(df1,df2,how="inner") 
print(out.shape)

But this results in a shape of (0, 13).

smci
user2290362
  • Just to clarify: wouldn't your reduced matrix still be of size `(248L, 12L)`, since you're only taking a subset of rows, not also a subset of columns? – Yannick Meeus Mar 16 '15 at 14:44
  • Yes, typo, thank you - will correct – user2290362 Mar 16 '15 at 14:45
  • Is there any order in the smaller array? Otherwise this could get rather costly. Maybe you should instead create a set holding the key-3-tuples. – tobias_k Mar 16 '15 at 14:53
  • No, just use `DataFrame.set_index()`, which allows multiple keys and is efficient. Don't perform contortions in native numpy. pandas is the answer. – smci Mar 16 '15 at 15:00

1 Answer

Use pandas.

DataFrame.set_index() accepts a list of keys, so set the index to the first three columns. Pass drop=False to keep those columns in the dataframe, and inplace=True to modify the dataframe directly instead of returning a copy.

Then use merge(..., how='inner') to intersect your dataframes on those keys.

In general, numpy runs out of steam very quickly for arbitrary dataframe-style manipulations, so pandas should be your default choice for this kind of task. It often performs better for joins like this as well.
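As a rough illustration of the approach, here is a minimal sketch. The small arrays and the names `keys`, `ids` and `result` are made up for the example. Note that it passes the key columns to merge via `on=` rather than setting an index first, and it merges against only the identifier columns of the smaller array so that no extra columns are added. The final sort_values call covers the multi-field sort asked about in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two arrays in the question (much smaller).
arr1 = np.array([
    [3, 3, 5, 10, 11, 12],
    [1, 2, 3, 20, 21, 22],
    [3, 3, 4, 30, 31, 32],
    [9, 9, 9, 40, 41, 42],
])
arr2 = np.array([
    [3, 3, 4, 0, 0],
    [3, 3, 5, 0, 0],
    [1, 2, 3, 0, 0],
])

keys = [0, 1, 2]  # the first three columns form the identifier

df1 = pd.DataFrame(arr1)
# Keep only the identifier columns of the smaller array, without duplicates,
# so the merge cannot add columns or multiply rows.
ids = pd.DataFrame(arr2)[keys].drop_duplicates()

# Inner merge on the identifier columns keeps only the rows of df1 whose
# identifier also appears in arr2.
out = df1.merge(ids, on=keys, how="inner")

# Multi-field sort: by column 0, then column 1, then column 2 (ascending).
out = out.sort_values(by=keys).reset_index(drop=True)

result = out.to_numpy()
print(result.shape)
```

The result has one row per matching row of the larger array, so it only comes out as (248, 12) if every identifier in the smaller array occurs exactly once in the larger one. If the arrays are converted with astype(str) as in the question, both must be converted the same way so that the key values compare equal.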

smci