1

I have two numpy arrays of shape arr1=(~140000, 3) and arr2=(~450000, 10). The first 3 elements of each row, for both the arrays, are coordinates (z,y,x). I want to find the rows of arr2 that have the same coordinates of arr1 (which can be considered a subgroup of arr2).

for example:

arr1 = [[1,2,3],[1,2,5],[1,7,8],[5,6,7]]

arr2 = [[1,2,3,7,66,4,3,44,8,9],[1,3,9,6,7,8,3,4,5,2],[1,5,8,68,7,8,13,4,53,2],[5,6,7,6,67,8,63,4,5,20], ...]

I want to find common coordinates (same first 3 elements):

list_arr = [[1,2,3,7,66,4,3,44,8,9], [5,6,7,6,67,8,63,4,5,20], ...]

At the moment I'm doing this double loop, which is extremely slow:

list_arr=[]
for i in arr1:
    for j in arr2:
        if i[0]==j[0] and i[1]==j[1] and i[2]==j[2]:
            list_arr.append (j)

I also tried to create (after the 1st loop) a subarray of arr2, filtering it on the value of i[0] (arr2_filt = [el for el in arr2 if el[0]==i[0]). This speed a bit the operation, but it still remains really slow.

Can you help me with this?

Sociopath
  • 13,068
  • 19
  • 47
  • 75
Sav
  • 142
  • 1
  • 17
  • 1
    Could you provide some sample input along with your expected outcome?! Makes it easier to help. – Cleb Nov 19 '18 at 06:38
  • I've added an example, you can check it. – Sav Nov 19 '18 at 07:44
  • You can do this quite the same like https://stackoverflow.com/a/53062049/4045774, but you are creating the tree with arr2[:,0:3] and check against arr1. This can be also beneficial if the points in arr2 are not exactly, but within a small tolerance the same. – max9111 Nov 19 '18 at 08:16

3 Answers3

3

Approach #1

Here's a vectorized one with views -

# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(),  b.view(void_dt).ravel()

a,b = view1D(arr1,arr2[:,:3])
out = arr2[np.in1d(b,a)]

Approach #2

Another with dimensionality-reduction for ints -

d = np.maximum(arr2[:,:3].max(0),arr1.max(0))
s = np.r_[1,d[:-1].cumprod()]
a,b = arr1.dot(s),arr2[:,:3].dot(s)
out = arr2[np.in1d(b,a)]

Improvement #1

We could use np.searchsorted to replace np.in1d for both of the approaches listed earlier -

unq_a = np.unique(a)
idx = np.searchsorted(unq_a,b)
idx[idx==len(a)] = 0
out = arr2[unq_a[idx] == b]

Improvement #2

For the last improvement on using np.searchsorted that also uses np.unique, we could use argsort instead -

sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)
idx[idx==len(a)] = 0
out = arr2[a[sidx[idx]]==b]
Divakar
  • 218,885
  • 19
  • 262
  • 358
1

You can do it with the help of set

arr = np.array([[1,2,3],[4,5,6],[7,8,9]])
arr2 = np.array([[7,8,9,11,14,34],[23,12,11,10,12,13],[1,2,3,4,5,6]])

# create array from arr2 with only first 3 columns
temp = [i[:3] for i in arr2]

aset = set([tuple(x) for x in arr])
bset = set([tuple(x) for x in temp])
np.array([x for x in aset & bset])

Output

array([[7, 8, 9],
       [1, 2, 3]])

Edit

Use list comprehension

l = [list(i) for i in arr2 if i[:3] in arr]

print(l)

Output:

[[7, 8, 9, 11, 14, 34], [1, 2, 3, 4, 5, 6]]
Sociopath
  • 13,068
  • 19
  • 47
  • 75
  • The output should be the row of arr2 (to append to a list) with all the other information, not only the coordinates. Since arr1 is a subgroup (of the coordinate of interest) of arr2, there is no point in doing this. – Sav Nov 19 '18 at 07:02
  • Thank you for the effort, but it doesn't work with this. It returns all the rows where it founds even 1 of the values present in "i" (and so the other 2 are wrong). – Sav Nov 19 '18 at 07:23
0

For integers Divakar already gave an excellent answer. If you want to compare floats you have to consider e.g. the following:

1.+1e-15==1.
False
1.+1e-16==1.
True

If this behaviour could lead to problems in your code I would recommend to perform a nearest neighbour search and probably check if the distances are within a specified threshold.

import numpy as np
from scipy import spatial
def get_indices_of_nearest_neighbours(arr1,arr2):
  tree=spatial.cKDTree(arr2[:,0:3])
  #You can check here if the distance is small enough and otherwise raise an error
  dist,ind=tree.query(arr1, k=1)
  return ind
max9111
  • 6,272
  • 1
  • 16
  • 33