1

I have a Pandas DataFrame, where columns X1, Y1 have point coordinates for the first group of coordinates and columns X2, Y2 have point coordinates for the second group of coordinates. Both groups are independent of each other. It is just happen to be they are in the same dataframe. Example:

X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521

For each point in columns X1, Y1 I need to find a point in column X2, Y2 such that the Euclidean distance between these two points is the shortest.

As an outcome I need to place that found point from columns X2, Y2 in the same row as the corresponding point in X1, Y1. Also I need to augment to the same row the computed shortest Euclidean distance in another column D. Then repeat this process for each point in columns X1, Y1.

One way to do this is to iterate rows in columns X1, Y1, and for each row find shortest Euclidean distance in columns X2, Y2. There are may be better ways to do it without writing for loops.

Stas Buzuluk
  • 794
  • 9
  • 19
Dimon
  • 436
  • 5
  • 15
  • the description sounds like this is an assignment... You should show at least an attempt! – JHBonarius Sep 22 '20 at 19:17
  • It would be great if you can mark an answer as correct so that visitors will see that it's working. If you still need some clarifications - feel free to ask. – Stas Buzuluk Sep 22 '20 at 19:35
  • @JHBonarius. I posted my solution below. This is not an assignment. I just need to show how one distribution is close to or different from another one. – Dimon Sep 22 '20 at 21:07
  • @Stas Buzuluk. There are 3 solutions so far and all seems to be correct. So I don't know which one to mark. – Dimon Sep 22 '20 at 21:08
  • @Dimon marking answer is needed for other people so that they can find the most helpful answer first and be sure, that it's correct, you usually want to do that for every question you ask. So pick the most helpful. You may also mark your answer to get some reputation, it's totally fine, especially if you can use information from other answers. – Stas Buzuluk Sep 23 '20 at 07:30
  • That is a typical example for a kdtree. https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html This will be much faster than any method which calculates all distances and searches for the minimal distance. – max9111 Sep 23 '20 at 07:43

3 Answers3

1

Solution


Use Faiss.

pip install faiss

Instead of IndexFlatL2 you can use slightly faster IndexIVFFlat that allows you to approximate results.

import faiss
def get_closest(df: pd.DataFrame)->pd.DataFrame:
    d = 2 #  dimensionality

    xb = np.float32(df[["X2","Y2"]].values)
    xb = np.ascontiguousarray(xb)
    
    xq = np.float32(df[["X1","Y1"]].values)
    xq = np.ascontiguousarray(xq)

    index = faiss.IndexFlatL2(d) #  build the index
    index.add(xb)                #  add vectors to the index
    
    D, I = index.search(xq, 1)     # actual search
    
    res_df = df[["X1","Y1"]]
    res_df[["X2","Y2"]] = df[["X2","Y2"]].iloc[I[:,0]].reset_index(drop = True)
    res_df["distance"] = D[:,0]
    return res_df

get_closest(df)

Performance


For 1e4 (x,y) pairs in both sets - running time:

371 ms ± 58.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 vectors

33.9 s ± 3.55 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That should be similar to generating full distances matrix, using scipy, or NumPy, but it's much more efficient in terms of memory usage, and do not require a further search on this matrix.

Notice


  1. In the function above - for res_df I'm setting it to be a slice of df that's not recommended since changes you are making in res_df will affect df. That's made for lower memory usage, if you want to avoid unpredictable behaviour you can make a copy.
  2. In case if you need more than 1 neighbour for every point - it's very easy to achieve with faiss with minimum modifications.

Alternatives


Using KDTree

import pandas as pd
from scipy.spatial import KDTree
def get_closest(df: pd.DataFrame)->pd.DataFrame:
    tree = KDTree(df[["X1", "Y1"]].values) 
    dist, ind = tree.query(df[["X2", "Y2"]].values, k=1) # k desired number of neighbors 
    res_df = df[["X1","Y1"]]
    res_df[["X2","Y2"]] = df[["X2","Y2"]].iloc[ind].reset_index(drop = True)
    res_df["distance"] = dist
    return res_df
get_closest(df)

For 1e4 (x,y) pairs in both sets - running time:

1.43 s ± 55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x,y) pairs in both sets - running time:

17 s ± 767 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using cdist, proposed by @Dimon

df[['X2','Y2']] = \
  df[['X2','Y2']].iloc[np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']],
  metric='euclidean' ), axis=1),:].copy().reset_index(drop=True)
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)

For 1e4 (x,y) pairs in both sets - running time:

543 ms ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x,y) pairs in both sets - running time:

MemoryError: Unable to allocate 74.5 GiB for an array with shape (100000, 100000) and data type float64

Using numpy, as proposed by @Valdi_Bo

diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
    - df.iloc[:, :2].values[:, np.newaxis, :]
diffs2 = (diffs ** 2).sum(axis=2)
result = pd.Series(np.sqrt(diffs2.min(axis=0)), name='minDist')
diffs2.argmin(axis=0)

For 1e4 (x,y) pairs in both sets - running time:

1.6 s ± 82.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For 1e5 (x,y) pairs in both sets - running time:

MemoryError: Unable to allocate 149. GiB for an array with shape (100000, 100000, 2) and data type float64
Stas Buzuluk
  • 794
  • 9
  • 19
  • Maybe something could be done with scipy.spatial.distance_matrix: https://stackoverflow.com/questions/58768373/fastest-way-to-calculate-the-shortest-euclidean-distance-between-points-in-pa so to eliminate that 'for' loop? – Dimon Sep 22 '20 at 17:57
  • The for loop used here for DataFrame construction, not calculating distances. It should be pretty fast and definitely more memory eficient, I've used a similar approach for 1e6 vectors in both sets. – Stas Buzuluk Sep 22 '20 at 18:21
  • I've updated answer so that now there's no loop there. – Stas Buzuluk Sep 22 '20 at 18:42
1

I will show you how to compute the result based solely on Numpy.

The first step is to compute differences along each coordinate, between each "X1 / Y1" point and each "X2 / Y2" point:

diffs = df.iloc[:, 2:].values[np.newaxis, :, :]\
    - df.iloc[:, :2].values[:, np.newaxis, :]

Then compute squares of these differences and sum them (for each pair of points):

diffs2 = (diffs ** 2).sum(axis=2)

And the last step is to compute the result by:

  • finding minimal square distance from each "X2 / Y2" point,
  • compute the root from it (for each point),
  • convert to a Series.

The code to do it is:

result = pd.Series(np.sqrt(diffs2.min(axis=0)), name='minDist')

Additionally, if you want to know which "X1 / Y1" point is the closest to the given "X2 / Y2" point, run:

diffs2.argmin(axis=0)

For your data it is:

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  3,
        6,  7,  9, 10, 11, 12, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14],
      dtype=int64)
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41
1

Here is alternative solution based on cdist:

from io import StringIO
import pandas as pd
import numpy as np
from scipy.spatial.distance import cdist

df = \
'''
X1,Y1,X2,Y2
41246.438,0.49,38791.673,0.49
41304.5,0.491,38921.557,0.491
41392.062,0.492,39037.135,0.492
41515.5,0.493,39199.972,0.493
41636.062,0.494,39346.561,0.494
41795.188,0.495,39477.63,0.495
42027.75,0.496,39576.275,0.496
42252.25,0.497,39732.102,0.497
42486.812,0.498,39833.753,0.498
42739.062,0.499,39949.13,0.499
43012.125,0.5,40135.42,0.5
43472.75,0.5,40292.017,0.5
43909.562,0.501,40479.452,0.501
44312.625,0.502,40725.329,0.502
44799.938,0.503,40950.05,0.503
45294.938,0.504,41214.136,0.504
45729.625,0.505,41514.213,0.505
45942.438,0.506,41943.208,0.506
46067.688,0.507,42296.643,0.507
46215,0.508,42653.477,0.508
46336.75,0.509,43138.834,0.509
46476.562,0.51,43557.815,0.51
46584.25,0.511,43966.564,0.511
46654.75,0.512,44166.996,0.512
46707.75,0.513,44310.557,0.513
46774.188,0.514,44410.069,0.514
46832.062,0.515,44518.045,0.515
46905.062,0.516,44608.646,0.516
46976.562,0.517,44678.073,0.517
47077.938,0.518,44727.393,0.518
47215.688,0.519,44786.498,0.519
47290.625,0.52,44845.867,0.52
47351.5,0.521,44915.072,0.521
'''

df = pd.read_csv(StringIO(df), sep=",")
print(df)

df[['X2','Y2']] = \
  df[['X2','Y2']].iloc[np.argmin(cdist(df[['X1','Y1']], df[['X2','Y2']],
  metric='euclidean' ), axis=1),:].copy().reset_index(drop=True)
df['D'] = np.linalg.norm(df[['X1','Y1']].values - df[['X2','Y2']].values, axis=1)
print(df)
Dimon
  • 436
  • 5
  • 15