Compute distances between 2 dataframes based on boolean matrix as a mask

Question

I have 2 dataframes where columns are features and rows are different items.

import pandas as pd
import numpy as np
import random

random.seed(0)
data1 = {'x': random.sample(range(1, 100), 4), 'y': random.sample(range(1, 100), 4),
         'size': random.sample(range(1, 20), 4), 'weight': random.sample(range(1, 20), 4),
         'volume': random.sample(range(1, 50), 4)}

data2 = {'x': random.sample(range(1, 100), 6), 'y': random.sample(range(1, 100), 6),
         'size': random.sample(range(1, 10), 6), 'weight': random.sample(range(1, 10), 6),
         'volume': random.sample(range(1, 20), 6)}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

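Here I need to create a mask. I will compute distances only between pairs of items for which (df1['size'] > df2['size']) & (df1['weight'] > df2['weight']) & (df1['volume'] > df2['volume']), evaluated for every df1 item against every df2 item. Here this gives a (4, 6) boolean array, with one row per df1 item and one column per df2 item.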
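One possible way to build such a pairwise mask is NumPy broadcasting. This is only a sketch: the arrays f1 and f2 are hypothetical names, and their values are copied by hand from the size, weight, and volume columns of the example tables in this question. With the real DataFrames, something like df1[['size', 'weight', 'volume']].to_numpy() should give the same arrays.

```python
import numpy as np

# Values copied from the example tables in the question
# (columns: size, weight, volume). f1 and f2 are illustrative names.
f1 = np.array([[10, 17, 49],
               [16,  5,  7],
               [12, 10, 40],
               [ 7, 18, 17]])   # df1, 4 items
f2 = np.array([[2, 9, 18],
               [6, 8,  1],
               [4, 4,  3],
               [3, 5, 13],
               [5, 3, 12],
               [9, 1, 14]])     # df2, 6 items

# f1[:, None, :] has shape (4, 1, 3) and f2[None, :, :] has shape (1, 6, 3).
# The comparison broadcasts to (4, 6, 3): one test per pair and per feature.
# .all(axis=-1) keeps a pair only if all three features are strictly greater.
mask = (f1[:, None, :] > f2[None, :, :]).all(axis=-1)

print(mask.astype(int))
```

Rows of mask follow df1 and columns follow df2, the same orientation as the example tables.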

Then, I need to compute the Euclidean distance between items of df1 and items of df2 where the condition above is True. For the False cases, no need to compute the distance and +Inf can be put instead in the array.
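As a baseline, the simplest version I can think of computes every pairwise distance and then overwrites the False cells with +Inf. This is a sketch under assumptions: xy1, xy2, and mask are hypothetical names, and their values are typed in from the example tables in this question. It does more work than strictly needed, since the distances for False pairs are computed and then discarded.

```python
import numpy as np

# (x, y) coordinates copied from the example tables; names are illustrative.
xy1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)   # df1
xy2 = np.array([[69, 94], [91, 10], [78, 88],
                [19, 43], [40, 61], [13, 72]], dtype=float)            # df2

# Boolean mask copied from the example (rows: df1 items, columns: df2 items).
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [0, 1, 1, 1, 1, 0]], dtype=bool)

# Pairwise coordinate differences, shape (4, 6, 2).
diff = xy1[:, None, :] - xy2[None, :, :]

# Euclidean distance for every pair, shape (4, 6).
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Keep the distance where the mask is True, +Inf elsewhere.
result = np.where(mask, dist, np.inf)

print(np.round(result, 2))
```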

My intuition is to use NumPy broadcasting and np.einsum for the distance, since that should be the fastest. Runtime is the top priority.
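A sketch of how np.einsum could be used so that distances are computed only for the pairs where the mask is True. The names xy1, xy2, and mask are hypothetical and the values are copied from the example tables in this question. Whether this is actually faster than computing the full distance matrix probably depends on the array sizes and on how many mask entries are True; it has not been benchmarked here.

```python
import numpy as np

# (x, y) coordinates copied from the example tables; names are illustrative.
xy1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)   # df1
xy2 = np.array([[69, 94], [91, 10], [78, 88],
                [19, 43], [40, 61], [13, 72]], dtype=float)            # df2

# Boolean mask copied from the example (rows: df1 items, columns: df2 items).
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [0, 0, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 1],
                 [0, 1, 1, 1, 1, 0]], dtype=bool)

# Row and column indices of the pairs for which the mask is True.
i, j = np.nonzero(mask)

# Coordinate differences for the selected pairs only, shape (n_true, 2).
diff = xy1[i] - xy2[j]

# Row-wise dot product of diff with itself: squared Euclidean distances.
sq = np.einsum('ij,ij->i', diff, diff)

# Start from +Inf everywhere and fill in the computed distances.
result = np.full(mask.shape, np.inf)
result[i, j] = np.sqrt(sq)

print(np.round(result, 2))
```

The fancy indexing xy1[i] and xy2[j] makes copies, so memory use grows with the number of True pairs rather than with the full number of pairs.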

Thanks for your time and help.

Example: df1 =

x   y   size    weight  volume
50  34  10      17      49
98  66  16       5       7
54  63  12      10      40
 6  52   7      18      17

df2 =

x   y   size    weight  volume
69  94  2       9       18
91  10  6       8        1
78  88  4       4        3
19  43  3       5       13
40  61  5       3       12
13  72  9       1       14

The first step (that does not have to be explicit) is to build the mask based on size, weight, and volume being greater in df1:

      df2.0   df2.1   df2.2   df2.3   df2.4   df2.5
df1.0     1       1       1       1       1       1
df1.1     0       0       1       0       0       0
df1.2     1       1       1       1       1       1
df1.3     0       1       1       1       1       0

The final result expected is then:

      df2.0   df2.1   df2.2   df2.3   df2.4   df2.5
df1.0 62.94   47.51   60.83   32.28   28.79   53.04
df1.1   Inf     Inf   29.73     Inf     Inf     Inf
df1.2 34.44   64.64   34.66   40.31   14.14   41.98
df1.3   Inf   94.81   80.50   15.81   35.17     Inf

score 0 · Answer 1 · answered Nov 18 '20 at 02:15

Is this what you are looking for?

# Pad df1 with NaN rows so it has the same index as df2
# (DataFrame.append was removed in pandas 2.0, so reindex is used instead)
df1 = df1.reindex(range(len(df2)))
# Row-by-row condition; comparisons against the NaN padding rows evaluate to False
cond = (df1['size'] > df2['size']) & (df1['weight'] > df2['weight']) & (df1['volume'] > df2['volume'])
dft1 = df1[cond]
dft2 = df2[cond]
# Frobenius norm of the difference over all columns of the selected rows
print(np.linalg.norm(dft1 - dft2))

output

90.39358384310249

thanks. not exactly what I envisioned. I will edit my question. — user2590177, Nov 24 '20 at 01:04
