I have two DataFrames in which columns are features and rows are items:

    import pandas as pd
    import numpy as np
    import random

    random.seed(0)
    data1 = {'x': random.sample(range(1, 100), 4), 'y': random.sample(range(1, 100), 4),
             'size': random.sample(range(1, 20), 4), 'weight': random.sample(range(1, 20), 4),
             'volume': random.sample(range(1, 50), 4)}
    data2 = {'x': random.sample(range(1, 100), 6), 'y': random.sample(range(1, 100), 6),
             'size': random.sample(range(1, 10), 6), 'weight': random.sample(range(1, 10), 6),
             'volume': random.sample(range(1, 20), 6)}
    df1 = pd.DataFrame(data1)
    df2 = pd.DataFrame(data2)

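First I need to create a mask. I only want distances between pairs of items for which size, weight, and volume are all strictly greater in df1 than in df2, i.e. (df1['size'] > df2['size']) & (df1['weight'] > df2['weight']) & (df1['volume'] > df2['volume']), evaluated for every pair of one df1 item and one df2 item. With df1 items as rows, this gives a (4, 6) boolean array here, as in the example further down.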
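To make the condition concrete, here is a minimal sketch of the mask built with broadcasting. The helper name `build_mask` and its `cols` parameter are mine, purely for illustration, and the values are hard-coded from the example tables in this question so the snippet runs on its own. It puts df1 items on the rows, so the result has shape (len(df1), len(df2)).

```python
import numpy as np
import pandas as pd

# Example values copied from the tables in this question.
df1 = pd.DataFrame({'x': [50, 98, 54, 6], 'y': [34, 66, 63, 52],
                    'size': [10, 16, 12, 7], 'weight': [17, 5, 10, 18],
                    'volume': [49, 7, 40, 17]})
df2 = pd.DataFrame({'x': [69, 91, 78, 19, 40, 13], 'y': [94, 10, 88, 43, 61, 72],
                    'size': [2, 6, 4, 3, 5, 9], 'weight': [9, 8, 4, 5, 3, 1],
                    'volume': [18, 1, 3, 13, 12, 14]})


def build_mask(df1, df2, cols=('size', 'weight', 'volume')):
    """Hypothetical helper: True where the df1 item exceeds the df2 item
    in every column listed in `cols`. Shape is (len(df1), len(df2))."""
    a = df1[list(cols)].to_numpy()  # shape (n1, k)
    b = df2[list(cols)].to_numpy()  # shape (n2, k)
    # (n1, 1, k) > (1, n2, k) broadcasts to (n1, n2, k);
    # all(axis=2) requires every feature to be strictly greater.
    return (a[:, None, :] > b[None, :, :]).all(axis=2)


mask = build_mask(df1, df2)
print(mask.astype(int))
```

The intermediate (n1, n2, k) array is the main memory cost, so for very large inputs the comparison could be done one column at a time and combined with `&` instead.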
Then I need to compute the Euclidean distance (on x and y) between items of df1 and items of df2 wherever the condition above is True. Where it is False, the distance does not need to be computed and the array can hold +Inf instead.
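One way I can picture this step is to compute distances only for the pairs where the mask is True and pre-fill everything else with +Inf. The sketch below is only an illustration under that reading: `masked_distances` is a name I made up, and the arrays are hard-coded from the example tables so it runs standalone.

```python
import numpy as np

# Coordinates (x, y) and features (size, weight, volume) from the example tables.
xy1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)
xy2 = np.array([[69, 94], [91, 10], [78, 88], [19, 43], [40, 61], [13, 72]],
               dtype=float)
f1 = np.array([[10, 17, 49], [16, 5, 7], [12, 10, 40], [7, 18, 17]])
f2 = np.array([[2, 9, 18], [6, 8, 1], [4, 4, 3], [3, 5, 13], [5, 3, 12], [9, 1, 14]])

# True where the df1 item is strictly greater in all three features.
mask = (f1[:, None, :] > f2[None, :, :]).all(axis=2)


def masked_distances(xy1, xy2, mask):
    """Hypothetical helper: Euclidean distance where mask is True, +inf elsewhere."""
    out = np.full(mask.shape, np.inf)
    i, j = np.nonzero(mask)            # row/column indices of the allowed pairs
    d = xy1[i] - xy2[j]                # shape (n_true, 2)
    out[i, j] = np.sqrt((d * d).sum(axis=1))
    return out


dist = masked_distances(xy1, xy2, mask)
print(np.round(dist, 2))
```

This does arithmetic only on the True pairs, at the cost of fancy indexing. I would expect it to pay off mainly when the mask is sparse, but I have not measured that.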
My intuition is to use NumPy broadcasting and np.einsum for the distances, because that should be the fastest approach. Runtime is the top priority.
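Roughly what I have in mind is sketched below: broadcast the coordinate differences, let np.einsum sum the squares, and use np.where to put +Inf where the mask is False. The function name `masked_distances_einsum` is hypothetical and the data is again hard-coded from the example tables. Note that this version computes every pairwise distance and then discards the masked ones, so it is an assumption on my part that it would be fast enough. I have not benchmarked it.

```python
import numpy as np

# Coordinates (x, y) and features (size, weight, volume) from the example tables.
xy1 = np.array([[50, 34], [98, 66], [54, 63], [6, 52]], dtype=float)
xy2 = np.array([[69, 94], [91, 10], [78, 88], [19, 43], [40, 61], [13, 72]],
               dtype=float)
f1 = np.array([[10, 17, 49], [16, 5, 7], [12, 10, 40], [7, 18, 17]])
f2 = np.array([[2, 9, 18], [6, 8, 1], [4, 4, 3], [3, 5, 13], [5, 3, 12], [9, 1, 14]])

mask = (f1[:, None, :] > f2[None, :, :]).all(axis=2)   # shape (4, 6)


def masked_distances_einsum(xy1, xy2, mask):
    """Hypothetical helper: full pairwise distances via einsum, then masked."""
    diff = xy1[:, None, :] - xy2[None, :, :]       # shape (n1, n2, 2)
    sq = np.einsum('ijk,ijk->ij', diff, diff)      # squared distances, (n1, n2)
    return np.where(mask, np.sqrt(sq), np.inf)


dist = masked_distances_einsum(xy1, xy2, mask)
print(np.round(dist, 2))
```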
Thanks for your time and help.
Example: df1 =

     x   y  size  weight  volume
    50  34    10      17      49
    98  66    16       5       7
    54  63    12      10      40
     6  52     7      18      17

df2 =

     x   y  size  weight  volume
    69  94     2       9      18
    91  10     6       8       1
    78  88     4       4       3
    19  43     3       5      13
    40  61     5       3      12
    13  72     9       1      14

The first step (that does not have to be explicit) is to build the mask based on size, weight, and volume being greater in df1:

           df2.0  df2.1  df2.2  df2.3  df2.4  df2.5
    df1.0      1      1      1      1      1      1
    df1.1      0      0      1      0      0      0
    df1.2      1      1      1      1      1      1
    df1.3      0      1      1      1      1      0

The expected final result is then:

           df2.0  df2.1  df2.2  df2.3  df2.4  df2.5
    df1.0  62.94  47.51  60.83  32.28  28.79  53.04
    df1.1    Inf    Inf  29.73    Inf    Inf    Inf
    df1.2  34.44  64.64  34.66  40.31  14.14  41.98
    df1.3    Inf  94.81  80.50  15.81  35.17    Inf