How can I speed up my 3D Euclidean distance matrix code

Question

I have written code to calculate the distance between all objects (tagID) based on their x, y, z coordinates (TX, TY, TZ) at each time step (Frame). While this code works, it is too slow for what I need. My current test data has about 538,792 rows; my actual data will have about 6,880,000 rows. Currently it takes a few minutes (maybe 10-15) to build these distance matrices, and since I will have 40 sets of data, I would like to speed things up.

The current code is as follows:

# Sample data frame with correct columns:

import numpy as np
import pandas as pd

data2 = ({'Frame' :[1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7], 
      'tagID' : ['nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3'],
      'TX':[5,2,3,4,5,6,7,5,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TY':[4,2,3,4,5,9,3,2,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TZ':[2,3,4,6,7,8,4,3,np.nan,5,2,3,4,5,6,7,5,4,8,3,2]})

df = pd.DataFrame(data2)

    Frame tagID   TX   TY   TZ
0       1   nb1  5.0  4.0  2.0
1       1   nb2  2.0  2.0  3.0
2       1   nb3  3.0  3.0  4.0
3       2   nb1  4.0  4.0  6.0
4       2   nb2  5.0  5.0  7.0
5       2   nb3  6.0  9.0  8.0
6       3   nb1  7.0  3.0  4.0
7       3   nb2  5.0  2.0  3.0
8       3   nb3  NaN  NaN  NaN
9       4   nb1  5.0  5.0  5.0
10      4   nb2  2.0  2.0  2.0
11      4   nb3  3.0  3.0  3.0
12      5   nb1  4.0  4.0  4.0
13      5   nb2  5.0  5.0  5.0
14      5   nb3  6.0  6.0  6.0
15      6   nb1  7.0  7.0  7.0
16      6   nb2  5.0  5.0  5.0
17      6   nb3  4.0  4.0  4.0
18      7   nb1  8.0  8.0  8.0
19      7   nb2  3.0  3.0  3.0
20      7   nb3  2.0  2.0  2.0


# Calculate the squared distance between all x points:

TXdf = [] 
for i in range(1,df['Frame'].max()+1):
    boox = df['Frame'] == i 
    tempx = df[boox] 
    tx=tempx['TX'].apply(lambda x : (tempx['TX']-x)**2) 
    tx.columns=tempx.tagID   
    tx['ID']=tempx.tagID 
    tx['Frame'] = tempx.Frame 
    TXdf.append(tx) 
TXdfFinal = pd.concat(TXdf)  # combine the per-frame DataFrames into one
print(TXdfFinal)
TXdfFinal.info()

# Calculate the squared distance between all y points:

print('y-diff sum')
TYdf = [] 
for i in range(1,df['Frame'].max()+1):
    booy = df['Frame'] == i 
    tempy = df[booy] 
    ty=tempy['TY'].apply(lambda x : (tempy['TY']-x)**2) 
    ty.columns=tempy.tagID   
    ty['ID']=tempy.tagID 
    ty['Frame'] = tempy.Frame 
    TYdf.append(ty) 
TYdfFinal = pd.concat(TYdf) 
print(TYdfFinal)
TYdfFinal.info()

# Calculate the squared distance between all z points:

print('z-diff sum')
TZdf = [] 
for i in range(1,df['Frame'].max()+1):
    booz = df['Frame'] == i 
    tempz = df[booz] 
    tz=tempz['TZ'].apply(lambda x : (tempz['TZ']-x)**2) 
    tz.columns=tempz.tagID  
    tz['ID']=tempz.tagID 
    tz['Frame'] = tempz.Frame 
    TZdf.append(tz) 
TZdfFinal = pd.concat(TZdf)


# Add all squared differences together:

euSum = TXdfFinal + TYdfFinal + TZdfFinal

# Square root the sum of the differences of each coordinate for Euclidean distance and add Frame and ID columns back on:

euDist = euSum.loc[:, euSum.columns !='ID'].apply(lambda x: x**0.5)
euDist['tagID'] = list(TXdfFinal['ID'])
euDist['Frame'] = list(TXdfFinal['Frame'])


# Add the distance matrix to the original dataframe based on Frame and ID columns:

new_df = pd.merge(df, euDist,  how='left', left_on=['Frame','tagID'], right_on = ['Frame','tagID'])

    Frame tagID   TX   TY   TZ      nb1     nb2      nb3
0       1   nb1  5.0  4.0  2.0   0.0000  3.7417   3.0000
1       1   nb2  2.0  2.0  3.0   3.7417  0.0000   1.7321
2       1   nb3  3.0  3.0  4.0   3.0000  1.7321   0.0000
3       2   nb1  4.0  4.0  6.0   0.0000  1.7321   5.7446
4       2   nb2  5.0  5.0  7.0   1.7321  0.0000   4.2426
5       2   nb3  6.0  9.0  8.0   5.7446  4.2426   0.0000
6       3   nb1  7.0  3.0  4.0   0.0000  2.4495      NaN
7       3   nb2  5.0  2.0  3.0   2.4495  0.0000      NaN
8       3   nb3  NaN  NaN  NaN      NaN     NaN      NaN
9       4   nb1  5.0  5.0  5.0   0.0000  5.1962   3.4641
10      4   nb2  2.0  2.0  2.0   5.1962  0.0000   1.7321
11      4   nb3  3.0  3.0  3.0   3.4641  1.7321   0.0000
12      5   nb1  4.0  4.0  4.0   0.0000  1.7321   3.4641
13      5   nb2  5.0  5.0  5.0   1.7321  0.0000   1.7321
14      5   nb3  6.0  6.0  6.0   3.4641  1.7321   0.0000
15      6   nb1  7.0  7.0  7.0   0.0000  3.4641   5.1962
16      6   nb2  5.0  5.0  5.0   3.4641  0.0000   1.7321
17      6   nb3  4.0  4.0  4.0   5.1962  1.7321   0.0000
18      7   nb1  8.0  8.0  8.0   0.0000  8.6603  10.3923
19      7   nb2  3.0  3.0  3.0   8.6603  0.0000   1.7321
20      7   nb3  2.0  2.0  2.0  10.3923  1.7321   0.0000

I have tried using both euclidean() and pdist() with metric='euclidean', but I can't get the iteration over frames right.

Any advice on how to get the same result but a lot faster would be greatly appreciated.

Check https://stackoverflow.com/questions/47782104/compute-euclidean-distance-between-rows-of-two-pandas-dataframes/47782154#47782154? – BENY May 20 '19 at 21:12
The code I used to get the above is as follows: `ary = scipy.spatial.distance.cdist(df.iloc[:,2:5], df.iloc[:,2:5], metric='euclidean'); pd.DataFrame(ary)` – VacciniumC May 20 '19 at 21:30
When I use `eudist = scipy.spatial.distance.cdist(df.iloc[:,2:5], df.iloc[:,2:5], metric='euclidean')` as suggested, I do get a distance matrix, but it has no Frame or ID and it is a 21x21 matrix rather than the 21x3 matrix I need. With my full data it would be a 6880000x6880000 matrix. – VacciniumC May 20 '19 at 21:45
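
To put the size concern in the last comment into numbers, here is a rough back-of-the-envelope sketch. It assumes float64 storage (8 bytes per value), and `n_tags = 3` is a hypothetical tags-per-frame count taken from the sample data, not from the real data set. Under those assumptions the all-rows-against-all-rows matrix would need roughly 379 TB, while the per-frame layout would need roughly 165 MB.

```python
# Rough memory estimate; all values other than n_rows are assumptions.
n_rows = 6_880_000      # row count quoted in the question
bytes_per_float = 8     # assumes float64 storage
n_tags = 3              # hypothetical: tags per frame, as in the sample data

# Full matrix: every row against every other row.
full_bytes = n_rows * n_rows * bytes_per_float

# Per-frame layout: one distance column per tag for each row.
per_frame_bytes = n_rows * n_tags * bytes_per_float

print(full_bytes / 1e12, "TB for the full matrix")
print(per_frame_bytes / 1e6, "MB for the per-frame layout")
```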

score 1 · Accepted Answer · answered May 20 '19 at 21:54


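A method using SciPy's cdist, applied once per Frame:

import numpy as np
from scipy.spatial import distance

# One cdist call per Frame; assumes each Frame lists nb1, nb2, nb3 in order.
df['nb1'], df['nb2'], df['nb3'] = np.concatenate(
    [distance.cdist(y, y, metric='euclidean')
     for x, y in df[['TX', 'TY', 'TZ']].groupby(df['Frame'])]
).T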
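
The line above hardcodes the three column names and relies on every Frame holding the same tags in the same order. If that does not hold for the full data, one possible variation is to label each per-frame result with its own tagID values and let pandas align the columns. This is only a sketch: `frame_distances` is a hypothetical helper name, the small DataFrame is cut down from the question's sample, and it assumes a tagID appears at most once per Frame.

```python
import pandas as pd
from scipy.spatial import distance

# Small sample in the same layout as the question (two frames, three tags).
df = pd.DataFrame({
    'Frame': [1, 1, 1, 2, 2, 2],
    'tagID': ['nb1', 'nb2', 'nb3', 'nb1', 'nb2', 'nb3'],
    'TX': [5, 2, 3, 4, 5, 6],
    'TY': [4, 2, 3, 4, 5, 9],
    'TZ': [2, 3, 4, 6, 7, 8],
})

def frame_distances(group):
    """Pairwise Euclidean distances within one Frame, labelled by tagID."""
    coords = group[['TX', 'TY', 'TZ']].to_numpy(dtype=float)
    d = distance.cdist(coords, coords, metric='euclidean')
    # Keep the original row index so the result lines up with df.
    return pd.DataFrame(d, index=group.index,
                        columns=group['tagID'].to_numpy())

# concat aligns columns by tag name; a tag missing from a Frame becomes NaN.
dist = pd.concat([frame_distances(g) for _, g in df.groupby('Frame')])

# Join on the row index to attach the distance columns to the original data.
new_df = df.join(dist)
print(new_df)
```

Because the join is on the row index, the distance columns stay attached to the right Frame and tagID without a separate merge step.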


BENY



This worked like a charm. It is much faster, and with the following code I get a df that I can add back to the original: `from scipy.spatial import distance; di = df['nb1'], df['nb2'], df['nb3'] = np.concatenate([distance.cdist(y, y, metric='euclidean') for x, y in df[['TX','TY','TZ']].groupby(df['Frame'])]).T; di = pd.DataFrame(di).T; di.rename(columns={0: 'nb1', 1: 'nb2', 2: 'nb3'}, inplace=True); di['Frame'] = df['Frame']; di['tagID'] = df['tagID']` – VacciniumC May 20 '19 at 22:47
Shouldn't it be `distance.cdist(x, y, metric='euclidean')`? That is, `x, y` instead of `y, y`. – rp1 Jun 27 '19 at 18:20
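
On the `x, y` question in the comment above: when you iterate over a pandas groupby, each item is a (key, group) pair. In the answer's comprehension `x` is therefore the Frame value and `y` is the sub-DataFrame of coordinates for that Frame, which would explain why `cdist(y, y)` is used. A small sketch with made-up data that shows what the loop variables hold (the `keys` and `shapes` lists are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Frame': [1, 1, 2, 2],
    'TX': [5.0, 2.0, 4.0, 5.0],
    'TY': [4.0, 2.0, 4.0, 5.0],
    'TZ': [2.0, 3.0, 6.0, 7.0],
})

keys = []
shapes = []
for x, y in df[['TX', 'TY', 'TZ']].groupby(df['Frame']):
    # x is the group key (the Frame value), not coordinate data.
    keys.append(int(x))
    # y is the DataFrame of TX, TY, TZ rows belonging to that Frame.
    shapes.append(y.shape)

print(keys)    # [1, 2]
print(shapes)  # [(2, 3), (2, 3)]
```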

score 0 · Answer 2 · answered May 20 '19 at 21:31


You could try cutting down the number of for loops from 3 to 1. It looks like you're iterating over the same frames three times, once each for TX, TY and TZ. Try doing all the computation in one loop.

That should cut your run time by roughly two thirds.
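
A minimal sketch of what a single loop could look like, assuming the same column names as the question and at most one row per tagID in each Frame. It uses NumPy broadcasting instead of the per-axis `apply` calls, which is a swapped-in technique rather than the question's original approach, and the actual speedup has not been measured here.

```python
import numpy as np
import pandas as pd

# Cut-down sample: two frames, one of them with a missing position.
df = pd.DataFrame({
    'Frame': [1, 1, 1, 3, 3, 3],
    'tagID': ['nb1', 'nb2', 'nb3', 'nb1', 'nb2', 'nb3'],
    'TX': [5, 2, 3, 7, 5, np.nan],
    'TY': [4, 2, 3, 3, 2, np.nan],
    'TZ': [2, 3, 4, 4, 3, np.nan],
})

pieces = []
for frame, temp in df.groupby('Frame'):
    coords = temp[['TX', 'TY', 'TZ']].to_numpy(dtype=float)  # shape (n, 3)
    # Broadcasting (n, 1, 3) - (1, n, 3) gives every pairwise difference.
    diff = coords[:, None, :] - coords[None, :, :]
    # Sum the squared x, y and z differences, then take the square root.
    dist = np.sqrt((diff ** 2).sum(axis=2))                  # shape (n, n)
    pieces.append(pd.DataFrame(dist, index=temp.index,
                               columns=temp['tagID'].to_numpy()))

euDist = pd.concat(pieces)
new_df = df.join(euDist)  # aligned on the original row index
print(new_df)
```

Grouping by Frame also avoids assuming that frame numbers start at 1 and have no gaps, and NaN coordinates propagate to NaN distances as in the question's expected output.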


user

