-2

I have written code to calculate the distance between every pair of objects (tagID) from their x, y, z coordinates (TX, TY, TZ) at each time step (Frame). The code works, but it is too slow for what I need. My current test data has about 538,792 rows; my actual data will have about 6,880,000 rows. Building these distance matrices currently takes a few minutes (maybe 10-15), and since I will have 40 data sets, I would like to speed things up.

The current code is as follows:

import numpy as np
import pandas as pd

# Sample data frame with correct columns:

data2 = ({'Frame' :[1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7,7], 
      'tagID' : ['nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3','nb1','nb2','nb3'],
      'TX':[5,2,3,4,5,6,7,5,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TY':[4,2,3,4,5,9,3,2,np.nan,5,2,3,4,5,6,7,5,4,8,3,2],
      'TZ':[2,3,4,6,7,8,4,3,np.nan,5,2,3,4,5,6,7,5,4,8,3,2]})

df = pd.DataFrame(data2)

    Frame tagID   TX   TY   TZ
0       1   nb1  5.0  4.0  2.0
1       1   nb2  2.0  2.0  3.0
2       1   nb3  3.0  3.0  4.0
3       2   nb1  4.0  4.0  6.0
4       2   nb2  5.0  5.0  7.0
5       2   nb3  6.0  9.0  8.0
6       3   nb1  7.0  3.0  4.0
7       3   nb2  5.0  2.0  3.0
8       3   nb3  NaN  NaN  NaN
9       4   nb1  5.0  5.0  5.0
10      4   nb2  2.0  2.0  2.0
11      4   nb3  3.0  3.0  3.0
12      5   nb1  4.0  4.0  4.0
13      5   nb2  5.0  5.0  5.0
14      5   nb3  6.0  6.0  6.0
15      6   nb1  7.0  7.0  7.0
16      6   nb2  5.0  5.0  5.0
17      6   nb3  4.0  4.0  4.0
18      7   nb1  8.0  8.0  8.0
19      7   nb2  3.0  3.0  3.0
20      7   nb3  2.0  2.0  2.0


# Calculate the squared distance between all x points:

TXdf = [] 
for i in range(1,df['Frame'].max()+1):
    boox = df['Frame'] == i 
    tempx = df[boox] 
    tx=tempx['TX'].apply(lambda x : (tempx['TX']-x)**2) 
    tx.columns=tempx.tagID   
    tx['ID']=tempx.tagID 
    tx['Frame'] = tempx.Frame 
    TXdf.append(tx) 
TXdfFinal = pd.concat(TXdf)  # combine the per-frame DataFrames into one
print(TXdfFinal)
TXdfFinal.info()

# Calculate the squared distance between all y points:

print('y-diff sum')
TYdf = [] 
for i in range(1,df['Frame'].max()+1):
    booy = df['Frame'] == i 
    tempy = df[booy] 
    ty=tempy['TY'].apply(lambda x : (tempy['TY']-x)**2) 
    ty.columns=tempy.tagID   
    ty['ID']=tempy.tagID 
    ty['Frame'] = tempy.Frame 
    TYdf.append(ty) 
TYdfFinal = pd.concat(TYdf) 
print(TYdfFinal)
TYdfFinal.info()

# Calculate the squared distance between all z points:

print('z-diff sum')
TZdf = [] 
for i in range(1,df['Frame'].max()+1):
    booz = df['Frame'] == i 
    tempz = df[booz] 
    tz=tempz['TZ'].apply(lambda x : (tempz['TZ']-x)**2) 
    tz.columns=tempz.tagID  
    tz['ID']=tempz.tagID 
    tz['Frame'] = tempz.Frame 
    TZdf.append(tz) 
TZdfFinal = pd.concat(TZdf)


# Add all squared differences together:

euSum = TXdfFinal + TYdfFinal + TZdfFinal

# Square root the sum of the differences of each coordinate for Euclidean distance and add Frame and ID columns back on:

euDist = euSum.loc[:, euSum.columns !='ID'].apply(lambda x: x**0.5)
euDist['tagID'] = list(TXdfFinal['ID'])
euDist['Frame'] = list(TXdfFinal['Frame'])


# Add the distance matrix to the original dataframe based on Frame and ID columns:

new_df = pd.merge(df, euDist,  how='left', left_on=['Frame','tagID'], right_on = ['Frame','tagID'])

    Frame tagID   TX   TY   TZ      nb1     nb2      nb3
0       1   nb1  5.0  4.0  2.0   0.0000  3.7417   3.0000
1       1   nb2  2.0  2.0  3.0   3.7417  0.0000   1.7321
2       1   nb3  3.0  3.0  4.0   3.0000  1.7321   0.0000
3       2   nb1  4.0  4.0  6.0   0.0000  1.7321   5.7446
4       2   nb2  5.0  5.0  7.0   1.7321  0.0000   4.2426
5       2   nb3  6.0  9.0  8.0   5.7446  4.2426   0.0000
6       3   nb1  7.0  3.0  4.0   0.0000  2.4495      NaN
7       3   nb2  5.0  2.0  3.0   2.4495  0.0000      NaN
8       3   nb3  NaN  NaN  NaN      NaN     NaN      NaN
9       4   nb1  5.0  5.0  5.0   0.0000  5.1962   3.4641
10      4   nb2  2.0  2.0  2.0   5.1962  0.0000   1.7321
11      4   nb3  3.0  3.0  3.0   3.4641  1.7321   0.0000
12      5   nb1  4.0  4.0  4.0   0.0000  1.7321   3.4641
13      5   nb2  5.0  5.0  5.0   1.7321  0.0000   1.7321
14      5   nb3  6.0  6.0  6.0   3.4641  1.7321   0.0000
15      6   nb1  7.0  7.0  7.0   0.0000  3.4641   5.1962
16      6   nb2  5.0  5.0  5.0   3.4641  0.0000   1.7321
17      6   nb3  4.0  4.0  4.0   5.1962  1.7321   0.0000
18      7   nb1  8.0  8.0  8.0   0.0000  8.6603  10.3923
19      7   nb2  3.0  3.0  3.0   8.6603  0.0000   1.7321
20      7   nb3  2.0  2.0  2.0  10.3923  1.7321   0.0000

I have tried using both `euclidean()` and `pdist()` with `metric='euclidean'`, but can't get the iteration correct.

Any advice on how to get the same result but a lot faster would be greatly appreciated.

VacciniumC
  • Check https://stackoverflow.com/questions/47782104/compute-euclidean-distance-between-rows-of-two-pandas-dataframes/47782154#47782154? – BENY May 20 '19 at 21:12
  • The code I used to get above is as follows: `ary = scipy.spatial.distance.cdist(df.iloc[:,2:5], df.iloc[:,2:5], metric='euclidean') pd.DataFrame(ary)` – VacciniumC May 20 '19 at 21:30
  • Do it under groupby – BENY May 20 '19 at 21:32
  • When I use `eudist = scipy.spatial.distance.cdist(df.iloc[:,2:5], df.iloc[:,2:5], metric='euclidean')` as suggested, I do get a distance matrix, but it has no Frame or ID, and it is a 20x20 matrix rather than a 20x3 matrix. With my full data that would be a 6880000x6880000 matrix. – VacciniumC May 20 '19 at 21:45
  • Where in the code do I use the groupby? – VacciniumC May 20 '19 at 21:47
  • I have added it as an answer. – BENY May 20 '19 at 21:54

2 Answers

1

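A method using SciPy: group the coordinates by Frame and call cdist on each group.

import numpy as np
from scipy.spatial import distance
frames = df[['TX','TY','TZ']].groupby(df['Frame'])  # x = frame label, y = that frame's coordinates
df['nb1'],df['nb2'],df['nb3']=np.concatenate([distance.cdist(y, y, metric='euclidean') for x , y in frames]).T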
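The one-liner above hard-codes the column names nb1, nb2, nb3 and relies on every frame listing the same three tags in the same order. A minimal sketch of the same groupby plus cdist idea that takes the column labels from tagID instead might look like this. The function name `add_pairwise_distances` and the small `sample` frame are hypothetical, made up for illustration, and it assumes no tagID value collides with an existing column name.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance


def add_pairwise_distances(df, coords=("TX", "TY", "TZ")):
    """Return df plus one distance column per tagID, computed within each Frame.

    Hypothetical helper: column labels come from tagID, so frames with
    differing tag sets simply get NaN where a tag is absent.
    """
    blocks = []
    for _, group in df.groupby("Frame", sort=False):
        xyz = group[list(coords)].to_numpy(dtype=float)
        # Pairwise Euclidean distances between the rows of this frame only
        d = distance.cdist(xyz, xyz, metric="euclidean")
        blocks.append(
            pd.DataFrame(d, index=group.index, columns=group["tagID"].to_numpy())
        )
    # Rows keep the original index, so a plain join lines them up with df
    return df.join(pd.concat(blocks))


# Tiny made-up sample: frame 2 is missing nb3 on purpose
sample = pd.DataFrame({
    "Frame": [1, 1, 1, 2, 2],
    "tagID": ["nb1", "nb2", "nb3", "nb1", "nb2"],
    "TX": [5, 2, 3, 4, 5],
    "TY": [4, 2, 3, 4, 5],
    "TZ": [2, 3, 4, 6, 7],
})
result = add_pairwise_distances(sample)
```

Because each block carries the original row index, no merge on Frame and tagID is needed afterwards.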
BENY
  • This worked like a charm. It is much faster, and with the following code I can get a df that I can add back to the original: `from scipy.spatial import distance di = df['nb1'],df['nb2'],df['nb3']=np.concatenate([distance.cdist(y, y, metric='euclidean') for x , y in df[['TX','TY','TZ']].groupby(df['Frame'])]).T di = pd.DataFrame(di) di = di.T di.rename(columns={0: 'nb1', 1: 'nb2', 2: 'nb3'}, inplace=True) di['Frame'] = df['Frame'] di['tagID'] = df['tagID']` – VacciniumC May 20 '19 at 22:47
  • Shouldn't it be `distance.cdist(x, y, metric='euclidean')`? That is, `x, y` instead of `y, y`. – rp1 Jun 27 '19 at 18:20
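On the `x, y` question: iterating over a pandas groupby yields (key, sub-frame) pairs, so in the answer's comprehension `x` is the Frame label and `y` is that frame's block of coordinates. That is why `cdist(y, y)` compares a frame's points with each other. A small sketch with made-up data shows what each iteration hands back:

```python
import pandas as pd

# Made-up two-frame example, one coordinate column for brevity
df = pd.DataFrame({"Frame": [1, 1, 2, 2], "TX": [5.0, 2.0, 4.0, 5.0]})

# Each iteration gives the group key and the rows belonging to it
pairs = [(key, group.shape) for key, group in df[["TX"]].groupby(df["Frame"])]
keys = [key for key, _ in pairs]      # the Frame labels
shapes = [shape for _, shape in pairs]  # (rows, columns) of each sub-frame
```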
0

You could try cutting down the number of for loops from 3 to 1. You are iterating over the same frames three times, once per coordinate. Try doing all the computation in a single loop.

That should cut your run time by roughly two thirds.
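One way the single-loop idea could be sketched is to handle all three coordinates at once with NumPy broadcasting, so each frame is visited only once. This is an illustration rather than a drop-in replacement for the code in the question: the function name `distances_single_loop` and the `sample` frame are hypothetical, and no timing has been measured here.

```python
import numpy as np
import pandas as pd


def distances_single_loop(df):
    """Hypothetical helper: one pass over the frames, all coordinates together."""
    parts = []
    for _, g in df.groupby("Frame", sort=False):
        xyz = g[["TX", "TY", "TZ"]].to_numpy(dtype=float)  # shape (n, 3)
        diff = xyz[:, None, :] - xyz[None, :, :]           # shape (n, n, 3)
        dist = np.sqrt((diff ** 2).sum(axis=-1))           # shape (n, n)
        parts.append(
            pd.DataFrame(dist, index=g.index, columns=g["tagID"].to_numpy())
        )
    return pd.concat(parts)


# Made-up sample with a missing position, as in frame 3 of the question
sample = pd.DataFrame({
    "Frame": [1, 1, 1],
    "tagID": ["nb1", "nb2", "nb3"],
    "TX": [7.0, 5.0, np.nan],
    "TY": [3.0, 2.0, np.nan],
    "TZ": [4.0, 3.0, np.nan],
})
dist_df = distances_single_loop(sample)
```

A missing position propagates as NaN through the subtraction and the sum, so rows and columns for that tag come out as NaN, the same as in the expected output in the question.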

user