
I have three dataframes: df1 with 1,160,164 rows and 4 variables, df2 with 11,241 rows and 4 variables, and df3 with 1,630,644 rows and 6 variables.

df1 looks like this: [screenshot of df1]

df2 looks like this: [screenshot of df2]

The observations in df1 are those in df3 where energy_kcal_100g_nettoye is available.

The observations in df2 are those in df3 where energy_kcal_100g_nettoye is missing.

df3 looks like this: [screenshot of df3]

I need to compute the Euclidean distance between every row of df2 and every row of df1 (not within df1 or within df2). Then, for each row of df2, I need to keep the indices of the 5 closest rows of df1 and compute the mean of energy_kcal_100g_nettoye over those 5 indices in df3.
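
For reference, here is a minimal synthetic setup with the same structure (the column names a, b, c, d are placeholders, not the real ones, and the sizes are scaled down):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# df3: all observations; the target column is missing for some rows
df3 = pd.DataFrame(rng.random((1000, 4)), columns=list("abcd"))
df3["energy_kcal_100g_nettoye"] = rng.random(1000) * 500
df3.loc[df3.sample(frac=0.1, random_state=0).index,
        "energy_kcal_100g_nettoye"] = np.nan

mask = df3["energy_kcal_100g_nettoye"].notna()
df1 = df3.loc[mask, list("abcd")]    # target available
df2 = df3.loc[~mask, list("abcd")]   # target missing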

I tried this code, but it never finishes:

import numpy as np

for index, row in df2.iterrows():
    # distance from this df2 row to every row of df1
    dist_matrix = df1.apply(lambda row2: np.linalg.norm(row2.values - row.values), axis=1)
    dist_matrix = dist_matrix.sort_values()
    # index labels of the 5 nearest rows of df1
    observation = dist_matrix[0:5].index
    echantillon = df3.loc[observation]
    df3.loc[index, 'energy_kcal_100g_mean_distance_5'] = echantillon['energy_kcal_100g_nettoye'].mean()

I want to use vectorization to make it faster, but I haven't managed to; my data is too big.

Can you help me, please?

Thanks

PS: sorry for my English.

Cocogne

2 Answers


You should use broadcasting to compute this with vectorization. This post actually gives the answer; note, however, that you don't need to compute the square root if you only want to find the closest samples (minimizing the squared distance is equivalent to minimizing the distance, since the square root is monotonic).

For the sake of the example, with df1 of shape (1000, 4) and df2 of shape (200, 4):

>>> diff = df1.values - df2.values[:, None, :]
>>> diff.shape
(200, 1000, 4)
>>> dist_squared = np.square(diff).sum(axis=2)
>>> dist_squared.shape
(200, 1000)

You now have a 200x1000 matrix of squared distances between the rows of the two dataframes. But it is unclear what you want to do next. You can identify the 5 closest pairs of rows, but that potentially gives you 2 x 5 = 10 different rows (5 in df1 and 5 in df2). You could compute:

>>> closest = dist_squared.argpartition(kth=5, axis=None)[:5]
>>> ids_df2, ids_df1 = np.divmod(closest, dist_squared.shape[1])

where np.divmod converts the flattened positions back into 2-D coordinates (the divisor is the row length of dist_squared, 1000 here): ids_df1 contains the indices of the 5 rows of df1 that are closest to any row of df2, and ids_df2 the matching rows of df2. But those indices are not necessarily unique; it depends on what you mean by "I want to keep the 5 closest indices".
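
If instead you want, for each row of df2, the 5 closest rows of df1 (which is what the loop in the question computes), you can partition along axis 1. A sketch continuing the small example, assuming df3 has a default RangeIndex so that index labels double as positions:

>>> idx5 = np.argpartition(dist_squared, kth=5, axis=1)[:, :5]   # (200, 5) positions in df1
>>> labels5 = df1.index.to_numpy()[idx5]                         # map positions to df1/df3 labels
>>> means = df3['energy_kcal_100g_nettoye'].to_numpy()[labels5].mean(axis=1)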

--- EDIT

Indeed, this solution will not work with the size of your matrices (you get a memory error). Without using a distributed framework, a for loop as you suggested may be the simplest approach. But even then, you should skip np.sqrt and use np.argpartition instead of a full sort; both changes will speed up the computation.
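
A middle ground (my addition, not part of the original answer) is to loop over df2 in chunks rather than row by row, and to expand the square, ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a.b, so that each chunk only materializes a (chunk, len(df1)) matrix instead of a 3-D array. A sketch, assuming df3 has a default RangeIndex; the chunk size is a knob to tune to your memory:

import numpy as np

A = df1.to_numpy(dtype=np.float64)                    # (n1, 4) rows with a known target
B = df2.to_numpy(dtype=np.float64)                    # (n2, 4) rows to impute
target = df3['energy_kcal_100g_nettoye'].to_numpy()

A_sq = np.square(A).sum(axis=1)                       # ||a||^2 for every df1 row, computed once
chunk = 100                                           # ~0.9 GB per (chunk, n1) block when n1 ~ 1.1M
means = np.empty(len(B))

for start in range(0, len(B), chunk):
    block = B[start:start + chunk]
    # squared distances via ||a||^2 + ||b||^2 - 2*a.b, no (chunk, n1, 4) intermediate
    d2 = A_sq + np.square(block).sum(axis=1)[:, None] - 2.0 * (block @ A.T)
    idx5 = np.argpartition(d2, kth=5, axis=1)[:, :5]  # positions of the 5 nearest df1 rows
    labels = df1.index.to_numpy()[idx5]               # map positions to df1/df3 index labels
    means[start:start + chunk] = target[labels].mean(axis=1)

df3.loc[df2.index, 'energy_kcal_100g_mean_distance_5'] = means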

maousi
  • I get a memory error with this solution: MemoryError: Unable to allocate 470. GiB for an array with shape (11052, 1142159, 4) and data type float64 – Cocogne Mar 26 '21 at 17:19

I want to find, for each row of df2, the five closest rows of df1, in order to compute the mean of energy_kcal over these observations in df3 (I am doing this for imputation). I get a memory error when I try to create a len(df1) x len(df2) matrix, so I loop over df2 to build len(df2) distance vectors instead, but that is too slow: it takes 45 minutes with this new code:

for index, row in df2.iterrows():
    # Euclidean distance from this df2 row to every row of df1
    dist = np.sqrt(np.square(df1 - row).sum(axis=1))
    dist = dist.sort_values()
    # mean of the target over the 5 nearest df1 rows, looked up in df3
    df3.loc[index, 'energy_kcal_100g_mean_distance_5'] = df3.loc[dist[0:5].index, 'energy_kcal_100g_nettoye'].mean()

Is it possible to do it faster?
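
For the record (an addition, not part of the original thread): with only 4 features, a k-d tree answers 5-nearest-neighbour queries without building any distance matrix, which is the standard way to make this fast. A minimal sketch assuming SciPy is installed and df3 has a default RangeIndex:

from scipy.spatial import cKDTree

tree = cKDTree(df1.to_numpy())                  # build once on the rows with a known target
dist, pos = tree.query(df2.to_numpy(), k=5)     # 5 nearest df1 rows for every df2 row
labels = df1.index.to_numpy()[pos]              # (len(df2), 5) df1/df3 index labels

target = df3['energy_kcal_100g_nettoye'].to_numpy()
df3.loc[df2.index, 'energy_kcal_100g_mean_distance_5'] = target[labels].mean(axis=1)

On recent SciPy versions you can also pass workers=-1 to tree.query to parallelize the lookups across cores.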

Cocogne