I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50), like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
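For reproducibility, data of roughly this shape can be generated like this (the random values and the exact construction are placeholders, not my real data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 300,000 rows, each holding a 50-dimensional vector in 'Array1'.
df = pd.DataFrame({
    'ID': np.arange(1, 300001),
    'Array1': [rng.random(50) for _ in range(300000)],
})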
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this I am currently using sklearn.metrics.pairwise.cosine_similarity and saving the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity

# Compare every row's 50-dimensional vector against array2 in one call.
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am trying to get that under 1 second, simply to reduce waiting time in the system I am building.
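For reference, I believe the same computation can be written directly in NumPy like this (a sketch using the standard dot-product/norm formula; the np.vstack step assumes the column's arrays stack cleanly into a (300000, 50) matrix):

import numpy as np

# Stack the per-row arrays into a single (300000, 50) matrix.
matrix = np.vstack(df['Array1'].to_numpy())
vec = array2.ravel()  # shape (50,)

# Cosine similarity = dot(a, b) / (|a| * |b|), vectorized over all rows.
dots = matrix @ vec
norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
df['Cosine'] = dots / norms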
I am beginning to learn about Vaex and Dask as alternatives to pandas, but I am failing to convert the code above into a working equivalent that is also faster; the rough shape of my Dask attempt is sketched below.
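Something along these lines (a sketch of the kind of conversion I mean; the partition count and the dtypes in meta are guesses, and I have not verified this is actually faster):

import numpy as np
import dask.dataframe as dd
from sklearn.metrics.pairwise import cosine_similarity

# Split the frame into partitions that Dask can process in parallel.
# npartitions=8 is an arbitrary choice, not a tuned value.
ddf = dd.from_pandas(df, npartitions=8)

def add_cosine(part):
    # Same sklearn call as above, applied to one partition at a time.
    part = part.copy()
    part['Cosine'] = cosine_similarity(np.vstack(part['Array1']), array2).ravel()
    return part

# Declaring meta (assumed dtypes) avoids Dask calling the function
# on dummy data just to infer the output schema.
meta = {'ID': 'int64', 'Array1': 'object', 'Cosine': 'float64'}
df = ddf.map_partitions(add_cosine, meta=meta).compute()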
How can I make these pairwise cosine calculations even faster for large datasets, preferably with one of the technologies I mentioned?