I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50), like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
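For reproducibility, data of roughly this shape can be generated like this (the random values and the exact construction are placeholders, not my real data):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# 300,000 rows, each holding a 50-dimensional vector in 'Array1'.
df = pd.DataFrame({
    'ID': np.arange(1, 300001),
    'Array1': [rng.random(50) for _ in range(300000)],
})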
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this I am currently using sklearn.metrics.pairwise.cosine_similarity and saving the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity

# Compare every row's 50-dimensional vector against array2 in one call.
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am trying to get that under 1 second, simply to reduce waiting time in the system I am building.
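For reference, I believe the same computation can be written directly in NumPy like this (a sketch using the standard dot-product/norm formula; the np.vstack step assumes the column's arrays stack cleanly into a (300000, 50) matrix):

import numpy as np

# Stack the per-row arrays into a single (300000, 50) matrix.
matrix = np.vstack(df['Array1'].to_numpy())
vec = array2.ravel()  # shape (50,)

# Cosine similarity = dot(a, b) / (|a| * |b|), vectorized over all rows.
dots = matrix @ vec
norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vec)
df['Cosine'] = dots / norms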
I am beginning to learn about Vaex and Dask as alternatives to pandas, but I am failing to convert the code above into a working equivalent that is also faster; the rough shape of my Dask attempt is sketched below.
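Something along these lines (a sketch of the kind of conversion I mean; the partition count and the dtypes in meta are guesses, and I have not verified this is actually faster):

import numpy as np
import dask.dataframe as dd
from sklearn.metrics.pairwise import cosine_similarity

# Split the frame into partitions that Dask can process in parallel.
# npartitions=8 is an arbitrary choice, not a tuned value.
ddf = dd.from_pandas(df, npartitions=8)

def add_cosine(part):
    # Same sklearn call as above, applied to one partition at a time.
    part = part.copy()
    part['Cosine'] = cosine_similarity(np.vstack(part['Array1']), array2).ravel()
    return part

# Declaring meta (assumed dtypes) avoids Dask calling the function
# on dummy data just to infer the output schema.
meta = {'ID': 'int64', 'Array1': 'object', 'Cosine': 'float64'}
df = ddf.map_partitions(add_cosine, meta=meta).compute()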
How can I make these pairwise cosine calculations even faster for large datasets, preferably with one of the technologies I mentioned?