
I have a pandas df with 2 columns, each containing 2.7 million rows of normalized vectors of length 20.

I want to take the cosine similarity of column1-row1 vs column2-row1, column1-row2 vs column2-row2, and so forth through all 2.7 million rows.

I have tried looping, but this is extremely slow. What is the fastest way to do this?

Here is what I'm using now:

from scipy import spatial

for index, row in df.iterrows():
    x = 1 - spatial.distance.cosine(tempdf['unit_vector'][index],
                                    tempdf['ave_unit_vector'][index])
    print(index, x)

data:

tempdf['unit_vector']
Out[185]: 
0          [0.7071067811865475, 0.7071067811865475, 0.0, ...
1          [0.634997029655247, 0.634997029655247, 0.43995...
2          [0.5233710392524532, 0.5233710392524532, 0.552...
3          [0.4792468085399227, 0.4792468085399227, 0.505...
4          [0.4937468195427678, 0.4937468195427678, 0.492...
5          [0.49444897739151283, 0.49444897739151283, 0.5...
6          [0.49548793862403173, 0.49548793862403173, 0.4...
7          [0.5027211862475275, 0.5027211862475275, 0.495...
8          [0.5136216906905179, 0.5136216906905179, 0.489...
9          [0.5035958124287837, 0.5035958124287837, 0.508...
10         [0.5037995208120967, 0.5037995208120967, 0.493...


tempdf['ave_unit_vector']
Out[186]: 
0          [0.5024525269125278, 0.5024525269125278, 0.494...
1          [0.5010905514059507, 0.5010905514059507, 0.499...
2          [0.4993456468410199, 0.4993456468410199, 0.501...
3          [0.5005492367626839, 0.5005492367626839, 0.498...
4          [0.4999384715200533, 0.4999384715200533, 0.501...
5          [0.49836832120891517, 0.49836832120891517, 0.5...
6          [0.49842376222388335, 0.49842376222388335, 0.5...
7          [0.4984869391887457, 0.4984869391887457, 0.500...
8          [0.4990867844970344, 0.4990867844970344, 0.499...
9          [0.49977780370532715, 0.49977780370532715, 0.4...
10         [0.5003161478128204, 0.5003161478128204, 0.499...

This isn't the same dataset, but it will create a usable df with columns 'B' and 'C':

import pandas as pd

df = pd.DataFrame(list(range(0, 1000)), columns=['A'])

for i in range(0, 5):
    df['New_{}'.format(i)] = df['A'].shift(i).tolist()

cols = len(df.columns)
start_col = cols - 6

df['B'] = df.iloc[:, start_col:cols].values.tolist()
df['C'] = df['B'] * 2  # note: multiplying a Series of lists by 2 repeats each list; it does not scale the values
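Because the values are Python lists stored in object-dtype columns, most vectorized NumPy operations can't be applied to them directly. A common first step (a sketch, not from the original post) is to stack a list-valued column into one dense 2-D array:

```python
import numpy as np
import pandas as pd

# Recreate the synthetic frame from the question
df = pd.DataFrame(list(range(0, 1000)), columns=['A'])
for i in range(0, 5):
    df['New_{}'.format(i)] = df['A'].shift(i).tolist()
df['B'] = df.iloc[:, 0:6].values.tolist()

# Stack the list-valued column into a dense (n_rows, 6) float array;
# every row-wise operation afterwards is plain NumPy, no object dtype
b = np.vstack(df['B'].values)
print(b.shape)  # (1000, 6)
```

Once the data is a dense array, row-wise arithmetic no longer pays the per-element Python overhead that makes `iterrows` slow.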
  • Your description of your data isn't very clear. If I am understanding you correctly, you have a pandas `dtype=object` with *some sort of sequence* type as values. Fundamentally, this will always be slow, especially if you are looping using a python for-loop. If you are trying to get the pair-wise distances, then your algorithm is fundamentally O(n^2). – juanpa.arrivillaga Jul 16 '18 at 00:18
  • Depending on what you want to do, there may be solutions that don't require seeking every pair, perhaps using a clever data-structure like a kd-tree or a ball-tree, or if you actually need every comparison, there is a parallelized pair-wise distance implementation in scikit-learn. Or you could program your own, maybe in another language. But it *sounds* like you are abusing `pandas.DataFrames` to begin with. At the very least, you are being very memory-wasteful. – juanpa.arrivillaga Jul 16 '18 at 00:19
  • I added some data for better context. I probably am slightly abusing pandas but for the majority of my code pandas is best. I need to calculate it between each of the pairs because they are features for a ML model. – Federico Marchese Jul 16 '18 at 00:24
  • Look, I think I get what you are doing here, but you should really provide a [mcve]. It is the *least* you could do if you are going to ask people for help. Just printing your data-frame isn't really a reproducible example. But anyway, as I said, you should look in to an optimized pair-wise distance implementation, but you'll have to reorganize your data-frame (because you *are* abusing it) to get it to work with the [`scikit-learn`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html) implementation. – juanpa.arrivillaga Jul 16 '18 at 00:27
  • Okay thank you. I'll give it a look. – Federico Marchese Jul 16 '18 at 00:34
  • If the vectors are normalized (np.linalg.norm(vec) == 1.0) as you say, then the cosine distance is just the dot product between the vectors. – Marijn van Vliet Jul 16 '18 at 07:58
  • Using `np.apply_along_axis()` should be slightly faster than a normal for loop. The rest depends on your amount of RAM. With about 1.5 GB free you should be able to calculate without a for loop. If you have less, you could iterate over slices of say 100000 vectors at a time. – Joe Jul 16 '18 at 08:36
  • Thank you Joe, I will try this out – Federico Marchese Jul 16 '18 at 12:34

1 Answer


This is the fastest way I have tried. It brought the calculation down from over 30 minutes in a loop to about 5 seconds:

import numpy as np

tempdf['vector_mult'] = np.multiply(tempdf['unit_vector'], tempdf['ave_unit_vector'])
tempdf['cosinesim'] = tempdf['vector_mult'].apply(lambda x: sum(x))

This works because my vectors are already unit vectors: for unit vectors, cosine similarity reduces to the dot product.

The first line multiplies the vectors in the two columns element-wise, row by row. The second sums each product vector, giving the per-row dot product. The challenge here was that no pre-built function wanted to work row by row; instead they wanted to sum the vectors within each column and then calculate the result.
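A fully vectorized variant (a sketch, not part of the original answer; it assumes all vectors have the same length) avoids per-row Python objects entirely by stacking both columns into dense arrays once and computing every row-wise dot product in a single `np.einsum` call:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two unit-vector columns (random, then row-normalized)
rng = np.random.default_rng(0)
v = rng.normal(size=(1000, 20))
v /= np.linalg.norm(v, axis=1, keepdims=True)
w = rng.normal(size=(1000, 20))
w /= np.linalg.norm(w, axis=1, keepdims=True)
tempdf = pd.DataFrame({'unit_vector': list(v), 'ave_unit_vector': list(w)})

# Stack the object columns to dense (n, 20) arrays once,
# then take the dot product of each row pair in one call
a = np.vstack(tempdf['unit_vector'].values)
b = np.vstack(tempdf['ave_unit_vector'].values)
tempdf['cosinesim'] = np.einsum('ij,ij->i', a, b)
```

On millions of rows this keeps all the arithmetic inside NumPy; if memory is tight, the same `einsum` can be applied to slices of the stacked arrays.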

  • Can you share how you created the unit vector? I created mine using preprocessing.normalize(df, norm='l2'), but when I multiply the vectors I get "can't multiply sequence by non-int of type 'list'", since the column is not a numpy array. – Kundan Oct 22 '21 at 12:43