I have a pandas DataFrame with two columns, each holding 2.7 million rows of normalized vectors of length 20.
I want the cosine similarity of column 1, row 1 vs. column 2, row 1; column 1, row 2 vs. column 2, row 2; and so on for all 2.7 million rows.
I have tried looping, but this is extremely slow. What is the fastest way to do this?
Here is what I'm using now:
from scipy import spatial

for index, row in tempdf.iterrows():
    x = 1 - spatial.distance.cosine(tempdf['unit_vector'][index],
                                    tempdf['ave_unit_vector'][index])
    print(index, x)
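Since the vectors are already unit length, the cosine similarity of each pair reduces to a plain dot product, so one possible direction is to stack each column into a 2-D NumPy array and take all the row-wise dot products in a single call. The following is only a minimal sketch under that assumption; the helper name `rowwise_cosine_unit` and the tiny `demo` frame are made up for illustration and are not part of the real data.

```python
import numpy as np
import pandas as pd

def rowwise_cosine_unit(df, col_a, col_b):
    """Row-wise cosine similarity for two columns of equal-length unit vectors.

    Hypothetical helper: assumes every vector already has norm 1.
    """
    # Stack each column of lists into a 2-D float array of shape (n_rows, dim).
    a = np.array(df[col_a].tolist(), dtype=float)
    b = np.array(df[col_b].tolist(), dtype=float)
    # For unit vectors the cosine similarity equals the dot product;
    # einsum computes the dot product of each row pair without a Python loop.
    return np.einsum('ij,ij->i', a, b)

# Tiny made-up frame standing in for tempdf (values are illustrative only).
demo = pd.DataFrame({
    'unit_vector': [[1.0, 0.0], [0.6, 0.8]],
    'ave_unit_vector': [[0.0, 1.0], [0.6, 0.8]],
})
demo['cos_sim'] = rowwise_cosine_unit(demo, 'unit_vector', 'ave_unit_vector')
```

The same call would presumably be used as `rowwise_cosine_unit(tempdf, 'unit_vector', 'ave_unit_vector')`. Two arrays of 2.7 million by 20 floats take roughly 430 MB each at float64, so memory may be the main constraint.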
Data:
tempdf['unit_vector']
Out[185]:
0 [0.7071067811865475, 0.7071067811865475, 0.0, ...
1 [0.634997029655247, 0.634997029655247, 0.43995...
2 [0.5233710392524532, 0.5233710392524532, 0.552...
3 [0.4792468085399227, 0.4792468085399227, 0.505...
4 [0.4937468195427678, 0.4937468195427678, 0.492...
5 [0.49444897739151283, 0.49444897739151283, 0.5...
6 [0.49548793862403173, 0.49548793862403173, 0.4...
7 [0.5027211862475275, 0.5027211862475275, 0.495...
8 [0.5136216906905179, 0.5136216906905179, 0.489...
9 [0.5035958124287837, 0.5035958124287837, 0.508...
10 [0.5037995208120967, 0.5037995208120967, 0.493...
tempdf['ave_unit_vector']
Out[186]:
0 [0.5024525269125278, 0.5024525269125278, 0.494...
1 [0.5010905514059507, 0.5010905514059507, 0.499...
2 [0.4993456468410199, 0.4993456468410199, 0.501...
3 [0.5005492367626839, 0.5005492367626839, 0.498...
4 [0.4999384715200533, 0.4999384715200533, 0.501...
5 [0.49836832120891517, 0.49836832120891517, 0.5...
6 [0.49842376222388335, 0.49842376222388335, 0.5...
7 [0.4984869391887457, 0.4984869391887457, 0.500...
8 [0.4990867844970344, 0.4990867844970344, 0.499...
9 [0.49977780370532715, 0.49977780370532715, 0.4...
10 [0.5003161478128204, 0.5003161478128204, 0.499...
This isn't the same dataset, but it will create a usable df; the columns of interest are 'B' and 'C':
import pandas as pd

df = pd.DataFrame(list(range(0, 1000)), columns=['A'])
for i in range(0, 5):
    df['New_{}'.format(i)] = df['A'].shift(i).tolist()
cols = len(df.columns)
start_col = cols - 6
df['B'] = df.iloc[:, start_col:cols].values.tolist()
# Double each element; df['B'] * 2 would repeat each list instead
df['C'] = df['B'].apply(lambda v: [2 * x for x in v])
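Columns like 'B' and 'C' are not unit length, so a comparison on this kind of data needs the norms as well. Below is a rough sketch of a general row-wise version, assuming both columns hold lists of the same length; `rowwise_cosine` and the small `sample` frame are hypothetical names and values used only for illustration.

```python
import numpy as np
import pandas as pd

def rowwise_cosine(df, col_a, col_b):
    """Row-wise cosine similarity for two columns of equal-length vectors.

    Hypothetical helper: the vectors do not need to be normalized.
    """
    a = np.array(df[col_a].tolist(), dtype=float)
    b = np.array(df[col_b].tolist(), dtype=float)
    # Dot product of each row pair.
    dots = np.einsum('ij,ij->i', a, b)
    # Product of the two Euclidean norms for each row.
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows with a zero vector or NaN entries come out as NaN; the errstate
    # context only silences the NumPy warnings for those rows.
    with np.errstate(invalid='ignore', divide='ignore'):
        return dots / norms

# Made-up rows: parallel, opposite, and orthogonal vectors.
sample = pd.DataFrame({
    'B': [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [1.0, 0.0, 0.0]],
    'C': [[2.0, 4.0, 6.0], [-4.0, -5.0, -6.0], [0.0, 1.0, 0.0]],
})
sample['cos_sim'] = rowwise_cosine(sample, 'B', 'C')
```

Up to floating-point rounding, the three sample rows should come out as 1, -1, and 0 respectively.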