My original data is fairly large. It looks roughly like this: data =

[[0, 0, 0, ......0]
 [0, 0.124, 0, ..0]
         .
         .
         .
 [0, 0, 0, 0, 0.174]]

data2 =

[[0, 0, 0, ......0]
 [0, 0.74, 0, ..,0]
         .
         .
         .
 [0, 0, 0.15, 0, 0]]

data and data2 each contain 10 matrices (rows), and each one has 3687 values.

I want to compute the cosine similarity between every matrix in data and every matrix in data2: the first matrix in data against the first through the last matrix in data2, and so on, so that I get a 10x10 matrix of similarity scores. I use sklearn, with sklearn.metrics.pairwise to compute the cosine similarity, and then fit the model:

import numpy as np
from sklearn import manifold
A = np.matrix(cop)
A = 1.-A
model = manifold.TSNE(metric="precomputed")
Y = model.fit_transform(A)

but it shows:

X should be a square distance matrix

I tried a much simpler dataset and it does fit.

How can I compute the cosine similarity and get a 10x10 matrix of cosine scores?

賴韋安
  • What exactly is `cop` in your code above? Is that supposed to be the 10x10 cosine similarity matrix? – tel Nov 20 '18 at 11:11
  • It holds the probabilities obtained after running LDA on a document. I got 10 topics with 10 words each, and I want to compute the cosine similarity between the results of two different LDA runs. The two runs have 3687 unique terms combined, so for each of the ten topics I made a matrix of length 3687 and used each word's ID to place its corresponding probability, which gives 36870 values in total in data and data2. Only 10 values are non-zero in each matrix, 100 in total in data. It would be tedious to post all of my code.... – 賴韋安 Nov 20 '18 at 11:49
  • In the answer I posted below, `dist` is the 10x10 cosine similarity matrix. If that's all you want, just ignore the `TSNE` stuff below it. – tel Nov 20 '18 at 12:18
  • If you're looking for something else, you're going to have to clarify your question. Probably you should at least add an example of desired input/desired output. The example input doesn't have to be your complete 10x3687 datasets (simplified versions with fewer rows/cols are fine), but it can't have any dots/ellipses in it like it currently does. Otherwise it's never going to be clear exactly what you want. Here's some [docs for writing good example code for a question on this site](https://stackoverflow.com/help/mcve) – tel Nov 20 '18 at 12:19

1 Answer

The exact nature of your problem depends on what cop is in your code. You may have to post a more complete example of your buggy code to get a good answer.
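
One way to narrow it down is to check the shape of whatever is passed to TSNE, since the error message complains that the input is not square. The following is only a sketch: `check_precomputed` is a hypothetical helper (not part of scikit-learn), and the two arrays are stand-ins for what cop might be.

```python
import numpy as np

def check_precomputed(mat):
    """Return (is_square, shape) for a candidate precomputed matrix."""
    arr = np.asarray(mat, dtype=float)
    is_square = arr.ndim == 2 and arr.shape[0] == arr.shape[1]
    return is_square, arr.shape

# Hypothetical stand-ins: the raw 10x3687 data vs. a 10x10 score matrix
raw = np.zeros((10, 3687))
scores = np.zeros((10, 10))

print(check_precomputed(raw))     # (False, (10, 3687))
print(check_precomputed(scores))  # (True, (10, 10))
```

If cop turns out to have the shape of the raw data rather than 10x10, the similarity step has probably not been applied yet.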

Here's a complete example (with random data) of using cosine_similarity with TSNE:

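import numpy as np
from sklearn import manifold
from sklearn.metrics.pairwise import cosine_similarity

# Random stand-ins for the real data: 10 rows of 3687 values each
data1 = np.random.rand(10,3687)
data2 = np.random.rand(10,3687)
# dist[i, j] is the cosine similarity between row i of data1 and row j of data2, so its shape is (10, 10)
dist = cosine_similarity(data1, data2)

# metric="precomputed" tells TSNE to use the 10x10 matrix directly instead of raw features
model = manifold.TSNE(metric="precomputed")
Y = model.fit_transform(dist)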
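
If only the 10x10 score matrix is needed, the same computation can be sketched with NumPy alone. This is a minimal sketch rather than scikit-learn's implementation: `cosine_similarity_rows` is a hypothetical helper name, the random arrays stand in for the real data, and treating all-zero rows as having similarity 0 is an assumption made here.

```python
import numpy as np

def cosine_similarity_rows(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Normalise each row to unit length; all-zero rows stay zero (assumption)
    a_unit = np.divide(a, a_norm, out=np.zeros_like(a), where=a_norm > 0)
    b_unit = np.divide(b, b_norm, out=np.zeros_like(b), where=b_norm > 0)
    # Dot products of unit vectors are the cosine similarities
    return a_unit @ b_unit.T

rng = np.random.default_rng(0)
data1 = rng.random((10, 3687))  # stand-in for the real data
data2 = rng.random((10, 3687))  # stand-in for the real data2

sim = cosine_similarity_rows(data1, data2)
print(sim.shape)  # (10, 10)
```

Here `sim[i, j]` compares row i of data1 with row j of data2. Because metric="precomputed" means the input is interpreted as distances, `1 - sim` (as in the question's `A = 1.-A`) would be the corresponding cosine distance.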
tel