
I have written the following code to compute the cosine similarity between a number of preprocessed documents (stop-word removal, stemming, and TF-IDF weighting).

print(X.shape)
similarity = []
for each in X:
    similarity.append(cosine_similarity(X[i:1], X))
    print(cosine_similarity(X[i:1], X))
    i = i+1

However, when I run it I receive this:

(2235, 7791)
[[ 1.          0.01490594  0.11752643 ...,  0.00941571  0.03652551
   0.01239277]]
Traceback (most recent call last):
  File "...", line 83, in <module>
    similarity.append(cosine_similarity(X[i:1], X))
  File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 881, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/metrics/pairwise.py", line 96, in check_pairwise_arrays
    X = check_array(X, accept_sparse='csr', dtype=dtype)
  File "/Users/.../anaconda/lib/python3.5/site-packages/sklearn/utils/validation.py", line 407, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 7791)) while a minimum of 1 is required.
[Finished in 56.466s]
user7347576
  • You are using X[i:1] inside your loop. When i reaches 1, you are accessing X[1:1], which returns an empty slice (zero rows). That's causing the error. – Dileep Kumar Patchigolla Feb 01 '17 at 02:14
  • @DileepKumarPatchigolla How can I do it then? – user7347576 Feb 01 '17 at 11:29
  • I am not familiar with cosine_similarity. Can you provide a sample of what X looks like, so I can try it out? – Dileep Kumar Patchigolla Feb 01 '17 at 12:13
  • Welcome to StackOverflow. Please read and follow the posting guidelines in the help documentation. [Minimal, complete, verifiable example](http://stackoverflow.com/help/mcve) applies here. We cannot effectively help you until you post your MCVE code and accurately describe the problem. In the code you posted, **cosine_similarity**, **i**, and **X** are undefined, so it's not clear what you're doing. – Prune Feb 01 '17 at 18:01
  • You could try this to get the cosine distance (1 minus the cosine similarity) between the first vector and all the others: `sklearn.metrics.pairwise.pairwise_distances(X[0:1], X, metric='cosine', n_jobs=1)` . http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.pairwise_distances.html (http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/ ) – mkaran Feb 02 '17 at 13:32

1 Answer


It's not clear what you're trying to achieve. You're taking a cosine similarity between a slice of the matrix X and the entire matrix. The slice is empty except when i == 0. Your for statement iterates through the matrix, but you never use the iteration variable each.
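The empty slice can be reproduced without scikit-learn. The sketch below uses a small nested list as a hypothetical stand-in for X; NumPy arrays and SciPy sparse matrices follow the same start:stop rule for row slices.

```python
# Toy stand-in for the document-term matrix (an assumption for illustration):
# 3 rows ("documents"), 2 columns ("terms").
X = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

for i in range(len(X)):
    buggy = X[i:1]      # stop index is fixed at 1, so this is empty once i >= 1
    fixed = X[i:i + 1]  # stop index moves with i, so this is always exactly one row
    print(i, len(buggy), len(fixed))

# Number of rows each slice returns, for every i.
buggy_lengths = [len(X[i:1]) for i in range(len(X))]
fixed_lengths = [len(X[i:i + 1]) for i in range(len(X))]
```

Only the first iteration of the buggy slice contains a row, which matches the single line of output printed before the traceback in the question.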

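Cosine similarity is an operation between two vectors of equal length. scikit-learn's cosine_similarity expects 2-D inputs, so take each row as a one-row slice rather than indexing it (X[i] on a dense array would be 1-D and rejected). For instance, you can compute the similarity between row i and row j with

cosine_similarity(X[i:i+1], X[j:j+1])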
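For reference, the quantity being computed is the dot product of the two vectors divided by the product of their lengths. Here is a minimal dependency-free sketch of that formula; the function name `cosine` and the choice to return 0.0 for a zero vector are assumptions made for this illustration, not part of scikit-learn's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # The similarity is undefined for a zero vector; 0.0 is a choice made here.
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine([1, 0], [1, 0]))  # same direction
print(cosine([1, 0], [0, 1]))  # orthogonal
```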

If you want the similarities of every row to every other row collected in a list, use a list comprehension over one-row slices:

similarity = [cosine_similarity(X[i:i+1], X) for i in range(X.shape[0])]
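If what you ultimately want is the full row-to-row matrix, a loop may not be needed at all: calling cosine_similarity with a single argument compares every row of X with every other row. The sketch below uses a small dense array as a hypothetical stand-in for your TF-IDF matrix and checks that the per-row approach and the single call agree.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-document, 3-term matrix; row 2 is row 0 scaled by 2.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 0.0, 4.0]])

# One call: entry (i, j) is the similarity between row i and row j.
full = cosine_similarity(X)

# Per-row version: each element is a (1, n_rows) array for one document.
per_row = [cosine_similarity(X[i:i + 1], X) for i in range(X.shape[0])]
stacked = np.vstack(per_row)

print(full.shape)
print(np.allclose(stacked, full))
```

For a matrix of the size in the question (2235 rows), the result is a 2235 x 2235 dense array, so the single call is usually the simpler choice unless memory is a concern.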

Does that get you moving?

Prune