
I am trying to do KNN using cosine similarity in scikit-learn, but it keeps throwing these warnings. Can someone explain what they mean, and why they only appear when I fit a KNN model with cosine similarity and not with any other distance metric?

Code:

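import time
import numpy
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# X: iterable of raw text documents (defined elsewhere, not shown)
t0 = time.time()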
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

vectorizer = TfidfVectorizer()
vec_fit = vectorizer.fit_transform(X)

t1 = time.time()
total = t1-t0
print "TF-IDF built:", total

#######################------------------------############################

t0 = time.time()
nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity)
nbrs.fit(X_train_tfidf.toarray())#,Y)
# KD_TREE won't work here because it doesn't work with a sparse matrix; given a dense matrix, it throws a memory error

t1 = time.time()
total = t1-t0
print "KNN Built:", total

Repeated warning message:

C:\Anaconda2\lib\site-packages\sklearn\utils\validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

Following the suggestion in the warning, I tried this:

nbrs = NearestNeighbors(n_neighbors=20, algorithm='auto', metric=cosine_similarity)
nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1))

which throws the following error:

Traceback (most recent call last):
  File ".\tf-idf.py", line 54, in <module>
    nbrs.fit(numpy.array(X_train_tfidf).reshape(1, -1))
  File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 816, in fit
    return self._fit(X)
  File "C:\Miniconda2\lib\site-packages\sklearn\neighbors\base.py", line 221, in _fit
    X = check_array(X, accept_sparse='csr')
  File "C:\Miniconda2\lib\site-packages\sklearn\utils\validation.py", line 373, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
silent_dev

1 Answer


It does not make sense to me that this warning would not appear with other metrics (such as linear_kernel). My guess is that this is something the developers forgot to update, because both linear_kernel and cosine_similarity are kernel operations.

To the matter at hand: you are getting this warning because the fit() method expects a 2-dimensional array, but you are passing a 1-dimensional one. For instance, X_train_tfidf=np.array([1,2,3,4.234,213.2]) will raise this warning because it has shape (5,). On the other hand, X_train_tfidf=np.array([[1,2,3,4.234,213.2]]) will not, because it has shape (1, 5) and is therefore 2-dimensional.

What the warning message suggests is to take your 1-dimensional array and convert it to a 2-dimensional one, for example X_train_tfidf=np.array([1,2,3,4.234,213.2]).reshape(1, -1), which is equivalent to X_train_tfidf=np.array([[1,2,3,4.234,213.2]]).
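
The shapes involved can be checked directly in NumPy. This is only a small sketch that reuses the example values from this answer; the variable names flat, row, col and nested are made up for illustration.

```python
import numpy as np

# A plain list of numbers gives a 1-dimensional array.
flat = np.array([1, 2, 3, 4.234, 213.2])

# reshape(1, -1): treat the data as a single sample with five features.
row = flat.reshape(1, -1)

# reshape(-1, 1): treat the data as five samples with one feature each.
col = flat.reshape(-1, 1)

# Nested brackets build the 2-dimensional "single sample" form directly.
nested = np.array([[1, 2, 3, 4.234, 213.2]])

print(flat.shape, row.shape, col.shape)
print(np.array_equal(row, nested))
```

Which of the two reshapes is right depends on whether the values are one sample or one feature; for a single TF-IDF document vector it would be reshape(1, -1).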

Kernel matrices are basically children of linear algebra and involve matrix operations, which are 2-dimensional by default.
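
To illustrate why a kernel operation works on 2-dimensional input, here is a rough NumPy sketch of a cosine-similarity matrix computed by hand. It is not scikit-learn's implementation: the helper name cosine_kernel and the sample data are hypothetical, and the sketch assumes no row is all zeros.

```python
import numpy as np

def cosine_kernel(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B.

    Hypothetical helper for illustration only; assumes no all-zero rows.
    """
    # Promote 1-dimensional input to a single row, like reshape(1, -1).
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # Scale every row to unit length.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    # One matrix product yields all pairwise similarities at once.
    return A_unit.dot(B_unit.T)

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
K = cosine_kernel(X, X)
print(K.shape)  # (3, 3)
```

The result always has one row per sample in A and one column per sample in B, so even a single vector has to be presented as a matrix with one row.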

Hope it makes sense, if not, please shout.

kazAnova