
First, I fit a CountVectorizer on a corpus of SMS messages:

from sklearn.feature_extraction.text import CountVectorizer
clf = CountVectorizer()
X_desc = clf.fit_transform(X).toarray()

It seems to work fine:

X.shape = (5574,)
X_desc.shape = (5574, 8713)

But when I apply the transform method to a single line of text, I expect a result with a single row, i.e. shape (1, 8713). Instead I get:

str2 = 'Have you visited the last lecture on physics?'
print len(str2), clf.transform(str2).toarray().shape

52 (52, 8713)

What is going on here? Also, all the values in the result are zeros.

maxymoo
Rocketq

1 Answer


You always need to pass an iterable of documents (such as a list or array of strings) to transform. If you just want to transform a single string, wrap it in a one-element list and then extract the single row of the result:

clf.transform([str2])[0]
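As a minimal sketch of this, assuming a small made-up corpus in place of the SMS data (the names `corpus`, `vectorizer` and `new_text` are illustrative, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-message corpus standing in for the SMS data.
corpus = [
    "have you visited the lecture",
    "the last lecture on physics",
    "call me after the lecture",
]

vectorizer = CountVectorizer()
X_desc = vectorizer.fit_transform(corpus)

new_text = "Have you visited the last lecture on physics?"

# Wrap the single string in a list so it is treated as ONE document.
row = vectorizer.transform([new_text])

n_features = len(vectorizer.vocabulary_)
print(row.shape == (1, n_features))  # one row, one column per vocabulary term

# Dense 1-D array of term counts for this single document.
dense_row = row.toarray()[0]
print(dense_row.sum() > 0)           # the known words are counted
```

The result of transform stays sparse, so calling `toarray()` before indexing is only sensible for small vocabularies like this one.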

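Incidentally, the reason you are getting a 2-dimensional array with one row per character is that a Python string is itself iterable, character by character, so the vectoriser treats your string as a collection of documents in which each character is a separate document. The values are all zeros because the default tokeniser only keeps tokens of two or more characters, so a single-character document never contains a token.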
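The mechanism can be imitated with the standard library alone. This is only a sketch that mimics the behaviour rather than calling scikit-learn: `tokens_per_document` is a hypothetical helper, and the regular expression is CountVectorizer's default `token_pattern`.

```python
import re

# CountVectorizer's default token pattern: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokens_per_document(raw_documents):
    """Tokenize each item produced by iterating over raw_documents."""
    return [TOKEN_PATTERN.findall(doc.lower()) for doc in raw_documents]

text = "Have you visited the last lecture on physics?"

# Iterating over a bare string yields one "document" per character,
# and no single character satisfies the two-character minimum.
per_char = tokens_per_document(text)
print(len(per_char) == len(text))  # one entry (row) per character
print(any(per_char))               # False: every entry is an empty list

# Wrapping the string in a list yields one document with real tokens.
per_doc = tokens_per_document([text])
print(per_doc)
```

Every per-character entry is empty, which corresponds to the all-zero rows in the question, while the list-wrapped call produces a single entry containing the actual words.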

maxymoo