I downloaded the data.

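from sklearn import datasets, svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Two-category subset of the 20 Newsgroups corpus
news = datasets.fetch_20newsgroups(subset='all', categories=['alt.atheism', 'sci.space'])
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news.data)  # sparse TF-IDF matrix
y = news.target
print(X.shape)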

The shape of X is (1786, 28382).

Next I trained the model and looked at the shape of its coefficients:

clf = svm.SVC(kernel='linear', random_state=241, C=1e-5)
clf.fit(X, y)
data = clf.coef_[0].data
print(data.shape)

The shape is (27189,).

Why is the number of features different?

IT_Nike
  • Why do you even use clf.coef_[0].data? That is supposed to be a **buffer**, not your data. Print clf.coef_.shape instead. – lejlot Sep 24 '16 at 10:30
  • @lejlot Yes, you are right, the shape of coef_ is (1, 28382). But the shape of clf.coef_.data is (27189,) too. How can I get all the data? – IT_Nike Sep 24 '16 at 10:47
  • coef_ **is** your data. Leave the data field alone :-) Just take coef_[0][i]. – lejlot Sep 24 '16 at 11:13
  • @lejlot Thanks a lot! I can't iterate over coef_[0], but clf.coef_[0].toarray() works fine. – IT_Nike Sep 24 '16 at 11:34
  • You can iterate, but not directly: if the data is **sparse**, so are the coefficients, and iterating over sparse arrays in Python is a bit nontrivial. – lejlot Sep 24 '16 at 12:23
  • @lejlot Sorry, can you help me once more? The shape of coef_.indices is (27189,) too. This answer advises using .data and .indices: http://stackoverflow.com/a/10364941/5446420 How can I get the indices? Thanks – IT_Nike Sep 24 '16 at 12:36
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124100/discussion-between-it-nike-and-lejlot). – IT_Nike Sep 24 '16 at 12:40
  • If you want to know how to iterate over a sparse array, just ask a separate question, as this is a different problem (and probably already has lots of answers on SO). Calling "toarray()" is a good idea here. – lejlot Sep 24 '16 at 12:54
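
On the .data and .indices question raised in the comments, here is a minimal sketch, assuming the coefficients are stored in SciPy's CSR format (worth confirming with type(clf.coef_)). The toy matrix `row` is a hypothetical stand-in for clf.coef_, not the real model. In a single-row CSR matrix, `.indices[k]` is the column (feature) index of the value `.data[k]`, so the two arrays can be walked in parallel.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for a sparse clf.coef_: one row, five features.
row = sparse.csr_matrix(np.array([[0.5, 0.0, -1.2, 0.0, 3.0]]))

# Pair every stored value with the feature index it belongs to.
stored = [(int(j), float(v)) for j, v in zip(row.indices, row.data)]
print(stored)

# Rebuild one weight per feature by scattering the stored values into zeros.
full = np.zeros(row.shape[1])
full[row.indices] = row.data

# The simpler route mentioned in the comments: convert to dense first.
dense = row.toarray().ravel()
print(np.array_equal(full, dense))
```

For a matrix of this size, converting with toarray() is probably the easier option; walking .indices and .data mainly pays off when the dense array would be too large.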

1 Answer

In short, everything is fine: your weight matrix is clf.coef_, and it has the expected shape, (1, 28382). It is a regular NumPy array, or a SciPy sparse matrix if the training data is sparse, as it is here. You can do all the usual operations on it, index it, and so on. What you tried to use, the .data attribute, holds the array's internal storage, which can have a different shape. For a sparse matrix it contains only the explicitly stored entries (typically the nonzero ones), which is why you see 27189 values instead of 28382. You should not use this internal attribute to read out the weights; it is exposed for low-level methods. Index coef_ itself, or call toarray() on it to get a dense array with one weight per feature.
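
To illustrate the difference between the logical shape and the internal storage, here is a minimal sketch using SciPy directly. The toy matrix `coef` is a hypothetical stand-in for clf.coef_ with made-up values, not output from the actual model.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for clf.coef_: one row, six features, two of them zero.
dense_weights = np.array([[0.5, 0.0, -1.2, 0.0, 3.0, 0.7]])
coef = sparse.csr_matrix(dense_weights)

print(coef.shape)       # logical shape of the matrix
print(coef.data.shape)  # only the explicitly stored (nonzero) entries
print(coef.nnz)         # number of stored entries

# Convert to a dense 1-D array to get exactly one weight per feature.
weights = coef.toarray().ravel()
print(weights.shape)
```

The gap between `coef.shape[1]` and `coef.data.shape[0]` here is the number of zero weights, which mirrors the 28382 versus 27189 mismatch in the question.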

lejlot