
I am working on a multilabel classification problem, as follows:

import pandas as pd
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier 
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

tdf = pd.read_csv("data.csv", index_col="DocID",error_bad_lines=False)[:8]

print tdf

gives me

DocID   Content             Tags           
1       some text here...   [70]
2       some text here...   [59]
3       some text here...  [183]
4       some text here...  [173]
5       some text here...   [71]
6       some text here...   [98]
7       some text here...  [211]
8       some text here...  [188]

Then I identify and transform the columns as needed:

X=tdf["Content"]
y=tdf["Tags"]

t=TfidfVectorizer()
print t.fit_transform(X).toarray()
print MultiLabelBinarizer().fit_transform(y)

gives me

[[ 0.          0.01058315  0.         ...,  0.00529157  0.          0.        ]
 [ 0.          0.00947091  0.         ...,  0.00473545  0.          0.        ]
 [ 0.01190602  0.00950931  0.         ...,  0.00475465  0.          0.        ]
 ..., 
 [ 0.          0.01314373  0.         ...,  0.00657187  0.          0.        ]
 [ 0.          0.01200425  0.37574455 ...,  0.00600212  0.01502978  0.        ]
 [ 0.          0.02206688  0.         ...,  0.01103344  0.          0.        ]]

 [[1 0 0 0 0 1 0 0 1 1]
 [0 0 0 0 1 0 0 1 1 1]
 [0 1 0 1 0 0 1 0 1 1]
 [0 1 0 1 0 1 0 0 1 1]
 [0 1 0 0 0 1 0 0 1 1]
 [0 0 0 0 0 0 1 1 1 1]
 [0 1 1 0 0 0 0 0 1 1]
 [0 1 0 0 0 0 1 0 1 1]]

Looking at my data, shouldn't there be only 8 columns here for y? Why are there 10 columns?

Then I split, transform, fit and score:

Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)

Xtrain=t.fit_transform(Xtrain).toarray()
Xvalidate=t.fit_transform(Xvalidate).toarray()

ytrain=MultiLabelBinarizer().fit_transform(ytrain)
yvalidate=MultiLabelBinarizer().fit_transform(yvalidate)

clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain, ytrain)

print "One vs rest accuracy: %.3f"  % clf.score(Xvalidate,yvalidate)

but I get the error:

print "One vs rest accuracy: %.3f"  % clf.score(Xvalidate,yvalidate)
  File "X:\Anaconda2\lib\site-packages\sklearn\base.py", line 310, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "X:\Anaconda2\lib\site-packages\sklearn\multiclass.py", line 325, in predict
    indices.extend(np.where(_predict_binary(e, X) > thresh)[0])
  File "X:\Anaconda2\lib\site-packages\sklearn\multiclass.py", line 83, in _predict_binary
    score = np.ravel(estimator.decision_function(X))
  File "X:\Anaconda2\lib\site-packages\sklearn\linear_model\base.py", line 249, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 1546 features per sample; expecting 1354

What does this error mean? Could it be the data? I have worked with the exact same algorithm on similar data (same column and data format) and did not have a problem. Also, why does the fit call work?

What am I doing wrong here?

EDIT

So in my Tags column, the data is being read as a string, hence the two extra columns in y. I tried

X=tdf["Content"]
y=tdf["Tags"]
y = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in y]

to accommodate multiple values, but I still get the same error. At least I get the correct number of columns for y.
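For reference, here is a minimal sketch of a cleaner way to parse the Tags strings into lists of ints, assuming each cell is a string such as "[70]" or "[70, 12]" (note that mapping int over list(...) as above turns "183" into [1, 8, 3] rather than [183]):

import ast

# assumes each Tags cell is a string literal like "[70]" or "[70, 12]"
y = [ast.literal_eval(_y) for _y in tdf["Tags"]]  # "[183]" -> [183]
print MultiLabelBinarizer().fit_transform(y)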

AbtPst
1 Answer

When you call fit_transform() you first adjust the feature extractor to the data (the fit part) and then transform the data (the transform part). By calling fit_transform() multiple times on the same feature extractor (with different data) you perform different fits, e.g. your TfidfVectorizer might learn one vocabulary for your training set and a completely different one for the validation set, which results in a different number of columns (a different number of unique words). You have to call fit_transform() on X and y first and split into training and validation sets afterwards (one fit, one transform). Alternatively, you can call fit_transform() to generate the training set and then just transform() to generate the validation set (one fit, multiple transforms).
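A minimal sketch of that second option (fit on the training data, transform only on the validation data), reusing the variable names from the question. Fitting the MultiLabelBinarizer on the full y before the split is an extra assumption here, so that tags appearing only in the validation half still get a column; y is assumed to already be parsed into lists of tag ids:

mlb = MultiLabelBinarizer().fit(y)   # learn the full label set once

Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)

t = TfidfVectorizer()
Xtrain = t.fit_transform(Xtrain).toarray()    # fit: learn the vocabulary from the training text
Xvalidate = t.transform(Xvalidate).toarray()  # transform only: reuse that vocabulary

ytrain = mlb.transform(ytrain)
yvalidate = mlb.transform(yvalidate)

clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain, ytrain)
print "One vs rest accuracy: %.3f" % clf.score(Xvalidate, yvalidate)

With the vocabulary learned only once, both matrices have the same number of columns, so the feature-count mismatch in the traceback goes away.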

aleju
  • Worked! No errors, but now I get 0 accuracy. Is that due to the data? – AbtPst Dec 15 '15 at 23:24
  • But wait, what if my test set contains some unseen terms? How will I predict then? – AbtPst Dec 15 '15 at 23:26
  • I think the TfidfVectorizer will just ignore words during transform that it hasn't seen during fit, so they shouldn't have any influence on the result. Not sure about the accuracy. If you have only 8 documents as examples, that won't be enough data. – aleju Dec 15 '15 at 23:35
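To illustrate the point in the last comment, a small sketch (not from the original thread): TfidfVectorizer's transform() simply drops terms that were not in the vocabulary learned during fit, so the column count stays the same:

t = TfidfVectorizer()
t.fit(["the cat sat", "the dog sat"])
# "unicorn" was never seen during fit, so it gets no column;
# the transformed row only covers the fitted vocabulary
print t.transform(["the unicorn sat"]).toarray()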