I am trying to solve a multilabel classification problem as follows:
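import pickle
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
# pickle.load needs an open file object, not a file name
traindf = pickle.load(open("traindata.pkl", "rb"))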
X = traindf['Col1']
X=MultiLabelBinarizer().fit_transform(X)
y = traindf['Col2']
y= MultiLabelBinarizer().fit_transform(y)
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
from sklearn.linear_model import LogisticRegression
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain,ytrain)
print "One vs rest accuracy: %.3f" % clf.score(Xvalidate,yvalidate)
With this approach I always get 0 accuracy. Please point out if I am doing something wrong; I am new to multilabel classification. Here is what my data looks like:
Col1            Col2
asd dfgfg       [1,2,3]
poioi oiopiop   [4]
EDIT
Thanks for your help @lejlot. I think I am getting the hang of it. Here is what I tried:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
tdf = pd.read_csv("mul.csv", index_col="DocID",error_bad_lines=False)
print tdf
My input data looks like this:
DocID  Content      Tags
1      abc abc abc  [1]
2      asd asd asd  [2]
3      abc abc asd  [1,2]
4      asd asd abc  [1,2]
5      asd abc qwe  [1,2,3]
6      qwe qwe qwe  [3]
7      qwe qwe abc  [1,3]
8      qwe qwe asd  [2,3]
This is just some test data I created. Then I do:
text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, n_iter=5, random_state=42)),
])
t=TfidfVectorizer()
X=t.fit_transform(tdf["Content"]).toarray()
print X
This gives me:
[[ 1. 0. 0. ]
[ 0. 1. 0. ]
[ 0.89442719 0.4472136 0. ]
[ 0.4472136 0.89442719 0. ]
[ 0.55247146 0.55247146 0.62413987]
[ 0. 0. 1. ]
[ 0.40471905 0. 0.91444108]
[ 0. 0.40471905 0.91444108]]
Then:
y=tdf['Tags']
y=MultiLabelBinarizer().fit_transform(y)
print y
This gives me:
[[0 1 0 0 1 1]
[0 0 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]
[1 1 1 1 1 1]
[0 0 0 1 1 1]
[1 1 0 1 1 1]
[1 0 1 1 1 1]]
Here I am wondering why there are 6 columns. Shouldn't there be only 3? Anyway, I then also created a test data file:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
It looks like this:
DocID  Content      PredTags
34     abc abc qwe  [1,3]
35     asd abc asd  [1,2]
36     abc abc abc  [1]
I have the PredTags column to check for accuracy. Finally, I fit and predict as follows:
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
This gives me:
[[1 1 1 1 1 1]
[1 1 1 0 1 1]
[1 1 1 0 1 1]]
Now, how do I know which tags are being predicted? How can I check the accuracy against my PredTags column?
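On the six columns seen earlier: one likely cause (an assumption, since it depends on how the CSV is read) is that read_csv returns each Tags cell as a string such as "[1,2]", so MultiLabelBinarizer treats every character, including "[", "]" and ",", as its own label. The first output row, [0 1 0 0 1 1] for "[1]", is consistent with that. Below is a minimal sketch of the idea. It uses ast.literal_eval to parse the strings (my choice, not from the original code) and inverse_transform to turn indicator rows back into tags. The `predicted` matrix is made up for illustration.

```python
import ast

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Tags as read_csv would return them: strings, not lists (assumption).
raw_tags = ["[1]", "[2]", "[1,2]", "[1,2]", "[1,2,3]", "[3]", "[1,3]", "[2,3]"]

# Binarizing the raw strings makes every character a separate label.
char_classes = list(MultiLabelBinarizer().fit(raw_tags).classes_)
print(char_classes)  # six "labels": ',', '1', '2', '3', '[', ']'

# Parse each string into a real list of ints, then binarize.
tags = [ast.literal_eval(s) for s in raw_tags]
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)
print(mlb.classes_)  # the three real labels
print(y.shape)       # one row per document, one column per label

# Hypothetical prediction matrix, one row per test document.
predicted = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 0]])

# Map indicator rows back to tuples of tags using the fitted binarizer.
predicted_tags = mlb.inverse_transform(predicted)
print(predicted_tags)
```

Keeping the fitted `mlb` object around is what makes the mapping back to tags possible; a fresh MultiLabelBinarizer() would not know the column order.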
Update
Thanks a lot @lejlot :) I also managed to get the accuracy as follows:
sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf
predicted = clf.predict(t.fit_transform(sdf["Content"]).toarray())
print predicted
ty=sdf["PredTags"]
ty = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in ty]
yt=MultiLabelBinarizer().fit_transform(ty)
Xt=t.fit_transform(sdf["Content"]).toarray()
print Xt
print yt
print "One vs rest accuracy: %.3f" % clf.score(Xt,yt)
I just had to binarize the test set prediction column as well :)
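For completeness, here is a sketch of the whole evaluation under a few assumptions. The tags are already parsed into lists. The vectorizer and binarizer are fitted on the training data only and reused on the test data via transform, because refitting on the test set (as the code above does with fit_transform) can change the feature columns and IDF weights. LogisticRegression uses its default settings rather than C=0.01. Note that score on a multilabel problem reports subset accuracy, so a row counts as correct only if every tag matches. hamming_loss is shown as a softer, per-label alternative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data copied from the tables above, with tags already parsed.
train_docs = ["abc abc abc", "asd asd asd", "abc abc asd", "asd asd abc",
              "asd abc qwe", "qwe qwe qwe", "qwe qwe abc", "qwe qwe asd"]
train_tags = [[1], [2], [1, 2], [1, 2], [1, 2, 3], [3], [1, 3], [2, 3]]
test_docs = ["abc abc qwe", "asd abc asd", "abc abc abc"]
test_tags = [[1, 3], [1, 2], [1]]

vect = TfidfVectorizer()
mlb = MultiLabelBinarizer()

# Fit on the training data only.
X_train = vect.fit_transform(train_docs)
y_train = mlb.fit_transform(train_tags)

# Reuse the fitted objects on the test data (transform, not fit_transform).
X_test = vect.transform(test_docs)
y_test = mlb.transform(test_tags)

clf = OneVsRestClassifier(LogisticRegression()).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# score() is subset accuracy: every label in a row must match.
subset_acc = clf.score(X_test, y_test)
# Hamming loss is the fraction of individual label slots that are wrong.
h_loss = hamming_loss(y_test, y_pred)

print("Subset accuracy: %.3f" % subset_acc)
print("Hamming loss: %.3f" % h_loss)
print(mlb.inverse_transform(y_pred))
```

With so little data the numbers themselves mean little; the point is only that train and test share one vectorizer and one binarizer.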