
I am trying to solve a multilabel classification problem as follows:

        import pickle

        from sklearn.preprocessing import MultiLabelBinarizer
        from sklearn.cross_validation import train_test_split
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.linear_model import LogisticRegression

        traindf = pickle.load(open("traindata.pkl", "rb"))

        X = traindf['Col1']
        X = MultiLabelBinarizer().fit_transform(X)

        y = traindf['Col2']
        y = MultiLabelBinarizer().fit_transform(y)

        Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)

        clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(Xtrain, ytrain)

        print "One vs rest accuracy: %.3f" % clf.score(Xvalidate, yvalidate)

This way, I always get 0 accuracy. Please point out if I am doing something wrong; I am new to multilabel classification. Here is what my data looks like:

Col1                  Col2
asd dfgfg             [1,2,3]
poioi oiopiop         [4]

EDIT

Thanks for your help @lejlot. I think I am getting the hang of it. Here is what I tried:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier 
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

tdf = pd.read_csv("mul.csv", index_col="DocID",error_bad_lines=False)

print tdf

So my input data looks like:

DocID   Content           Tags    
1       abc abc abc       [1]
2       asd asd asd       [2]
3       abc abc asd     [1,2]
4       asd asd abc     [1,2]
5       asd abc qwe   [1,2,3]
6       qwe qwe qwe       [3]
7       qwe qwe abc     [1,3]
8       qwe qwe asd     [2,3]

This is just some test data I created. Then I do:

text_clf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, n_iter=5, random_state=42)),
])
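As an aside, the pipeline above is never actually used below; here is a sketch of how it could be fitted end-to-end with some made-up documents and tags (note that SGDClassifier's n_iter parameter is called max_iter in newer scikit-learn versions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# made-up documents and tags for illustration
docs = ["abc abc abc", "asd asd asd", "abc abc asd", "qwe qwe qwe"]
tags = [[1], [2], [1, 2], [3]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one 0/1 column per tag

# one binary SGD classifier per tag, on top of the tf-idf features
clf = OneVsRestClassifier(Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, max_iter=5, random_state=42)),
]))
clf.fit(docs, Y)
print(clf.predict(["abc qwe qwe"]))  # one row of 0/1 flags, one per tag
```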

t=TfidfVectorizer()
X=t.fit_transform(tdf["Content"]).toarray()
print X

this gives me

[[ 1.          0.          0.        ]
 [ 0.          1.          0.        ]
 [ 0.89442719  0.4472136   0.        ]
 [ 0.4472136   0.89442719  0.        ]
 [ 0.55247146  0.55247146  0.62413987]
 [ 0.          0.          1.        ]
 [ 0.40471905  0.          0.91444108]
 [ 0.          0.40471905  0.91444108]]

then

y=tdf['Tags']
y=MultiLabelBinarizer().fit_transform(y)

print y

gives me

[[0 1 0 0 1 1]
 [0 0 1 0 1 1]
 [1 1 1 0 1 1]
 [1 1 1 0 1 1]
 [1 1 1 1 1 1]
 [0 0 0 1 1 1]
 [1 1 0 1 1 1]
 [1 0 1 1 1 1]]

Here I am wondering: why are there 6 columns? Shouldn't there be only 3? Anyway, then I also created a test data file:

sdf=pd.read_csv("multest.csv", index_col="DocID",error_bad_lines=False)
print sdf

so this looks like

DocID  Content        PredTags             
34     abc abc qwe    [1,3]
35     asd abc asd    [1,2]
36     abc abc abc      [1]

I have the PredTags column to check for accuracy. So finally I fit and predict as:

clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.transform(sdf["Content"]).toarray())
print predicted

which gives me

[[1 1 1 1 1 1]
 [1 1 1 0 1 1]
 [1 1 1 0 1 1]]

Now, how do I know which tags are being predicted? How can I check the accuracy against my PredTags column?

Update

Thanks a lot @lejlot :) I also managed to get the accuracy as follows:

sdf = pd.read_csv("multest.csv", index_col="DocID", error_bad_lines=False)
print sdf

# reuse the vectorizer fitted on the training data (transform, not fit_transform)
Xt = t.transform(sdf["Content"]).toarray()
predicted = clf.predict(Xt)
print predicted

ty = sdf["PredTags"]
ty = [map(int, list(_y.replace(',', '').replace('[', '').replace(']', ''))) for _y in ty]

yt = MultiLabelBinarizer().fit_transform(ty)

print Xt
print yt
print "One vs rest accuracy: %.3f" % clf.score(Xt, yt)

I just had to binarize the test set's tag column as well :)
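One caveat worth noting: the vectorizer and binarizer fitted on the training data should be reused via transform on the test set, since a fresh fit_transform can produce different columns or a different column order. A minimal sketch with made-up data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# made-up training and test data for illustration
train_docs = ["abc abc", "asd asd", "abc asd", "qwe qwe"]
train_tags = [[1], [2], [1, 2], [3]]
test_docs = ["abc qwe"]

vect = TfidfVectorizer()
mlb = MultiLabelBinarizer()

# fit on the training data only
Xtr = vect.fit_transform(train_docs)
Ytr = mlb.fit_transform(train_tags)

clf = OneVsRestClassifier(LogisticRegression()).fit(Xtr, Ytr)

# transform (not fit_transform) keeps the same columns for the test set
Xte = vect.transform(test_docs)
print(clf.predict(Xte).shape)  # (1, 3): one column per training tag
```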

AbtPst
  • this is not how you should work with **text** in classification. read scikitlearn manual on working with text before proceeding to building any model. – lejlot Dec 15 '15 at 14:58
  • do you mean http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html – AbtPst Dec 15 '15 at 15:06
  • thanks, i will check it out. however at first glance it does not seem to have a use case for multilabel classification. – AbtPst Dec 15 '15 at 15:08
  • the multilabel classification is not an issue, you incorrectly work with text. Multilabel classification (in a basic form) **is just a set of independent binary classifiers**, nothing more. – lejlot Dec 15 '15 at 15:09
  • thanks man, i will try to learn. – AbtPst Dec 15 '15 at 15:09

1 Answer

The actual problem is the way you work with text: you should extract some kind of features and use them as the text representation. For example, you can use a bag-of-words representation, tf-idf, or any more complex approach.

So what is happening now? You call MultiLabelBinarizer on a list of strings; since strings are iterables of characters, scikit-learn builds the set of all characters in each string, leading to a set-of-letters representation. For example

from sklearn.preprocessing import MultiLabelBinarizer 
X = ['abc cde', 'cde', 'fff']
print MultiLabelBinarizer().fit_transform(X)

gives you

array([[1, 1, 1, 1, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 1]])

        |  |  |  |  |  |  |
        v  v  v  v  v  v  v

        _  a  b  c  d  e  f

(where _ stands for the space character, which sorts before the letters).

Consequently classification is nearly impossible as this does not capture any meaning of your texts.

You could use, for example, count vectorization (bag of words):

from sklearn.feature_extraction.text import CountVectorizer
print CountVectorizer().fit_transform(X).toarray()

gives you

      [[1  1  0]
       [0  1  0]
       [0  0  1]]

        |   |   |
        v   v   v
       abc cde fff

Update

Finally, to get predictions as labels rather than their binarization, you need to keep a reference to your binarizer:

labels = MultiLabelBinarizer()
y = labels.fit_transform(y)

and later on

clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01)).fit(X,y)
predicted = clf.predict(t.transform(sdf["Content"]).toarray())
print labels.inverse_transform(predicted)
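For instance, a toy round trip:

```python
from sklearn.preprocessing import MultiLabelBinarizer

labels = MultiLabelBinarizer()
Y = labels.fit_transform([[1, 2], [1], [3]])

print(Y)                            # [[1 1 0] [1 0 0] [0 0 1]]
print(labels.inverse_transform(Y))  # [(1, 2), (1,), (3,)]
```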

Update 2

If you only have three classes, then the vectors should have 3 elements; yours have 6, so check what you are passing as "y". There is probably some mistake in your data:

from sklearn.preprocessing import MultiLabelBinarizer
MultiLabelBinarizer().fit_transform([[1,2], [1], [3], [2]])

gives

array([[1, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0]])

as expected.

My best guess is that your "tags" are also strings, so you actually call

MultiLabelBinarizer().fit_transform(["[1,2]", "[1]", "[3]", "[2]"])

which leads to

array([[1, 1, 1, 0, 1, 1],
       [0, 1, 0, 0, 1, 1],
       [0, 0, 0, 1, 1, 1],
       [0, 0, 1, 0, 1, 1]])

        |  |  |  |  |  | 
        v  v  v  v  v  v  

        ,  1  2  3  [  ] 

And these are your 6 classes: three true ones, two "trivial" classes "[" and "]" which are always present, and the nearly trivial class "," which appears for every object belonging to more than one class.

You should convert your tags to actual lists first, for example by:

y = [map(int, list(_y.replace(',','').replace('[','').replace(']',''))) for _y in y]
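Note that this only handles single-digit tags (a tag like 12 would be split into 1 and 2); parsing each string with ast.literal_eval is a more robust alternative:

```python
import ast

raw = ["[1,2]", "[1]", "[3]", "[12]"]  # note the two-digit tag
tags = [ast.literal_eval(s) for s in raw]
print(tags)  # [[1, 2], [1], [3], [12]]
```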
lejlot
  • so if i use CountVectorizer().fit_transform(X).toarray() as argument to my classifier.fit() will it work? – AbtPst Dec 15 '15 at 15:43
  • classification is not an algorithmic task, there is no such thing as "if I do X, will it work" - this is just one, of possibly thousands of steps required to do it right. This is just the most basic approach which **might** do something. – lejlot Dec 15 '15 at 16:09
  • thanks a lot for your help man. please see the edit. i feel that i am close to a solution. please correct me if i am wrong. really appreciate it – AbtPst Dec 15 '15 at 16:20
  • perfect man :) thank you so much. you are a really good teacher! – AbtPst Dec 15 '15 at 18:02