import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC

data = r'C:\Users\...\Downloads\news_v1.xlsx'

df = pd.read_excel(data)
df = pd.DataFrame(df.groupby(["id", "doc"]).label.apply(list)).reset_index()

X = np.array(df.doc)
y = np.array(df.label)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

mlb = preprocessing.MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)

classifier = Pipeline([
('vectorizer', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y_train)
predicted = classifier.predict(X_test)

Y_test = mlb.fit_transform(y_test)

print("Y_train: ", Y_train.shape)
print("Y_test: ", Y_test.shape)
print("Predicted: ", predicted.shape)
print("Accuracy Score: ", accuracy_score(Y_test, predicted))

I can't compute any metrics because Y_test has different matrix dimensions after fit_transform with MultiLabelBinarizer.

Results and error:

Y_train:  (1278, 49)
Y_test:  (630, 42)
Predicted:  (630, 49)
Traceback (most recent call last):
  File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 41, in <module>
    print("Accuracy Score: ", accuracy_score(Y_test, predicted))
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\classification.py", line 174, in accuracy_score
    differing_labels = count_nonzero(y_true - y_pred, axis=1)
  File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 361, in __sub__
    raise ValueError("inconsistent shapes")
ValueError: inconsistent shapes

Looking at the printed shapes, Y_test differs from the others. What am I doing wrong, and why does MultiLabelBinarizer return a different size for Y_test? Thanks in advance for the help!

Edit: New error:

Traceback (most recent call last):
  File "C:/Users/../PycharmProjects/MultiAutoTag/classifier.py", line 47, in <module>
    Y_test = mlb.transform(y_test)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 763, in transform
    yt = self._transform(y, class_to_index)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in _transform
    indices.extend(set(class_mapping[label] for label in labels))
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 787, in <genexpr>
    indices.extend(set(class_mapping[label] for label in labels))
KeyError: 'Sanction'

This is what y_test looks like:

print(y_test)

[['App'] ['Contract'] ['Pay'] ['App'] 
 ['App'] ['App']
 ['Reports'] ['Reports'] ['Executive', 'Pay']
 ['Change'] ['Reports']
 ['Reports'] ['Issue']]
rescot

1 Answer


You should only call transform() on test data. Never call fit() or its variants such as fit_transform() or fit_predict() on it; those should be used only on training data.

So change the line:

Y_test = mlb.fit_transform(y_test)

to

Y_test = mlb.transform(y_test)

Explanation:

When you call fit() or fit_transform(), the mlb discards what it learned previously and learns the newly supplied data instead. This is a problem when Y_train and Y_test contain different labels, as they do in your case.

In your case, Y_train has 49 distinct labels, whereas Y_test has only 42. That does not mean Y_test is simply 7 labels short of Y_train. Y_test may contain an entirely different set of labels which, when binarized, happens to produce 42 columns, and those columns would not line up with the training columns, which will affect the results.
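As a minimal sketch of the difference, the toy example below fits the binarizer on training labels only and then compares transform() with a refit on the test labels. The label lists are made up for illustration and are not the asker's data.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels, invented for illustration only
y_train = [["App"], ["Contract", "Pay"], ["Reports"], ["Pay"]]
y_test = [["App"], ["Pay"]]

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)  # learns the label set from the training data
Y_test = mlb.transform(y_test)        # reuses the same columns, in the same order

# Refitting on the test labels learns a different, smaller label set
Y_test_refit = MultiLabelBinarizer().fit_transform(y_test)

print("classes:", list(mlb.classes_))
print("Y_train:", Y_train.shape)
print("Y_test (transform):", Y_test.shape)
print("Y_test (refit):", Y_test_refit.shape)
```

With transform(), the test matrix has the same number of columns as the training matrix, so it can be compared against the classifier's predictions. The refit matrix has fewer columns, which is the shape mismatch reported in the question.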

Vivek Kumar
  • Thanks so much and for the explanation too! it works!! I have a new error but will accept it as the answer. thanks man – rescot Jun 21 '17 at 09:53
  • @otje You can ask the new error by editing this question or in a new question. – Vivek Kumar Jun 21 '17 at 09:55
  • I've edited the question with the new error. Thanks for the help! – rescot Jun 21 '17 at 10:07
  • @otje This error means that there are some new labels in test which are not in train. That means the estimator will not learn to classify them. So how would you want to handle them? – Vivek Kumar Jun 21 '17 at 11:14
  • I've decided to use StratifiedShuffleSplit to share the target classes equally between train and test. I have a class with only one instance, which cannot be split; any alternative solution would be very much appreciated. Thanks! [sklearn]: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit – rescot Jun 21 '17 at 14:35
  • @otje I was asking what you would do in a real-world scenario, when a new class emerges in real-world data that your algorithm has not been trained on. If you are sure that will not happen, you can use MultiLabelBinarizer on the whole label set (`y = mlb.fit_transform(y)`) before the train_test_split and then use the result for training and testing. – Vivek Kumar Jun 22 '17 at 02:35
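
The two ways of dealing with the KeyError discussed in the comments can be sketched roughly as follows. The documents and labels are toy values invented for illustration, and dropping unseen labels (option 2) is an assumption about one reasonable policy, not something the thread prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration only
docs = np.array(["doc a", "doc b", "doc c", "doc d", "doc e", "doc f"])
labels = [["App"], ["Contract"], ["Pay"], ["Executive", "Pay"], ["Sanction"], ["Reports"]]

# Option 1: binarize the full label set before splitting, so both
# halves share the same columns no matter which rows land where.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
X_train, X_test, Y_train, Y_test = train_test_split(
    docs, Y, test_size=0.33, random_state=42
)
print("option 1 columns:", Y_train.shape[1], Y_test.shape[1])

# Option 2: fit on training labels only and drop test labels the
# binarizer has never seen before calling transform().
train_labels = [["App"], ["Pay"], ["Reports"]]
test_labels = [["App", "Sanction"], ["Pay"]]
mlb_train = MultiLabelBinarizer().fit(train_labels)
known = set(mlb_train.classes_)
filtered = [[label for label in row if label in known] for row in test_labels]
Y_test_known = mlb_train.transform(filtered)
print("option 2 matrix:", Y_test_known.tolist())
```

Option 1 only fixes the column layout. A label that ends up solely in the test rows still has no positive training examples, so the classifier cannot learn it. Option 2 mirrors what would happen in production when an unknown label appears, at the cost of silently ignoring that label during evaluation.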