
I am evaluating different classifiers for my sentiment analysis model. I am looking at all available metrics, and while most achieve similar precision, recall, F1 and ROC-AUC scores, Linear SVM appears to get a perfect ROC-AUC score. Look at the chart below:

[chart: ROC-AUC scores for each classifier, with LSVC showing a perfect score]

Abbreviations: MNB=Multinomial Naive Bayes, SGD=Stochastic Gradient Descent, LR=Logistic Regression, LSVC=Linear Support Vector Classification

Here are the rest of the performance metrics for LSVC, which are very similar to the rest of the classifiers:

             precision    recall  f1-score   support

        neg       0.83      0.90      0.87     24979
        pos       0.90      0.82      0.86     25021

avg / total       0.87      0.86      0.86     50000

As you can see, the dataset is balanced between pos and neg comments.

Here is the relevant code:

def evaluate(classifier):
    predicted = classifier.predict(testing_text)
    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = np.max(probabilities, axis=1)

    pos_idx = np.where(predicted == 'pos')
    predicted_true_binary = np.zeros(predicted.shape)
    predicted_true_binary[pos_idx] = 1
    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)

    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)

    return fpr, tpr, auc, mean_acc, report, confusion_matrix

I am using predict_proba for all classifiers apart from LSVC, which uses decision_function instead (since it does not have a predict_proba method).
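
For reference, here is a minimal self-contained sketch (synthetic data, not my text pipeline) contrasting the two score sources: positive-class probabilities from predict_proba versus signed distances from decision_function. Both are accepted by roc_curve and roc_auc_score as a ranking.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn import metrics

# Synthetic stand-in data (the real pipeline works on text features)
X, y = make_classification(n_samples=200, random_state=0)

lr = LogisticRegression().fit(X, y)
lr_scores = lr.predict_proba(X)[:, 1]    # probability of the positive class

svm = LinearSVC().fit(X, y)
svm_scores = svm.decision_function(X)    # signed distance to the hyperplane

print(metrics.roc_auc_score(y, lr_scores))
print(metrics.roc_auc_score(y, svm_scores))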

What's going on?

EDIT: changes made following @Vivek Kumar's comments:

def evaluate(classifier):
    predicted = classifier.predict(testing_text)
    if isinstance(classifier.steps[2][1], LinearSVC):
        probabilities = np.array(classifier.decision_function(testing_text))
        scores = probabilities
    else:
        probabilities = np.array(classifier.predict_proba(testing_text))
        scores = probabilities[:, 1]  # NEW

    testing_category_array = np.array(testing_category)  # NEW
    pos_idx = np.where(testing_category_array == 'pos')
    predicted_true_binary = np.zeros(testing_category_array.shape)
    predicted_true_binary[pos_idx] = 1
    fpr, tpr, thresholds = metrics.roc_curve(predicted_true_binary, scores)
    auc = metrics.roc_auc_score(predicted_true_binary, scores)

    mean_acc = np.mean(predicted == testing_category)
    report = metrics.classification_report(testing_category, predicted)
    confusion_matrix = metrics.confusion_matrix(testing_category, predicted)

    return fpr, tpr, auc, mean_acc, report, confusion_matrix

This now yields this graph:

[updated ROC curve plot after the changes]
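
For context, `classifier.steps[2][1]` assumes the classifier is the third step of a scikit-learn Pipeline. A hypothetical construction matching that layout (the vectorizer/transformer choices and the training variable names are placeholders, not my exact setup):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical three-step pipeline whose last step is the estimator,
# matching the classifier.steps[2][1] access in evaluate()
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LinearSVC()),
])
classifier.fit(training_text, training_category)  # placeholder training data
fpr, tpr, auc, mean_acc, report, confusion_matrix = evaluate(classifier)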

turnip
  • You are using the predicted values for both `predicted_true_binary` and `scores`. Essentially you are comparing predictions with predictions in `roc_curve` and `roc_auc_score`, whereas the first argument should actually be the true labels (`testing_category`). – Vivek Kumar Feb 14 '18 at 11:50 (see the sketch after this comment thread)
  • Secondly, the `scores` should ideally be the probabilities of the positive class, not the maximum probability as you are doing in `np.max()`. – Vivek Kumar Feb 14 '18 at 11:52
  • 1) I see what you mean, I will correct that. 2) `predict_proba` returns a list of `(pos_prob, neg_prob)` - I am doing `max` so that the probability matches the category. Otherwise I'd end up with `neg` categories corresponding to `pos` probabilities, no? – turnip Feb 14 '18 at 11:58
  • max will return the maximum of the two. It will not tell you whether it's the probability of pos or neg. – Vivek Kumar Feb 14 '18 at 12:07
  • Yep, I think I understand now. After making both changes, here is the new graph: https://i.imgur.com/MHerBgB.png – turnip Feb 14 '18 at 12:16
  • Great. But I can't say unless I see the new complete code. – Vivek Kumar Feb 14 '18 at 12:30
  • @VivekKumar I have edited my post with the new changes – turnip Feb 14 '18 at 12:36
  • Good. Seems correct. No need to wrap probabilities in np.array; they are already np arrays. You can now add this as an answer and accept it instead of editing the question. – Vivek Kumar Feb 14 '18 at 12:47
  • Feel free to post this as an answer so you can get the rep; if you'd rather not, I will do it myself. Thanks again. – turnip Feb 14 '18 at 12:49
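
To make the first comment above concrete, here is a small self-contained sketch (synthetic numbers, not the data from the question) of why scoring predictions against labels derived from those same predictions gives a trivially perfect ROC-AUC, while independent true labels do not:

import numpy as np
from sklearn import metrics

# Hypothetical decision_function outputs for six samples
scores = np.array([-2.3, -0.7, 0.4, 1.1, -0.1, 2.0])

# What the original code effectively did: labels derived from the predictions,
# which for LinearSVC are just the scores thresholded at 0
predicted_binary = (scores > 0).astype(int)
print(metrics.roc_auc_score(predicted_binary, scores))  # 1.0 by construction

# Against independent ground-truth labels the ranking is no longer trivially perfect
true_binary = np.array([0, 1, 0, 1, 0, 1])
print(metrics.roc_auc_score(true_binary, scores))  # ~0.78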

1 Answer


I don't think it is valid to compare the predict_proba and decision_function methods like for like. The first sentence in the docs for LinearSVC's decision_function, "Predict confidence scores for samples.", must not be read as "predict probabilities". The second sentence clarifies that it is similar to the decision function of the general SVC.

You can get predict_proba for a linear SVM in sklearn, but then you need to use the general SVC with the kernel set to 'linear'. However, that changes the implementation under the hood (away from LIBLINEAR).
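
A minimal sketch of that alternative on synthetic data (not the asker's text features); note that SVC(kernel='linear') is backed by LIBSVM, and probability=True fits an extra Platt-scaling step at additional training cost:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# General SVC with a linear kernel; probability=True enables predict_proba
svc = SVC(kernel='linear', probability=True).fit(X, y)
pos_proba = svc.predict_proba(X)[:, 1]  # probability of the positive class
print(pos_proba[:5])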

Marcel Flygare
  • No, it's fine. Both `roc_curve` and `roc_auc_score` can take and handle either decision-function confidence values or predict_proba probabilities. – Vivek Kumar Feb 14 '18 at 12:15