
I was researching a Kaggle competition and used a logistic regression classifier to compare my results against the top 10 competitors' approaches.

Link to the competition: https://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard

I'm still fairly new to classification problems, so I just tested classifiers without too many modifications. In this case I used scikit-learn's LogisticRegression. I cleaned the train/test data and used the model to generate a ROC curve.

My area under the curve was 0.89, which would have placed me in 1st place with a significant lead, and that seems quite impossible to me considering my implementation's simplicity. Could someone tell me whether my program is doing something incorrect that gives such a score (e.g. somehow overfitting, or a bug in the code)?

import csv
import preprocessor as p
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

path = "C:\\Users\\Mike\\Desktop"

def vectorize_dataset(subpath, stem, vectorizer):
    comments = []
    labels = []
    stemmer = SnowballStemmer("english")
    with open(path + subpath + '.csv', 'r') as f:
        data_csv = csv.reader(f)

        for row in data_csv:
            # Strip URLs, mentions, etc. with the preprocessor package,
            # then undo the CSV quoting and escape sequences by hand.
            clean_txt = p.clean(row[2])
            clean_txt = clean_txt.strip().replace('"', '').replace('\\\\', '\\').replace('_', ' ')
            clean_txt = bytes(clean_txt, 'utf-8').decode('unicode_escape', 'ignore')
            # Tokenize before filtering so isalpha() always sees whole
            # words rather than single characters when stem is False.
            tokens = word_tokenize(clean_txt)
            if stem:
                tokens = [stemmer.stem(word.lower()) for word in tokens]
            tokens = [word for word in tokens if word.isalpha()]
            clean_txt = " ".join(tokens)

            # Keep only rows with a 0/1 label; this also skips the header.
            if clean_txt != "" and row[0] in ('0', '1'):
                comments.append(clean_txt)
                labels.append(int(row[0]))
    # Fit the vocabulary on the training data only; reuse it for test data.
    if subpath == "\\train":
        return (vectorizer.fit_transform(comments), labels)
    return (vectorizer.transform(comments), labels)

def print_auroc_for_classifier(vect_tuple, classifier):
    # Score all samples at once; column 1 of predict_proba is P(y == 1).
    y_true = vect_tuple[1]
    y_score = classifier.predict_proba(vect_tuple[0])[:, 1]

    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)

    plt.plot(fpr, tpr)

if __name__ == '__main__':
    plt.figure()
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # chance line

    # The vectorizer's vocabulary is fit on the training set and reused
    # unchanged for the test set.
    vectorizer = TfidfVectorizer()
    train_tuple = vectorize_dataset('\\train', True, vectorizer)
    test_tuple = vectorize_dataset('\\test', True, vectorizer)

    # LogisticRegression accepts the sparse matrix directly, so there is
    # no need to densify it with toarray().
    logreg = linear_model.LogisticRegression(C=7)
    logreg.fit(train_tuple[0], train_tuple[1])

    print_auroc_for_classifier(test_tuple, logreg)
    plt.show()

Instructions:

  1. Download train.csv and test_with_solutions.csv from the Kaggle data page: https://www.kaggle.com/c/detecting-insults-in-social-commentary/data
  2. Rename test_with_solutions.csv to test.csv
  3. In the code, set path to the folder containing the .csv files

I don't understand the C parameter very well; if it is the reason my score is this high, please let me know, and I'd appreciate any advice on finding a good value for it. Thanks.
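
For reference, scikit-learn's docs describe C as the inverse of the regularization strength, so smaller values regularize more heavily. A minimal sketch on synthetic stand-in data (make_classification here is just a placeholder, not the competition data) shows the coefficients shrinking as C drops:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the competition data.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
for c in [0.01, 1, 100]:
    model = LogisticRegression(C=c).fit(X, y)
    # Smaller C = stronger L2 penalty = smaller coefficients.
    print("C=%6.2f  mean |coef| = %.4f" % (c, np.abs(model.coef_).mean()))

With high-dimensional TF-IDF features and relatively few training rows, a more regularized model can generalize better, which is why tuning C can matter here.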

The approach:

  1. Read the .csv file and clean the text (the preprocessor package plus some manual character replacements)
  2. Stemmed with the Snowball stemmer and kept only tokens where isalpha() is true
  3. Vectorized the test and train data using scikit-learn's TfidfVectorizer
  4. Trained logreg on the training data
  5. Calculated and plotted the ROC curve

Edit:

So I played around with the C parameter, and setting C to a high value such as 1e5 gives me a lower ROC curve area. Perhaps the main question now is: assuming my code is correct and C is the parameter I needed to tune, should I be optimizing C to give the highest ROC curve area?

Edit2: I used GridSearchCV to test C in the range 0.1 to 10 and still got high results (going above 10 or below 0.1 didn't change anything).
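
For reference, a minimal sketch of that search (X_train and y_train are synthetic stand-ins for the real TF-IDF training matrix, and the grid is arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real TF-IDF training matrix.
X_train, y_train = make_classification(n_samples=500, random_state=0)

# Search a log-spaced grid of C values, scored by cross-validated
# ROC AUC on the training data only.
grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': np.logspace(-2, 2, 9)},
                    scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, "CV AUC: %.3f" % grid.best_score_)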

Mike
1 Answer


You are using different testing data than would have been available: use just the test.csv file to find the best model and value for C, then evaluate it only on impermium_verification_set.csv. When the competition was running, it looks like only the test set was available for finding a model; models were then locked, and the leaderboard was based on the verification set. You are using the full combination of both to select the best model.
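
As a minimal sketch of that protocol (synthetic data stands in for the three files, and the split sizes are arbitrary): tune C against one held-out split, then report the score exactly once on a second, untouched split.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for train.csv, test.csv, and the verification set.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_test, X_verif, y_test, y_verif = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# 1. Tune C on the "test" split only (this mimics the live leaderboard).
best_c, best_auc = None, -1.0
for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(C=c).fit(X_train, y_train)
    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    if score > best_auc:
        best_c, best_auc = c, score

# 2. Report the final score once, on the held-out verification split.
final = LogisticRegression(C=best_c).fit(X_train, y_train)
print("verification AUC: %.3f" % roc_auc_score(y_verif, final.predict_proba(X_verif)[:, 1]))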

You can always ask on the discussion boards on the Kaggle competition page if you want; I'm sure people there will help too. Some of the top placers, including the winner, have posted their code on the discussion page for interest.

Ken Syme
  • Thank you so much. I used the impermium_verification_set.csv and got a ROC AUC of 0.78. So test_with_solutions.csv had both test.csv and impermium_verification_set.csv? Also, you mentioned test was used to find a model... did you mean train? – Mike Dec 11 '17 at 21:18
  • Yes, I believe it has both sets included. When tuning a model you can do cross-validation with the training set; I more meant that if you had trained many models, the ones you would submit for the competition would likely be those with the highest test set score. – Ken Syme Dec 12 '17 at 07:35
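
As a minimal sketch of the cross-validation mentioned above (using the question's C=7, with make_classification as a synthetic stand-in for the real training matrix):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real training matrix and labels.
X, y = make_classification(n_samples=500, random_state=0)

# Five-fold cross-validated ROC AUC for the C=7 model,
# using the training data only.
scores = cross_val_score(LogisticRegression(C=7), X, y, cv=5, scoring='roc_auc')
print("mean CV AUC: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))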