I was researching a Kaggle competition and used a logistic regression classifier to see how it compares against the top 10 competitors' approaches.
Link to the competition: https://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard
I'm still fairly new to classification problems, so I tested classifiers without many modifications; in this case I used scikit-learn's logistic regression. I cleaned the train/test data and used it to generate a ROC curve.
My area under the curve was 0.89, which would have taken 1st place by a significant margin, and that seems quite impossible to me considering how simple my implementation is. Could someone tell me whether my program is doing something incorrect that produces such a score (e.g. overfitting somewhere, or a bug in the code)?
import csv
import preprocessor as p
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
path = "C:\\Users\\Mike\\Desktop"
def vectorize_dataset(subpath, stem, vectorizer):
    """Read a CSV of labelled comments, clean (and optionally stem) the text, and vectorize it."""
    comments = []
    labels = []
    stemmer = SnowballStemmer("english")
    with open(path + subpath + '.csv', 'r') as f:
        data_csv = csv.reader(f)
        for row in data_csv:
            # clean the raw comment (preprocessor package plus manual replacements)
            clean_txt = p.clean(row[2])
            clean_txt = clean_txt.strip().replace('"', '').replace('\\\\', '\\').replace('_', ' ')
            clean_txt = bytes(clean_txt, 'utf-8').decode('unicode_escape', 'ignore')
            if stem:
                # lowercase and stem each token, then keep alphabetic tokens only
                clean_txt = [stemmer.stem(word.lower()) for word in word_tokenize(clean_txt)]
                clean_txt = [word for word in clean_txt if word.isalpha()]
                clean_txt = " ".join(clean_txt)
            if clean_txt != "":
                if row[0] == str(1) or row[0] == str(0):
                    comments.append(clean_txt)
                    labels.append(int(row[0]))
    # fit the vectorizer on the training data only; reuse the fitted vocabulary for the test data
    if subpath == "\\train":
        return (vectorizer.fit_transform(comments), labels)
    return (vectorizer.transform(comments), labels)
def print_auroc_for_classifier(vect_tuple, classifier):
    """Score every sample, print the ROC AUC and add the ROC curve to the current plot."""
    y_true, y_score = [], []
    for sample, label in zip(vect_tuple[0], vect_tuple[1]):
        y_true.append(label)
        # probability of the positive (insult) class
        y_score.append(classifier.predict_proba(sample)[0][1])
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    print("ROC AUC: %.2f" % roc_auc)
    plt.plot(fpr, tpr)
if __name__ == '__main__':
    # set up the ROC plot
    plt.figure()
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    # fit the vectorizer on the training set, then reuse it for the test set
    vectorizer = TfidfVectorizer()
    train_tuple = vectorize_dataset('\\train', True, vectorizer)
    test_tuple = vectorize_dataset('\\test', True, vectorizer)
    # train logistic regression and evaluate it on the test set
    logreg = linear_model.LogisticRegression(C=7)
    logreg.fit(train_tuple[0].toarray(), train_tuple[1])
    print_auroc_for_classifier(test_tuple, logreg)
    plt.show()
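As far as I can tell, the per-sample loop in print_auroc_for_classifier should be equivalent to scoring the whole test matrix in one call; here is a minimal sketch of that one-shot check, reusing test_tuple and the fitted logreg from above:

from sklearn.metrics import roc_auc_score

y_score = logreg.predict_proba(test_tuple[0])[:, 1]
print("ROC AUC: %.2f" % roc_auc_score(test_tuple[1], y_score))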
Instructions:
- From the Kaggle data page, download train.csv and test_with_solutions.csv: https://www.kaggle.com/c/detecting-insults-in-social-commentary/data
- Rename test_with_solutions.csv to test.csv
- In the code, set path to the directory containing the .csv files
As for the C parameter, I don't understand it very well; if it is the reason my score is this high, please let me know, and I'd appreciate any advice on finding a good value for it. Thanks.
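From the scikit-learn docs, C appears to be the inverse of the regularization strength, so smaller values mean stronger regularization. Below is a minimal sketch of how I could pick it by cross-validation on the training data only (the candidate values are just examples; train_tuple is the tuple returned by vectorize_dataset above):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for C in [0.01, 0.1, 1, 10, 100]:
    scores = cross_val_score(LogisticRegression(C=C), train_tuple[0], train_tuple[1],
                             scoring='roc_auc', cv=5)
    print("C=%g: mean CV AUC=%.3f" % (C, scores.mean()))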
The approach:
- Read the .csv files and cleaned the text (used the preprocessor package and manually replaced certain characters)
- Used the Snowball stemmer and kept only tokens that pass isalpha()
- Vectorized the train and test data using scikit-learn's TfidfVectorizer
- Trained logreg on the training data
- Calculated and plotted the ROC curve (the same steps are sketched as a Pipeline below)
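The steps above could also be expressed as a scikit-learn Pipeline; this is only a rough sketch, and the stem_tokens helper is a hypothetical stand-in for the cleaning/stemming code above, not exactly what I ran:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english")

def stem_tokens(text):
    # lowercase, keep alphabetic tokens and stem them (stand-in for the cleaning above)
    return [stemmer.stem(w.lower()) for w in word_tokenize(text) if w.isalpha()]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(tokenizer=stem_tokens)),
    ("logreg", LogisticRegression(C=7)),
])
# clf.fit(train_comments, train_labels)
# scores = clf.predict_proba(test_comments)[:, 1]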
Edit:
So I played around with the C parameter, and setting C to a high value such as 1e5 gives me a lower area under the ROC curve. Perhaps the main question now is: assuming my code is correct and C is the parameter I needed to tune, should I be optimizing C to give the highest ROC curve area?
Edit 2: I used GridSearchCV to test C in the range 0.1 to 10 and still got high results (going above 10 or below 0.1 didn't change anything).
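The search was set up roughly like this (a sketch; the exact grid and cv value are approximations of what I ran):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(LogisticRegression(),
                    param_grid={'C': [0.1, 0.3, 1, 3, 10]},
                    scoring='roc_auc', cv=5)
grid.fit(train_tuple[0], train_tuple[1])
print(grid.best_params_, grid.best_score_)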