I am testing a multi-label classification problem using textual features, with a total of 1503 text documents. The model shows slight variations in the results each time I run the script manually. As I am a beginner, I am not sure whether my model is overfitting or whether this is normal.

http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html

I have built the model using the exact script from the blog linked above. The one variation is that I use LinearSVC from scikit-learn.

My accuracy score varies between 0.89 and 0.90, and Kappa between 0.87 and 0.88. Should I make some modifications to make the results stable?

Here is a sample from two manual runs.

First run
Total emails classified: 1503
F1 Score: 0.902158940397
classification accuracy: 0.902158940397
kappa accuracy: 0.883691169128


             precision    recall  f1-score   support

      Arts      0.916     0.878     0.897       237
     Music      0.932     0.916     0.924       238
      News      0.828     0.876     0.851       242
  Politics      0.937     0.900     0.918       230
   Science      0.932     0.791     0.855        86
    Sports      0.929     0.948     0.938       233
Technology      0.874     0.937     0.904       237

avg / total     0.904     0.902     0.902      1503


Second run
Total emails classified: 1503
F1 Score: 0.898181015453
classification accuracy: 0.898181015453
kappa accuracy: 0.879002051427

Given below is the code:

import numpy
from pandas import DataFrame
from sklearn import metrics
from sklearn.cross_validation import KFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score, cohen_kappa_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# SOURCES, build_data_frame, stop_words1 and print_metrics are defined elsewhere
# in the script (see the blog linked above)
def compute_classification():

    # 1. Load dataset
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(path, classification))
    data = data.reindex(numpy.random.permutation(data.index))

    # 2. Apply the classification method: SVM using TfidfVectorizer
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True,
                                       max_df=0.95, min_df=2, stop_words=stop_words1)),
        ('clf',        LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
    ])

    # 3. Perform k-fold cross-validation
    k_fold = KFold(n=len(data), n_folds=10)
    f_score    = []
    c_accuracy = []
    k_score    = []
    confusion  = numpy.zeros((7, 7), dtype=int)
    y_predicted_overall = None
    y_test_overall      = None

    for train_indices, test_indices in k_fold:

        train_text = data.iloc[train_indices]['text'].values
        train_y    = data.iloc[train_indices]['class'].values.astype(str)
        test_text  = data.iloc[test_indices]['text'].values
        test_y     = data.iloc[test_indices]['class'].values.astype(str)

        # Train the model
        pipeline.fit(train_text, train_y)

        # Predict the test data
        predictions = pipeline.predict(test_text)

        confusion += confusion_matrix(test_y, predictions)
        score = f1_score(test_y, predictions, average='micro')
        f_score.append(score)
        caccuracy = metrics.accuracy_score(test_y, predictions)
        c_accuracy.append(caccuracy)
        kappa = cohen_kappa_score(test_y, predictions)
        k_score.append(kappa)

        # Collect the predictions and true labels per fold
        if y_predicted_overall is None:
            y_predicted_overall = predictions
            y_test_overall = test_y
        else:
            y_predicted_overall = numpy.concatenate([y_predicted_overall, predictions])
            y_test_overall = numpy.concatenate([y_test_overall, test_y])

    # Print metrics
    print_metrics(data, k_score, c_accuracy, y_predicted_overall, y_test_overall, f_score, confusion)

    return pipeline
– VKB

1 Answer


You are seeing variation because LinearSVC uses a random number generator when fitting:

The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter.

You can also try setting the random_state parameter. In fact, most sklearn objects that use a random number generator take random_state as an optional parameter. You can pass either an instance of RandomState or an int seed:

pipeline = Pipeline([
    # SVM using TfidfVectorizer
    ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True,
                                   max_df=0.95, min_df=2, stop_words=stop_words1)),
    ('clf',        LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-5, random_state=42))
])
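
For example, passing a RandomState instance instead of an int seed works just as well (a minimal sketch; the name rng is just illustrative):

from numpy.random import RandomState

rng = RandomState(42)   # shared random number generator
clf = LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-5, random_state=rng)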

EDIT: As mentioned in the comments, cross_validation.KFold also takes a random_state parameter that determines how the data is split. To ensure reproducibility, you should also pass a seed or RandomState to KFold.

ON SECOND THOUGHT: documentation for KFold suggests that the default is to not randomize the splits unless shuffle=True is also specified, so I don't know if the above suggestion will help.

AS A SIDE-NOTE: cross_validation.KFold has been deprecated since version 0.18, so I would recommend using model_selection.KFold instead:

from sklearn.model_selection import KFold
k_fold = KFold(n_splits=10, shuffle=True, random_state=42)   # random_state only has an effect when shuffle=True
...
for train_indices, test_indices in k_fold.split(data):
– PaSTE
  • That did not work, and I am facing the same problem with Naive Bayes as well ('clf', MultinomialNB(alpha=.01)). I tried populating the top 10 features from each category and most of the values are negative (-0.44165490669, -0.20471658491, -0.422944296586, -0.456577163343, -0.149703530298, -0.353109758872, -0.0361366497467, -0.105397140396, -0.264185671137, -0.25398199818, -0.151967985751, -0.190810193788, -0.37292489701, -0.132826347092), which I find strange. Could it be due to this reason, and how come my accuracy is this high when the top features are all negative values? – VKB Jul 21 '17 at 03:37
  • In addition to adding `random_state` to LinearSVC, add `random_state` to KFold as well, because the indices it generates depend on it too. Also, @VKB, how are you selecting the top 10 features? – Vivek Kumar Jul 21 '17 at 03:42
  • @VivekKumar you can find my code here: [link](https://stackoverflow.com/questions/45190708/creating-n-grams-word-cloud-using-python). I just added this line to get the values (coef1 = pipeline.named_steps['clf'].coef_.ravel()) and I print them by running a loop. – VKB Jul 21 '17 at 04:11
  • @VivekKumar I am sorry, I flattened it out using ravel; I am getting positive values. Adding random_state only slows down the computation time, but I still get the same minor fluctuations. If it is OK to get a minor variation, which accuracy should I take as final? – VKB Jul 21 '17 at 04:54
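
On the coef_ discussion in the comments above: for a multi-class LinearSVC, coef_ has shape (n_classes, n_features), so coef_.ravel() flattens the weights of all seven classes into a single array, and negative weights are expected (they push a document away from a class) and do not contradict a high accuracy. Below is a minimal sketch of inspecting the top features per class instead; it assumes the fitted pipeline from the question with steps named 'vectorizer' and 'clf', and the helper name show_top_features is just illustrative (get_feature_names() is called get_feature_names_out() in recent scikit-learn versions):

import numpy

def show_top_features(pipeline, n=10):
    vectorizer = pipeline.named_steps['vectorizer']
    clf        = pipeline.named_steps['clf']
    feature_names = numpy.asarray(vectorizer.get_feature_names())
    # coef_ is (n_classes, n_features): sort each class's row separately
    for class_label, weights in zip(clf.classes_, clf.coef_):
        top = numpy.argsort(weights)[-n:][::-1]   # indices of the n largest weights
        print(class_label, feature_names[top])

As for which number to report when only minor fluctuations remain, the usual choice is the mean (and standard deviation) of the per-fold scores, e.g. numpy.mean(c_accuracy) and numpy.std(c_accuracy), rather than the result of any single run.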