
I am trying to train a classifier to detect imperatives. My data contains 2000 imperatives and 2000 non-imperatives. I held out 10% of the 4000 sentences (400) as my test set and used the remaining 3600 sentences as the training set for the classifiers. I then tried to apply K-fold cross-validation to the training set. Part of my code is below:

import nltk

featuresets = [(document_features(d, word_features), c) for (d, c) in train]

# Fold 1: first 10% (360 sentences) is the test set
train_set, test_set = featuresets[360:], featuresets[:360]
classifier = nltk.NaiveBayesClassifier.train(train_set)
a = nltk.classify.accuracy(classifier, test_set)

# Fold 2: second 10% of the sentences is the test set
train_set2, test_set2 = featuresets[:360] + featuresets[720:], featuresets[360:720]
classifier2 = classifier.train(train_set2)
b = nltk.classify.accuracy(classifier2, test_set2)

# Fold 3: third 10% of the data is the test set
train_set3, test_set3 = featuresets[:720] + featuresets[1080:], featuresets[720:1080]
classifier3 = classifier2.train(train_set3)
c = nltk.classify.accuracy(classifier3, test_set3)

# Fold 4: fourth 10% of the data is the test set
train_set4, test_set4 = featuresets[:1080] + featuresets[1440:], featuresets[1080:1440]
classifier4 = classifier3.train(train_set4)
d = nltk.classify.accuracy(classifier4, test_set4)

I repeated the same training step 10 times in total (only 4 are shown in my code), because in K-fold cross-validation each of the 10 parts of the data needs to serve as validation data exactly once.

My question: should I create a new classifier each time (`classifier = nltk.NaiveBayesClassifier.train(train_set)`), train it on that fold, and report the average of the individual classifiers' accuracy scores as my accuracy? Or should I keep training the previously trained classifier on each new fold's data (as I do now), so that the last classifier is the one that has been trained 10 times?
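For reference, the ten manual splits above can be written as one loop. This is only a sketch of the fold bookkeeping, with `featuresets` replaced by dummy data so the snippet runs on its own; the placeholder accuracy value stands in for the real `nltk.NaiveBayesClassifier.train` / `nltk.classify.accuracy` calls:

```python
# Sketch of 10-fold splitting over a list of (features, label) pairs.

def k_fold_splits(data, k=10):
    """Yield (train_fold, test_fold) pairs; each item lands in the test fold exactly once."""
    fold_size = len(data) // k
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        test_fold = data[start:end]
        train_fold = data[:start] + data[end:]
        yield train_fold, test_fold

# Dummy stand-in for the real 3600 feature sets in the question.
featuresets = [({"f": n % 2}, "imp" if n % 2 else "non") for n in range(3600)]

accuracies = []
for train_fold, test_fold in k_fold_splits(featuresets, k=10):
    # In the real code, per fold:
    #   classifier = nltk.NaiveBayesClassifier.train(train_fold)
    #   accuracies.append(nltk.classify.accuracy(classifier, test_fold))
    accuracies.append(len(test_fold) / len(featuresets))  # placeholder metric

print(len(accuracies), sum(accuracies) / len(accuracies))
```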

  • 1
    No you don't need to create a new classifier. If you create an object `classifier = nltk.NaiveBayesClassifier.train(train_set)` every time a new classifier will be made which will not be the same one and that would completely be opposite to the purpose of K-Fold cross validation in which we train one model with different proportions of our train data. – Rex5 Aug 13 '19 at 03:53
  • 1
    No you don't need to. Also, your manual K-fold code can all be replaced by using the convenience functions [`sklearn.model_selection.KFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) or [`sklearn.model_selection.StratifiedKFold`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) in your case to preserve the class ration inside each fold. Please see scikit-learn tutorials. – smci Aug 13 '19 at 03:56
  • do see the links provided by @smci, they are quite useful. – Rex5 Aug 13 '19 at 04:18
  • @smci Thank you for providing the tutorials. If I have any questions writing a k-fold function code, I will post a question again. Thanks a lot! – Zong-Ying Aug 13 '19 at 04:19
  • I meant 'class ratio' not 'ration', that's obviously a typo... – smci Aug 13 '19 at 04:23
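As a follow-up to the comments, here is a minimal sketch of the suggested `sklearn.model_selection.StratifiedKFold` approach, assuming scikit-learn is installed; the labels and feature dicts are toy stand-ins for the question's 3600 labelled sentences, and the real per-fold training calls are shown only as comments:

```python
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the question's balanced 1800/1800 labelled data.
labels = ["imp"] * 1800 + ["non"] * 1800
featuresets = [({"dummy": i}, y) for i, y in enumerate(labels)]

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in skf.split(featuresets, labels):
    train_fold = [featuresets[i] for i in train_idx]
    test_fold = [featuresets[i] for i in test_idx]
    # In the real code, train a *fresh* classifier per fold:
    #   classifier = nltk.NaiveBayesClassifier.train(train_fold)
    #   accuracies.append(nltk.classify.accuracy(classifier, test_fold))
    # Placeholder: record the class ratio in the test fold instead.
    imp_share = sum(1 for _, y in test_fold if y == "imp") / len(test_fold)
    accuracies.append(imp_share)

print(sum(accuracies) / len(accuracies))  # each fold preserves the 50/50 class ratio
```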
