I am trying to train a classifier to detect imperatives. My data contains 2000 imperatives and 2000 non-imperatives. I used 10% of the 4000 sentences (400) as my test set and the remaining 3600 sentences as the training set for the classifiers. I tried to apply the concept of K-fold cross-validation. Part of my code is below:
featuresets = [(document_features(d, word_features), c) for (d, c) in train]
train_set, test_set = featuresets[360:], featuresets[:360]
# first 360 sentences (the first 10% of the data) form the first test_set
classifier = nltk.NaiveBayesClassifier.train(train_set)
a=nltk.classify.accuracy(classifier, test_set)
train_set2, test_set2 = featuresets[:360] + featuresets[720:], featuresets[360:720]
# second 10% of the sentences form the second test_set
classifier2 = classifier.train(train_set2)
b=nltk.classify.accuracy(classifier2, test_set2)
train_set3, test_set3 = featuresets[:720] + featuresets[1080:], featuresets[720:1080]
# third 10% of the data forms the third test_set
classifier3 = classifier2.train(train_set3)
c=nltk.classify.accuracy(classifier3, test_set3)
train_set4, test_set4 = featuresets[:1080] + featuresets[1440:], featuresets[1080:1440]
# fourth 10% of the data forms the fourth test_set
classifier4 = classifier3.train(train_set4)
d=nltk.classify.accuracy(classifier4, test_set4)
I repeated the same training step 10 times (only 4 are shown in my code), because for 10-fold cross-validation each of the 10 parts of the data needs to serve as validation data exactly once.
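The repeated slicing above can be expressed as a single loop. Here is only a sketch of the fold logic in plain Python (the nltk training and accuracy calls are omitted, and `kfold_splits` is a helper name I made up, not an nltk function):

```python
def kfold_splits(data, k=10):
    """Yield (train_fold, test_fold) pairs so that each 1/k slice
    of the data is the test fold exactly once."""
    fold_size = len(data) // k
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        # the i-th slice becomes the test fold...
        test_fold = data[start:end]
        # ...and everything before plus everything after it trains
        train_fold = data[:start] + data[end:]
        yield train_fold, test_fold

# Toy example with 20 items instead of my 3600 featuresets:
data = list(range(20))
splits = list(kfold_splits(data, k=10))
```

With my data, each `(train_fold, test_fold)` pair would be 3240 and 360 featuresets, and the classifier/accuracy lines would go inside the loop.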
The question I have is this: should I create a new classifier each time
(classifier = nltk.NaiveBayesClassifier.train(train_set)),
train it on its fold, and report the average of the 10 individual accuracy scores as the overall accuracy? Or should I keep training the previously trained classifier on each new fold (as I do now), so that the last classifier is the one that has been trained 10 times?