I'm building a classifier that takes vectorized book text as input and predicts whether the book is "good" or "bad".
I have 40 books, 27 good and 13 bad. I split each book into 5 records (5 ten-page segments) to increase the amount of data, so 200 records total.
Ultimately, I'll fit the model on all the books and use it to predict unlabeled books.
What's the best way to estimate the accuracy my model's going to have? I'll also use this estimate for model comparison, tuning, etc.
The two options I'm thinking of:
- Run a loop that does a train-test split X times and record the accuracy for each split (see the sketch just below this list)
- Use cross-validation (GroupKFold specifically, so that the 5 records for each book stay together; splitting a book's records across train and test would be major leakage)
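To make the first option concrete, here's roughly what I mean (just a sketch, not my exact code; it assumes the same df with text, label, and title columns as in my code further down, that the label values are literally 'good'/'bad', and that pipe is the pipeline defined there):

import numpy as np
import pandas as pd

# Repeated train-test splits at the book level, stratified by label:
# 8 good + 4 bad books held out each time (assumes label values 'good'/'bad')
books = df.drop_duplicates('title')[['title', 'label']]
accuracies = []
for i in range(100):
    test_titles = pd.concat([
        books[books.label == 'good'].sample(8).title,
        books[books.label == 'bad'].sample(4).title,
    ])
    test_mask = df.title.isin(test_titles)
    train, test = df[~test_mask], df[test_mask]
    pipe.fit(train.text, train.label)   # pipe = the pipeline from my code below
    accuracies.append(pipe.score(test.text, test.label))
print(np.mean(accuracies), np.std(accuracies))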
I want to estimate the accuracy within a small margin of error as quickly as possible. Repeated train-test splits are slower: even when I stratify by label (choosing 8 good books and 4 bad books for the test set), the accuracy for a particular model can vary from 0.6 to 0.8, so I'd have to run a lot of splits to get a stable estimate.
CV, on the other hand, is giving me the same score every time I run it, and seems to line up relatively well with the average accuracies of the models after 100 train-test splits (within 1-1.5%).
CV is much faster, so I'd prefer to use it. Does CV make sense to use here? I'm currently using 5-fold, so each fold holds out 8 books (40 records).
Also, should CV be giving the exact same accuracy every time I run it (and the exact same list of accuracies in the same order, for that matter)? I'm shuffling my corpus before passing X, y, and groups into cross_val_score. Would a ShuffleSplit be preferable? Here's my code:
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, GroupKFold

for i in range(5):
    # Shuffle the records before extracting X, y, and the book-level groups
    dfcopy = df.sample(frac=1).reset_index(drop=True)
    X, y = dfcopy.text, dfcopy.label
    groups = dfcopy.title.tolist()

    model = MultinomialNB()
    pipe = Pipeline([('cleaner', clean_transformer()),
                     ('vectorizer', bow_vector),
                     ('classifier', model)])

    # GroupKFold keeps all 5 records of a book in the same fold
    score = cross_val_score(estimator=pipe, X=X, y=y, groups=groups, cv=GroupKFold())
    print(score)
    print(np.mean(score))
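And this is the ShuffleSplit-style alternative I was wondering about; I'm assuming GroupShuffleSplit is the right group-aware analogue here, with test_size=8 meaning 8 held-out books per split:

from sklearn.model_selection import GroupShuffleSplit

# Group-aware shuffle splits instead of GroupKFold: each of the 5 splits
# holds out 8 randomly chosen books (40 records); random_state controls
# whether the splits (and therefore the scores) repeat across runs
cv = GroupShuffleSplit(n_splits=5, test_size=8, random_state=42)
score = cross_val_score(estimator=pipe, X=X, y=y, groups=groups, cv=cv)
print(score, np.mean(score))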
Finally, should I be using stratification? My thought was that I should, since I effectively only have 40 items to split between train and test, so a randomly chosen test set could end up being all or mostly good (or mostly bad), and I didn't think that would be a good test set for representing accuracy.
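In case it's relevant, this is roughly what I had in mind for stratification (a sketch; I believe newer scikit-learn versions, 1.0+, have StratifiedGroupKFold, which tries to preserve the good/bad balance in each fold while still keeping each book's 5 records together):

from sklearn.model_selection import StratifiedGroupKFold

# Stratified + grouped: folds keep each book's records together and aim to
# preserve the overall 27:13 good/bad balance (needs scikit-learn >= 1.0)
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
score = cross_val_score(estimator=pipe, X=X, y=y, groups=groups, cv=cv)
print(score, np.mean(score))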