
I'm creating a classifier that takes vectorized book text as input and as output predicts whether the book is "good" or "bad".

I have 40 books, 27 good and 13 bad. I split each book into 5 records (5 ten-page segments) to increase the amount of data, so 200 records total.

Ultimately, I'll fit the model on all the books and use it to predict unlabeled books.

What's the best way to estimate the accuracy my model's going to have? I'll also use this estimate for model comparison, tuning, etc.

The two options I'm thinking of (a rough sketch of both follows the list):

  1. Run a loop that train-test splits the data X times and look at the accuracy for each split
  2. Use cross-validation (GroupKFold specifically, so that the 5 records for each book are kept together; otherwise that would be major leakage)
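
Roughly what I mean, as a sketch (pipe, X, y, and groups refer to my code below; I'm using GroupShuffleSplit for option 1 here, ignoring for a moment the good/bad stratification I describe next):

from sklearn.model_selection import GroupKFold, GroupShuffleSplit, cross_val_score

# Option 1: repeat a random grouped train-test split many times and average
# (test_size=0.3 holds out roughly 12 of the 40 books each time)
gss = GroupShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
scores_1 = cross_val_score(pipe, X, y, groups=groups, cv=gss)

# Option 2: grouped 5-fold CV, so every book is held out exactly once
gkf = GroupKFold(n_splits=5)
scores_2 = cross_val_score(pipe, X, y, groups=groups, cv=gkf)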

I want to estimate the accuracy within a small margin of error as quickly as possible. Repeated train-test splits are slower, since even when I stratify by label (choosing 8 good books and 4 bad books for test) the accuracy for a particular model can vary from 0.6 to 0.8, so I'd have to run a lot to get an accurate estimate.

CV, on the other hand, is giving me the same score every time I run it, and seems to line up relatively well with the average accuracies of the models after 100 train-test splits (within 1-1.5%).

CV is much faster, so I'd prefer to use it. Does CV make sense to use here? I'm currently using 5-fold, so each fold holds out 8 books (40 records).

Also, should CV be giving the exact same accuracy every time I run it (and the exact same list of fold accuracies in the same order, for that matter)? I'm shuffling my corpus before passing X, y, and groups into cross_val_score. Would a ShuffleSplit be preferable? Here's my code:

import numpy as np
from sklearn.model_selection import cross_val_score, GroupKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

for i in range(5):
    # Shuffle the records before each run
    dfcopy = df.copy()
    dfcopy = dfcopy.sample(frac=1).reset_index(drop=True)
    X, y = dfcopy.text, dfcopy.label
    groups = dfcopy.title.tolist()

    model = MultinomialNB()

    pipe = Pipeline([('cleaner', clean_transformer()),
                     ('vectorizer', bow_vector),
                     ('classifier', model)])

    # GroupKFold keeps the 5 records of each book in the same fold
    score = cross_val_score(estimator=pipe, X=X, y=y, groups=groups, cv=GroupKFold())
    print(score)
    print(np.mean(score))

Finally, should I be using stratification? My thought was that I should, since I effectively have 40 items to split between train and test, so a randomly chosen test set could plausibly end up all or mostly good (or all or mostly bad), and I didn't think that would be a representative test set for estimating accuracy.
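
For what it's worth, the closest built-in option I've found that does both at once is StratifiedGroupKFold, which I believe only exists in newer scikit-learn releases (1.0+), so this is just a sketch of the idea rather than something I've run (names as in my code above):

from sklearn.model_selection import StratifiedGroupKFold, cross_val_score

# Sketch: keep the good/bad ratio roughly similar across folds while still
# keeping all 5 records of each book in the same fold (needs scikit-learn >= 1.0)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, groups=groups, cv=sgkf)
print(scores, scores.mean())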

1 Answer

I will try to go in order:

  • What's the best way to estimate the accuracy my model's going to have? I'll also use this estimate for model comparison, tuning, etc.

  • CV is much faster, so I'd prefer to use it. Does CV make sense to use here?

If your folds are very similar to each other, there will be no big difference between N-fold CV and repeated train-test splits.

  • Should CV be giving the exact same accuracy every time I run it?

It depends on two factors: the hyperparameters and the data used. MultinomialNB has very little room for improvement through its hyperparameters, so it mostly comes down to how the data is distributed across the CV folds.

  • Would a ShuffleSplit be preferable?

ShuffleSplit might make some difference, but do not expect it to be huge.
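
For grouped data like yours, the grouped analogue of ShuffleSplit is GroupShuffleSplit; a minimal sketch (reusing the pipe, X, y, and groups from your code) of how it would give splits that change from run to run while still keeping each book's records together:

from sklearn.model_selection import GroupShuffleSplit, cross_val_score

# A different random_state (or none at all) gives different grouped splits each run,
# unlike the GroupKFold call in the question, which reproduces the same folds every time
cv = GroupShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(pipe, X, y, groups=groups, cv=cv)
print(scores.mean(), scores.std())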

As I see it, and at least in my experience, the biggest step up you could make is to stop using MultinomialNB, which, although a good baseline, will not deliver outstanding results, and to try something a little more sophisticated, such as SGDClassifier, Random Forest, or Perceptron. With scikit-learn it is quite easy to switch from one classification algorithm to another, thanks to the very good work that has gone into standardising the estimator API. Your model would then become:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

One more thing that might help is using a train/test/validate split together with hyperparameter optimisation, such as grid search; the setup might take you a couple of hours, but it will certainly pay off.

If you decide to use train/test/validate, scikit-learn has you covered with the train_test_split function:

from sklearn.model_selection import train_test_split

X, y = df.text, df.label

# Hold out 20% of the records as the test set...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# ...then 25% of the remaining 80% as validation, giving a 60/20/20 split overall
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=1)

If you decide to use grid search for hyperparameter optimisation, you will need to:

(1) define your set of possible parameters

grid_1 = {
    "n_estimators": [100, 200, 500],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2", 0.2, 0.5, 0.8],
    "max_depth": [3, 4, 6, 10],
    "min_samples_split": [2, 5, 20, 50]
}

(2) launch the grid search optimisation

from sklearn.model_selection import GridSearchCV

model = RandomForestClassifier()
grid_search = GridSearchCV(model, grid_1, n_jobs=-1, cv=5)
grid_search.fit(X_train, y_train)
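
Since the classifier in the question actually sits inside a Pipeline with a cleaner and a vectorizer, another way to tune it is to pass the whole pipeline to GridSearchCV and prefix each parameter name with the step name. The following is only a sketch, assuming the pipe, X, y, and groups defined in the question (the 'classifier' step name comes from that code):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline

# Reuse the question's pipeline, with a random forest in the 'classifier' slot
pipe = Pipeline([('cleaner', clean_transformer()),
                 ('vectorizer', bow_vector),
                 ('classifier', RandomForestClassifier())])

# Prefix each hyperparameter with its step name so GridSearchCV routes it correctly
grid_pipe = {'classifier__' + key: values for key, values in grid_1.items()}

grid_search = GridSearchCV(pipe, grid_pipe, n_jobs=-1, cv=GroupKFold(n_splits=5))
grid_search.fit(X, y, groups=groups)   # groups keeps each book's records in one fold

print(grid_search.best_params_)
print(grid_search.best_score_)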

Grid search is a pretty simple optimisation technique, but it will be very helpful in delivering better results. If you want to deepen your understanding of this topic and further enhance your code, you can find example code using more sophisticated hyperparameter optimisation strategies, like TPE, here

Finally, your dataset seems to be pretty small; if you are experiencing long waiting times between one training run and the next, I would suggest considering a little cache system to cut down loading and processing times. You can find example code using a little cache system here
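
If you would rather not write the cache yourself, note that scikit-learn's Pipeline can already cache its fitted transformers through the memory argument; a small sketch, reusing the pipe definition from the question (the cache directory name is arbitrary):

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# memory='cache_dir' stores fitted transformers (cleaner, vectorizer) on disk with joblib,
# so repeated fits on identical data and parameters skip the expensive preprocessing steps
pipe = Pipeline([('cleaner', clean_transformer()),
                 ('vectorizer', bow_vector),
                 ('classifier', MultinomialNB())],
                memory='cache_dir')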

  • Thank you for the detailed reply! MultinomialNB() was just the example model that I had in that slot to show off the cross-validation code; in my real notebook I test a bunch of different models. I'll look into the cache system, and have been using gridsearch for hyperparameter tuning. I was mainly concerned with whether train_test_split or cross_val_score is a better estimator of the accuracy my model will have on new data. – rbb Jul 23 '20 at 19:38
  • You are welcome! The answer then is yes, cross-validation should be better at delivering stable performance estimates, but in my experience the data and its distribution are key in this estimation, so it can happen that a standard train-test-validate procedure works as well as cross-validation. – Williams Rizzi Jul 24 '20 at 08:15