GridSearchCV scoring parameter: using scoring='f1' or scoring=None (by default uses accuracy) gives the same result

Question

I'm using an example extracted from the book "Mastering Machine Learning with scikit learn".

It uses a decision tree to predict whether each of the images on a web page is an advertisement or article content. Images that are classified as being advertisements could then be hidden using Cascading Style Sheets. The data is publicly available from the Internet Advertisements Data Set: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements, which contains data for 3,279 images.

The following is the complete code for completing the classification task:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sys,random

def main(argv):
    df = pd.read_csv('ad-dataset/ad.data', header=None)
    explanatory_variable_columns = set(df.columns.values)
    response_variable_column = df[len(df.columns.values)-1]


    explanatory_variable_columns.remove(len(df.columns.values)-1)
    y = [1 if e == 'ad.' else 0 for e in response_variable_column]
    X = df[list(explanatory_variable_columns)]

    X.replace(to_replace=' *\?', value=-1, regex=True, inplace=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100000)

    pipeline = Pipeline([('clf',DecisionTreeClassifier(criterion='entropy',random_state=20000))])

    parameters = {
        'clf__max_depth': (150, 155, 160),
        'clf__min_samples_split': (1, 2, 3),
        'clf__min_samples_leaf': (1, 2, 3)
    }

    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,verbose=1, scoring='f1')
    grid_search.fit(X_train, y_train)
    print 'Best score: %0.3f' % grid_search.best_score_
    print 'Best parameters set:'
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print '\t%s: %r' % (param_name, best_parameters[param_name])

    predictions = grid_search.predict(X_test)
    print classification_report(y_test, predictions)


if __name__ == '__main__':
    main(sys.argv[1:])

The results of using scoring='f1' in GridSearchCV, as in the example, are:

[Image: F1 score results]

The results of using scoring=None (accuracy by default) are the same as with the F1 score:

[Image: accuracy score results]

If I'm not wrong, optimizing the parameter search with different scoring functions should yield different results. The following case shows that different results are obtained when scoring='precision' is used.

The results of using scoring='precision' are different from the other two cases. The same would be true for 'recall', etc.:

[Image: precision score results]

WHY 'F1' AND None, BY DEFAULT ACCURACY, GIVE THE SAME RESULT??

EDITED

I agree with both answers by Fabian & Sebastian. The problem is probably the small param_grid. But I just wanted to clarify that the problem arose when I was working with a totally different dataset (not the one in the example here) that is highly imbalanced, 100:1 (which should affect the accuracy), and using Logistic Regression. In this case, too, 'F1' and accuracy gave the same result.

The param_grid that I used, in this case, was the following:

parameters = {"penalty": ("l1", "l2"),
    "C": (0.001, 0.01, 0.1, 1, 10, 100),
    "solver": ("newton-cg", "lbfgs", "liblinear"),
    "class_weight":[{0:4}],
}

I guess that the parameter selection is also too small.

What version of sklearn are you using? – rabbit, Oct 01 '15 at 15:07
Hi @NBartley, the scikit-learn version I'm using is 0.16.1 – Pablo Fleurquin, Oct 01 '15 at 15:16
fitting a solver as a hyperparameter is rather redundant – lejlot, Dec 22 '15 at 23:09
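A side note on the logistic regression grid: as far as I know, the "newton-cg" and "lbfgs" solvers do not support the "l1" penalty, so a single dict that crosses every penalty with every solver contains combinations that cannot be fitted. Here is a minimal sketch of one way around that, using a list of dicts so that each penalty is only paired with solvers that accept it. The split below is my own illustration rather than the asker's code, and it assumes a release that has sklearn.model_selection (in 0.16 the same class lives in sklearn.grid_search).

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# One dict per group of compatible settings; GridSearchCV accepts the same
# list-of-dicts structure for its param_grid argument.
param_grid = [
    {"penalty": ["l1"], "solver": ["liblinear"], "C": C_values,
     "class_weight": [{0: 4}]},
    {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "liblinear"],
     "C": C_values, "class_weight": [{0: 4}]},
]

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)
print(n_candidates)  # 6 l1 candidates + 18 l2 candidates
```

Enumerating the grid like this before fitting is a cheap way to see how many distinct candidates the search will actually compare.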

score 3 · Accepted Answer · answered Oct 01 '15 at 20:54

I think that the author didn't choose this example very well. I may be missing something here, but min_samples_split=1 doesn't make sense to me: isn't it the same as setting min_samples_split=2, since you can't split a node that holds a single sample? Essentially, it's a waste of computational time.

From the documentation: min_samples_split: "The minimum number of samples required to split an internal node."
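For what it's worth, the more recent scikit-learn releases that I'm aware of (0.18 and later, as far as I know) reject an integer min_samples_split below 2 outright instead of silently accepting it. A small sketch to check this on your own installation; the synthetic data is only a stand-in and has nothing to do with the ad dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the ad dataset itself is not needed for this check.
X, y = make_classification(n_samples=200, random_state=0)

try:
    DecisionTreeClassifier(min_samples_split=1).fit(X, y)
    rejected = False
except ValueError as exc:
    # Newer releases raise here because an integer value must be at least 2.
    rejected = True
    print("rejected:", exc)
```

If `rejected` comes out as False on an old release such as 0.16, the value was accepted, but it still cannot do anything that min_samples_split=2 does not.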

By the way, this is a very small grid with little to choose from, which may explain why accuracy and F1 give you the same parameter combination and hence the same scoring tables.

As has been mentioned, the dataset may be well balanced, which is why the F1 and accuracy scores may prefer the same parameter combinations. Looking further at your GridSearch results using (a) F1 score and (b) accuracy, I conclude that in both cases a depth of 150 works best. Since this is the lower boundary of the grid, it gives you a slight hint that lower "depth" values may work even better. However, I suspect that the tree doesn't even go that deep on this dataset (you can end up with "pure" leaves well before reaching the max depth).
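The suspicion about depth is easy to test. Below is a minimal sketch, assuming a scikit-learn version that provides get_depth() (added in 0.21, I believe) and using synthetic data as a stand-in for the ad dataset. It fits one tree per max_depth value from the original grid and records how deep each tree actually grew.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ad data (an assumption for illustration only).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

depths = {}
for cap in (150, 155, 160):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=cap,
                                 random_state=0)
    clf.fit(X, y)
    depths[cap] = clf.get_depth()  # depth the fitted tree actually reached

print(depths)
```

When the reported depth is below every cap, the caps never bind, the three trees are identical, and no scoring function can tell the candidates apart.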

So, let's repeat the experiment with slightly more sensible values, using the following parameter grid:

parameters = {
    'clf__max_depth': list(range(2, 30)),
    'clf__min_samples_split': (2,),
    'clf__min_samples_leaf': (1,)
}

The optimal "depth" for the best F1 score seems to be around 15.

Best score: 0.878
Best parameters set:
    clf__max_depth: 15
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.99       716
          1       0.92      0.89      0.91       104

avg / total       0.98      0.98      0.98       820

Next, let's try it using "accuracy" (or None) as our scoring metric:

Best score: 0.967
Best parameters set:
    clf__max_depth: 6
    clf__min_samples_leaf: 1
    clf__min_samples_split: 2
             precision    recall  f1-score   support

          0       0.98      0.99      0.98       716
          1       0.93      0.85      0.88       104

avg / total       0.97      0.97      0.97       820

As you can see, you get different results now, and the "optimal" depth is different if you use "accuracy."
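If you want to compare the two metrics without running the search twice, newer scikit-learn versions (0.19 and later, as far as I know) accept several scorers in one GridSearchCV call. The following is a rough sketch under that assumption, run on synthetic imbalanced data rather than the ad dataset, so the depths it reports are not comparable to the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 9:1); not the ad dataset.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"max_depth": list(range(2, 12))},
    scoring={"f1": "f1", "accuracy": "accuracy"},
    refit=False,  # no single "best" model; we only want the score tables
    cv=3,
)
grid.fit(X, y)

results = grid.cv_results_
best = {}
for metric in ("f1", "accuracy"):
    idx = results["rank_test_" + metric].argmin()  # rank 1 is the best
    best[metric] = results["params"][idx]["max_depth"]
    print(metric, "prefers max_depth =", best[metric])
```

Whether the two metrics end up preferring different depths depends on the data and the grid. The point is only that both rankings come from the same fitted candidates, so any difference is due to the metric alone.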

Thank you @Sebastian I added some extra info to the question. — Pablo Fleurquin, Oct 01 '15 at 23:00

score 1 · Answer 2 · answered Oct 01 '15 at 16:06

I don't agree that optimizing the parameter search by different scoring functions should necessarily yield different results. If your dataset is balanced (roughly the same number of samples in each class), I would expect that model selection by accuracy and F1 would yield very similar results.
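The flip side is that on a heavily imbalanced dataset the two metrics can disagree sharply. Here is a tiny hand-made illustration (the labels are invented for the example, and the zero_division argument assumes scikit-learn 0.22 or later): a model that always predicts the majority class gets a high accuracy but an F1 of zero for the positive class.

```python
from sklearn.metrics import accuracy_score, f1_score

# 100 negatives and 1 positive, mirroring a 100:1 imbalance.
y_true = [0] * 100 + [1]
# A useless model that always predicts the majority class.
y_pred = [0] * 101

acc = accuracy_score(y_true, y_pred)
# No positive predictions, so precision is undefined; report F1 as 0.
f1 = f1_score(y_true, y_pred, zero_division=0)
print(round(acc, 3), f1)  # 0.99 0.0
```

So if accuracy and F1 still pick the same parameters on such data, the likelier explanation is that the candidates in the grid behave almost identically, not that the metrics agree in general.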

Also, keep in mind that GridSearchCV optimizes over a discrete grid. Maybe using a finer grid of parameters would yield the results that you are looking for.

Thank you @Fabian I added some extra info to the question. – Pablo Fleurquin, Oct 01 '15 at 23:01

score 0 · Answer 3 · answered Dec 22 '15 at 23:06

On an unbalanced dataset use the "labels" parameter of the f1_score scorer to use only the f1 score of the class you are interested in. Or consider using "sample_weight".
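A short sketch of what that could look like, with invented labels and under the assumption that class 1 is the minority class you care about. The scorer name minority_f1 is just a placeholder of mine.

```python
from sklearn.metrics import f1_score, make_scorer

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# F1 restricted to class 1: precision 1/1, recall 1/2, so F1 = 2/3.
f1_minority = f1_score(y_true, y_pred, labels=[1], average="macro")
# F1 of class 0 for comparison: precision 8/9, recall 8/8, so F1 = 16/17.
f1_majority = f1_score(y_true, y_pred, labels=[0], average="macro")

# A scorer that could be passed as GridSearchCV(..., scoring=minority_f1).
minority_f1 = make_scorer(f1_score, labels=[1], average="macro")

print(round(f1_minority, 3), round(f1_majority, 3))
```

With only one label listed, the "macro" average is simply that class's own F1, which keeps the majority class from masking a weak minority score.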

– Diego
