
I am trying the code from this page. I ran it up to the LR (tf-idf) part and got similar results.

After that I decided to try GridSearchCV. My questions are below:

1)

# let's try GridSearchCV
# https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 lasso, l2 ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("best score:", logreg_cv.best_score_)

# tuned hyperparameters (best parameters): {'C': 10.0, 'penalty': 'l2'}
# best score: 0.7390325593588823

Then I calculated the F1 score manually. Why doesn't it match?

# predicted probabilities for class 1
logreg_cv.predict_proba(X_train_vectors_tfidf)[:, 1]

# threshold at 0.5 to get hard labels
final_prediction = np.where(logreg_cv.predict_proba(X_train_vectors_tfidf)[:, 1] >= 0.5, 1, 0)

# https://www.statology.org/f1-score-in-python/
from sklearn.metrics import f1_score

# calculate F1 score on the same training data the model was fitted on
f1_score(y_train, final_prediction)
0.9839388145315489
2) If I try scoring='precision', why does it give the error below? I am not clear on this, mainly because I have a relatively balanced dataset (55%-45%), and F1, which requires precision, is calculated without any problem.

# let's try GridSearchCV
# https://www.kaggle.com/enespolat/grid-search-with-logistic-regression

from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}  # l1 lasso, l2 ridge
logreg = LogisticRegression(solver='liblinear')
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='precision')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

print("tuned hyperparameters (best parameters):", logreg_cv.best_params_)
print("best score:", logreg_cv.best_score_)



/usr/local/lib/python3.7/dist-packages/sklearn/metrics/_classification.py:1308: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
  _warn_prf(average, modifier, msg_start, len(result))
(the same warning is repeated for several CV splits)

tuned hyperparameters (best parameters): {'C': 0.1, 'penalty': 'l2'}
best score: 0.9474200393672962
3) Is there an easier way to get predictions on the training data back? We already have the logreg_cv object. I used the method below to get the predictions; is there a better way to do the same?

logreg_cv.predict_proba(X_train_vectors_tfidf)[:,1]

############ Update 1

1) Please answer question 1 from above. A comment on the question says: "The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search."

Is that the reason I get the results below?

# tuned hyperparameters (best parameters): {'C': 10.0, 'penalty': 'l2'}
# best score: 0.7390325593588823

But when I do it manually I get f1_score(y_train, final_prediction) = 0.9839388145315489.
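To make the comment's point concrete, here is a minimal sketch (assuming the same X_train_vectors_tfidf and y_train from above): the cross-validated F1 of the best model lands near best_score_, while the F1 on the data it was fitted on is much higher.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

best_lr = LogisticRegression(solver='liblinear', C=10.0, penalty='l2')

# F1 averaged over held-out folds, comparable to best_score_ (~0.74)
cv_f1 = cross_val_score(best_lr, X_train_vectors_tfidf, y_train, cv=3, scoring='f1').mean()

# F1 on the same data the model was fitted on, comparable to the manual ~0.98
train_f1 = f1_score(y_train, best_lr.fit(X_train_vectors_tfidf, y_train).predict(X_train_vectors_tfidf))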

2)

I tried tuning with f1_micro as suggested in the answer below. No error message. I am still not clear why f1_micro does not fail when precision fails; the small example after the output below illustrates the difference.

from sklearn.model_selection import GridSearchCV

grid = {"C": np.logspace(-3, 3, 7),
        "penalty": ["l2"],  # l1 lasso, l2 ridge
        "solver": ['liblinear', 'newton-cg'],
        "class_weight": [{0: 0.95, 1: 0.05}, {0: 0.55, 1: 0.45}, {0: 0.45, 1: 0.55}, {0: 0.05, 1: 0.95}]}
#logreg = LogisticRegression(solver='liblinear')
logreg = LogisticRegression()
logreg_cv = GridSearchCV(logreg, grid, cv=3, scoring='f1_micro')
logreg_cv.fit(X_train_vectors_tfidf, y_train)

tuned hyperparameters (best parameters): {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'newton-cg'}
best score: 0.7894909688013136
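As a toy illustration (made-up labels, not this dataset) of why precision can be undefined while f1_micro never is: micro-averaging counts TP, FP and FN over both classes, so the denominator is never empty and, for single-label problems, the score equals plain accuracy.

import numpy as np
from sklearn.metrics import precision_score, f1_score

y_true = np.array([1, 0, 1, 0, 0])
y_pred = np.zeros(5, dtype=int)            # a model that predicts only class 0

precision_score(y_true, y_pred)            # UndefinedMetricWarning, returns 0.0 (no predicted positives)
f1_score(y_true, y_pred, average='micro')  # 0.6, i.e. plain accuracy, no warning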
  • The best score in GridSearchCV is calculated by taking the average score from cross validation for the best estimators. That is, it is calculated from data that is held out during fitting. From what I can tell, you are calculating predicted values from the training data and calculating an F1 score on that. Since the model was trained on that data, that is why the F1 score is so much larger compared to the results in the grid search. – cazman Dec 07 '21 at 17:21
  • On number 2, that is a warning, not an error. It is telling you that there are some labels in y_train that were not predicted, so the precision is 0. – cazman Dec 07 '21 at 17:38
  • I have binary classification with around a 55%-45% split. Why wouldn't it predict one of those labels? Also, the F1 score works without any problem, and the F1 score needs precision. – user2543622 Dec 07 '21 at 17:45
  • The model may not be predicting one of the classes very well. You can test this with `set(y_train) - set(final_prediction)`. If the result is not an empty set, then the model isn't predicting that label. As for the discrepancy, I'm not sure without seeing the data, but you can make the models more reproducible by including `random_state=` when you create the `LogisticRegression` instance. – cazman Dec 07 '21 at 18:39
  • As I have a 55%-45% split, both labels are being predicted, and my earlier question still remains: the F1 score works without any problem, and the F1 score needs precision, so precision on its own should work. – user2543622 Dec 07 '21 at 21:26

1 Answer


You end up with the warning for precision because some of the penalization values are too strong for this model. If you check the results, you get 0 for the F1 score when C = 0.001 and C = 0.01:

import pandas as pd

res = pd.DataFrame(logreg_cv.cv_results_)
res.iloc[:, res.columns.str.contains("split[0-9]_test_score|params", regex=True)]
 
                           params  split0_test_score  split1_test_score  split2_test_score
0   {'C': 0.001, 'penalty': 'l2'}           0.000000           0.000000           0.000000
1    {'C': 0.01, 'penalty': 'l2'}           0.000000           0.000000           0.000000
2     {'C': 0.1, 'penalty': 'l2'}           0.973568           0.952607           0.952174
3     {'C': 1.0, 'penalty': 'l2'}           0.863934           0.851064           0.836449
4    {'C': 10.0, 'penalty': 'l2'}           0.811634           0.769547           0.787838
5   {'C': 100.0, 'penalty': 'l2'}           0.789826           0.762162           0.773438
6  {'C': 1000.0, 'penalty': 'l2'}           0.781003           0.750000           0.763871

You can check this:

# with strong regularization (C=0.01) the model predicts only class 0
lr = LogisticRegression(C=0.01).fit(X_train_vectors_tfidf, y_train)
np.unique(lr.predict(X_train_vectors_tfidf))
array([0])
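As a quick check tying this back to the original warning (a sketch using the same names as above): with no predicted positives, precision for the positive class is undefined and sklearn sets it to 0.

from sklearn.metrics import precision_score

precision_score(y_train, lr.predict(X_train_vectors_tfidf))
# UndefinedMetricWarning: Precision is ill-defined ... -> 0.0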

And that the predicted probabilities drift towards the intercept:

# expected probability
np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))
array([0.41764462])
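The same number can be obtained with scipy's logistic sigmoid, just as a convenience:

from scipy.special import expit

expit(lr.intercept_)
# array([0.41764462])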

lr.predict_proba(X_train_vectors_tfidf)
 
array([[0.58732636, 0.41267364],
       [0.57074279, 0.42925721],
       [0.57219143, 0.42780857],
       ...,
       [0.57215605, 0.42784395],
       [0.56988186, 0.43011814],
       [0.58966184, 0.41033816]])

For the question on "getting predictions on the train data back", I think that's the only way. The model is refitted on the whole training set using the best parameters, but the predictions or predicted probabilities are not stored. If you are looking for the values obtained during the train/test splits, you can check cross_val_predict.
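A minimal sketch of that, reusing the X_train_vectors_tfidf / y_train names from the question and the best parameters found by the grid search:

from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

# an unfitted copy of the refit best model, with the same hyperparameters
best_lr = clone(logreg_cv.best_estimator_)

# out-of-fold predicted probabilities for class 1: each row is predicted by a
# model that did not see it during fitting, so scores computed from these are
# comparable to the grid-search CV results
oof_proba = cross_val_predict(best_lr, X_train_vectors_tfidf, y_train, cv=3, method='predict_proba')[:, 1]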

StupidWolf
  • Thanks! Your answer makes sense. 1) But why don't we get that error when we use `f1` and only get it when we tune `precision`? 2) Why are you doing `np.exp(lr.intercept_)/(1+np.exp(lr.intercept_))`? Is it to calculate the probability when all x coefficients are 0? 3) I tuned the model using the `f1` score and got the recommendation `tuned hyperparameters (best parameters): {'C': 10.0, 'class_weight': {0: 0.45, 1: 0.55}, 'penalty': 'l2', 'solver': 'liblinear'}`. Do you think that is a very high penalty? `best score : 0.7445210598782159` – user2543622 Dec 10 '21 at 03:45
  • Yes, you convert the intercept from logit to probability. No, it's not high. The C parameter is the inverse of regularization: the higher your C, the weaker the regularization or penalty. – StupidWolf Dec 10 '21 at 06:13
  • Also, you are only getting the warnings for some values of C. In the example above, you are predicting with the best params from gridsearchcv, so it's using a C that is definitely not 0.001 or 0.01. If you rerun your search with scoring='f1_micro' you will see the error. – StupidWolf Dec 10 '21 at 06:21
  • Could you reply to my update 1? I found your answer very helpful! Thanks. – user2543622 Dec 10 '21 at 14:15