How to tune maximum entropy's parameter?

Question

I am doing text classification with scikit-learn's logistic regression function (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). I am using grid search to choose a value for the C parameter. Do I need to do the same for the max_iter parameter? Why?

Both the C and max_iter parameters have default values in sklearn, which I take to mean they need to be tuned. But, from what I understand, early stopping and L1/L2 regularization are two separate methods for avoiding overfitting, and performing one of them is enough. Am I incorrect in assuming that tuning the value of max_iter is equivalent to early stopping?

To summarize, here are my main questions:

1- Does max_iter need tuning? Why? (The documentation says it is only useful for certain solvers.)

2- Is tuning the max_iter equivalent to early stopping?

3- Should we perform early stopping and L1/L2 regularization at the same time?

This question is less a programming issue than one about the algorithmic approach, which IMO makes it more suitable for https://stats.stackexchange.com — Vivek Kumar, Nov 24 '17 at 06:00
Delete this from here and then post the same thing again there. Or else wait for the mods to close and migrate this question. — Vivek Kumar, Nov 24 '17 at 06:12

score 1 · Accepted Answer · answered Nov 24 '17 at 08:43

Here are some simple, grossly simplified responses to your numbered questions:

1- Yes, sometimes you need to tune max_iter. Why? See the next point.
2- No. max_iter is the maximum number of iterations that the logistic regression classifier's solver is allowed to step through before being stopped. The aim is to reach a "stable" solution for the parameters of the logistic regression model, i.e., it is an optimisation problem. If your max_iter is too low, the solver may stop before reaching an optimal solution and your model is underfit. If it is too high and the solver converges slowly, you can wait a very long time for little gain in accuracy. Note that a low max_iter does not trap you in a local optimum, since the logistic regression objective is convex; it simply leaves the solver short of the minimum.
3- Yes or no, depending on your problem:

a. L1/L2 regularisation is essentially "smoothing" of your complex model so that it does not overfit to the training data. If parameters become too large, they are penalised in the cost.

b. Early stopping is when you stop optimising your model (e.g., via gradient descent) at a stage you deem acceptable, before max_iter is reached. For example, you can stop once a metric such as RMSE stops improving, or based on a comparison of the metrics on your training and held-out data.

c. When to use them? This depends on your problem. If you have a simple linear problem with limited features, you will not need regularisation or early stopping. If you have thousands of features and experience overfitting, then apply regularisation as one solution. If you are experimenting with parameters, only care about reaching a certain level of accuracy, and do not want to wait for the optimisation to run to the end, you could apply early stopping.
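
To make the difference between points a, b and a plain iteration cap concrete, here is a minimal sketch in pure Python. It uses hand-written gradient descent rather than scikit-learn's solvers, and everything in it (the synthetic data, the fit function, the patience rule, the learning rate) is a hypothetical illustration, not how LogisticRegression works internally.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def log_loss(w, data):
    # Mean unpenalised negative log-likelihood over (features, label) pairs.
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(dot(w, x))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fit(train, val, max_iter, l2=0.0, patience=None, lr=0.5):
    """Gradient descent; returns (weights, iterations_run, stopped_early)."""
    w = [0.0] * len(train[0][0])
    best_w, best_val, since_best = list(w), log_loss(w, val), 0
    for it in range(1, max_iter + 1):
        grad = [l2 * wi for wi in w]  # gradient of the penalty (l2 / 2) * ||w||^2
        for x, y in train:
            err = sigmoid(dot(w, x)) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(train)
        w = [wi - lr * g for wi, g in zip(w, grad)]
        if patience is None:
            continue  # no early stopping: only max_iter ends the loop
        val_loss = log_loss(w, val)
        if val_loss < best_val:
            best_w, best_val, since_best = list(w), val_loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best_w, it, True  # validation loss stopped improving
    return (w if patience is None else best_w), max_iter, False

def make_data(n, rng, d=20):
    # Hypothetical synthetic data: only the first two features carry signal.
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = 1 if rng.random() < sigmoid(2 * x[0] - 2 * x[1]) else 0
        data.append((x, y))
    return data

rng = random.Random(0)
train, val = make_data(30, rng), make_data(200, rng)

runs = {
    "iteration cap only": fit(train, val, max_iter=300),
    "L2 penalty only": fit(train, val, max_iter=300, l2=0.1),
    "early stopping only": fit(train, val, max_iter=300, patience=10),
}
for name, (w, n_iter, stopped) in runs.items():
    print(f"{name:20s} iters={n_iter:3d} stopped_early={stopped} "
          f"val_loss={log_loss(w, val):.3f}")
```

The point of the sketch is that the iteration cap never looks at held-out data, while early stopping decides when to stop from the validation loss. As far as I know, LogisticRegression itself has no validation-based early stopping; SGDClassifier exposes an early_stopping option if you want that behaviour within scikit-learn.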

Finally, how do I tune max_iter correctly? This depends on your problem at hand. If your classification metric shows your model is performing poorly, it could be that your solver has not taken enough steps to reach a minimum. Rather than automating it, I'd suggest you do this by hand: look at the cost vs. max_iter to see whether the solver is properly reaching a minimum.
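
One way to do that check by hand, as a rough sketch: refit with increasing iteration budgets and record the training loss, the number of iterations actually used (n_iter_), and whether scikit-learn raised a ConvergenceWarning. The synthetic dataset, the loss_vs_max_iter helper and the list of budgets are assumptions made for illustration; the loss printed is the unpenalised log loss, not the exact penalised objective the solver minimises.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Synthetic stand-in for a real text-classification feature matrix.
X, y = make_classification(
    n_samples=500, n_features=50, n_informative=10, random_state=0
)

def loss_vs_max_iter(X, y, budgets, C=1.0):
    """Fit once per iteration budget and record how far the solver got."""
    rows = []
    for max_iter in budgets:
        clf = LogisticRegression(C=C, solver="lbfgs", max_iter=max_iter)
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            clf.fit(X, y)
        # The solver warns when it stops at max_iter without converging.
        hit_cap = any(issubclass(w.category, ConvergenceWarning) for w in caught)
        train_loss = log_loss(y, clf.predict_proba(X))
        rows.append((max_iter, int(clf.n_iter_[0]), hit_cap, train_loss))
    return rows

results = loss_vs_max_iter(X, y, budgets=[1, 2, 5, 10, 50, 1000])
for max_iter, n_iter, hit_cap, train_loss in results:
    print(f"max_iter={max_iter:5d} n_iter_={n_iter:4d} "
          f"hit_cap={hit_cap} train_loss={train_loss:.4f}")
```

If the loss has flattened out and the warning has disappeared, raising max_iter further should change nothing, because the solver stops on its own tolerance. In that case max_iter only needs to be "large enough" and can be left out of the grid search over C.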
