7

I'm fitting a logistic regression model and am setting the random state to a fixed value.

Every time I call fit I get slightly different coefficients. For example:

>>> classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.071101940040772596  ,  0.05143724979709707323,  0.071101940040772596  , -0.04089477198935181912, -0.0407380696457252528 ,  0.03622160087086594843,  0.01055345545606742319,
         0.01071861708285645406, -0.36248634699444892693, -0.06159019047096317423,  0.02370064668025737009,  0.02370064668025737009, -0.03159781822495803805,  0.11221150783553821006,
         0.02728295348681779309,  0.071101940040772596  ,  0.071101940040772596  ,  0.                    ,  0.10882033432637286396,  0.64630314505709030026,  0.09617956519989406816,
         0.0604133873444507169 ,  0.                    ,  0.04111685986987245051,  0.                    ,  0.                    ,  0.18312324521915510078,  0.071101940040772596  ,
         0.071101940040772596  ,  0.                    , -0.59561802045324663268, -0.61490898457874587635,  1.07812569991461248975,  0.071101940040772596  ]])

>>> classifier_instance.fit(train_examples_features, train_examples_labels)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=1, tol=0.0001)

>>> classifier_instance.raw_coef_
array([[ 0.07110193825129411394,  0.05143724970282205489,  0.07110193825129411394, -0.04089477178162870957, -0.04073806899140903354,  0.03622160048165772028,  0.010553455400928528  ,
         0.01071860364222424096, -0.36248635488413910588, -0.06159021545062405567,  0.02370064608376460866,  0.02370064608376460866, -0.03159783710841745225,  0.11221149816037970237,
         0.02728295411479400578,  0.07110193825129411394,  0.07110193825129411394,  0.                    ,  0.10882033461822394893,  0.64630314701686075729,  0.09617956493834901865,
         0.06041338563697066372,  0.                    ,  0.04111676713793514099,  0.                    ,  0.                    ,  0.18312324401049043243,  0.07110193825129411394,
         0.07110193825129411394,  0.                    , -0.59561803345113684127, -0.61490899867901249731,  1.07812569539027203191,  0.07110193825129411394]])

I'm using version 0.14. The docs state: "The underlying C implementation uses a random number generator to select features when fitting the model. It is thus not uncommon, to have slightly different results for the same input data. If that happens, try with a smaller tol parameter."

I thought that setting the random state would make sure there is no randomness but apparently this is not the case. Is this a bug or desired behavior?
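The docs' suggestion to try a smaller tol can be checked with a sketch like the one below. It assumes a recent scikit-learn (where the `solver` argument exists and the fitted coefficients are exposed as `coef_`, not the `raw_coef_` attribute of 0.14) and uses synthetic stand-in data, since the original training data is not shown. The helper name `max_coef_gap` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; the question's data is not shown).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)


def max_coef_gap(tol):
    """Fit twice with identical settings and return the largest
    absolute difference between the two coefficient vectors."""
    coefs = []
    for _ in range(2):
        clf = LogisticRegression(C=1.0, tol=tol, solver="liblinear",
                                 random_state=1)
        clf.fit(X, y)
        coefs.append(clf.coef_.copy())
    return float(np.max(np.abs(coefs[0] - coefs[1])))


gap_default = max_coef_gap(1e-4)  # the tol used in the question
gap_tight = max_coef_gap(1e-8)    # a smaller tol, as the docs suggest
print(gap_default, gap_tight)
```

If the two fits already agree exactly on a given installation, both gaps will be zero; the comparison is only informative where the run-to-run variation actually shows up.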

Fred Foo
jonathans
  • I also noticed that this behavior sometimes changes between "runs": in one Python session, repeated calls to fit generate different coefficients, and after restarting Python they sometimes do not. Very strange. – jonathans Jun 26 '14 at 18:14

2 Answers

3

It's not really desired, but it's a known issue that is very hard to fix. The thing is that LogisticRegression models are trained with Liblinear, which does not allow setting its random seed in a completely robust way. When you explicitly set the random_state, a best effort is made to set Liblinear's random seed, but that may fail.

Fred Foo
  • If it is essential to have exactly the same coefficients, you can write a function `get_logistic_regression_coef` which fits the model and returns the coefficients, and then cache it using `sklearn.externals.joblib.Memory`. That way it will save the results to disk for a given input and reload those at the second call if the input didn't change. – eickenberg Jun 26 '14 at 15:25
  • If in every run all the digits after, say, 6 decimal places change, then for all intents and purposes they are meaningless, no? What about just rounding the coefficients to 6 decimal places? That way the results will be deterministic. – jonathans Jun 26 '14 at 20:22
  • @jonathans There might still be random flukes that cause bigger differences. I feel your suggestion amounts to hiding the problem, rather than solving it (e.g. by installing a different RNG in Liblinear, which is tricky but possible). – Fred Foo Jun 27 '14 at 08:00
  • I agree. At this point I'm resorting to what I can personally control, and fixing this directly in Liblinear is not something I feel qualified to do. – jonathans Jun 27 '14 at 21:37
  • Does that also apply to the ‘lbfgs’ and ‘newton-cg’ solvers? – Franck Dernoncourt Nov 01 '15 at 05:13
  • Does anyone know if there has been a recent fix for this since then? I am using sklearn version 0.19.1 – tsando May 21 '18 at 16:56
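The caching idea from eickenberg's comment above can be sketched roughly as follows. This is a minimal illustration, not a tested recipe for any particular version: it assumes the standalone `joblib` package (recent scikit-learn releases no longer ship `sklearn.externals.joblib`), a temporary cache directory, and synthetic stand-in data. The function name `get_logistic_regression_coef` comes from the comment; its signature here is an assumption.

```python
import tempfile

import numpy as np
from joblib import Memory
from sklearn.linear_model import LogisticRegression

# Cache directory on disk; a temporary folder is used here for illustration.
memory = Memory(location=tempfile.mkdtemp(), verbose=0)


@memory.cache
def get_logistic_regression_coef(X, y, C=1.0, tol=1e-4, random_state=1):
    """Fit the model and return its coefficients.

    joblib hashes the arguments, so a repeated call with the same
    inputs reloads the stored result instead of refitting."""
    clf = LogisticRegression(C=C, tol=tol, solver="liblinear",
                             random_state=random_state)
    clf.fit(X, y)
    return clf.coef_


# Synthetic stand-in data (an assumption; the question's data is not shown).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

coef_first = get_logistic_regression_coef(X, y)   # fits and stores
coef_second = get_logistic_regression_coef(X, y)  # reloads from the cache
print(np.array_equal(coef_first, coef_second))
```

Note that this only makes repeated calls with identical inputs return the stored coefficients; it does not remove whatever nondeterminism exists in the fit itself.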
1

I was baffled by the problem as well, but eventually found that it was also necessary to call numpy.random.seed() to set the state of numpy's internal RNG, in addition to passing random_state.

This was tested with sklearn 0.13.1.
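A minimal sketch of that approach, seeding numpy's global RNG before each fit in addition to passing random_state. It is written against a recent scikit-learn (the `solver` argument and the `coef_` attribute are assumptions relative to 0.13.1), uses synthetic stand-in data, and the helper name `fit_coef` is hypothetical. Whether this actually removes the variation appears to depend on the version, so treat it as something to try, not as a guaranteed fix.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; the question's data is not shown).
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)


def fit_coef(X, y, seed=1):
    """Seed numpy's global RNG, then fit with the same random_state."""
    np.random.seed(seed)  # global numpy RNG state
    clf = LogisticRegression(C=1.0, tol=1e-4, solver="liblinear",
                             random_state=seed)  # estimator-level seed
    clf.fit(X, y)
    return clf.coef_.copy()


coef_a = fit_coef(X, y)
coef_b = fit_coef(X, y)
# Largest absolute difference between the two fits.
print(np.max(np.abs(coef_a - coef_b)))
```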

  • This worked for me! Do we know why it works? Perhaps the srand function inside logistic regression uses the global numpy random seed? – P.C. Feb 19 '18 at 05:04
  • This didn't work for me... I am using sklearn 0.19.1. @P.C. which sklearn version were you using? – tsando May 21 '18 at 16:55
  • I have both set, but the result still varies. Do you have to set them to the same value or to different values? Any tricks there? – Sapiens Oct 15 '20 at 10:38