We upgraded our sklearn from the old 0.13-git to 0.14.1 and found that the behavior of our logistic regression classifier changed quite a bit: classifiers trained on the same data under the two versions have different coefficients, and therefore often give different classification results.
As an experiment, I trained the LR classifier on 5 high-dimensional data points, and the results are:
0.13-git:
>>> clf.fit(data_test.data, y)
LogisticRegression(C=10, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', tol=0.0001)
>>> np.sort(clf.coef_)
array([[-0.12442518, -0.11137502, -0.11137502, ...,  0.05428562,
         0.07329358,  0.08178794]])
0.14.1:
>>> clf1.fit(data_test.data, y)
LogisticRegression(C=10, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
>>> np.sort(clf1.coef_)
array([[-0.11702073, -0.10505662, -0.10505662, ...,  0.05630517,
         0.07651478,  0.08534311]])
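To put a number on the gap, here is how I'd compare the two coefficient vectors side by side (a rough sketch; the .npy file names are hypothetical stand-ins for however the arrays get dumped from each environment):

import numpy as np

# Coefficients saved from each environment beforehand, e.g. via
# np.save("coef_013.npy", clf.coef_) under 0.13-git and
# np.save("coef_0141.npy", clf1.coef_) under 0.14.1.
coef_old = np.load("coef_013.npy")
coef_new = np.load("coef_0141.npy")

diff = np.abs(coef_old - coef_new)
print("max abs diff: ", diff.max())
print("mean abs diff:", diff.mean())
print("max rel diff: ", diff.max() / np.abs(coef_old).max())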
I would say the difference is quite big, on the order of 10^(-2). Obviously the data I used here is not ideal, since the feature dimensionality is much larger than the number of samples; however, that is often the case in practice as well. Does it have something to do with feature selection? How can I make the results the same as before? I understand the new results are not necessarily worse than the old ones, but right now the focus is on making them as consistent as possible.
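One check that might rule out convergence effects is tightening the tolerance in both environments before comparing again (tol=1e-8 is an arbitrary pick on my part; all other settings match the ones shown above):

from sklearn.linear_model import LogisticRegression

# Same settings as before, but with a much tighter stopping tolerance,
# to see whether the coefficient gap is just a convergence artifact.
clf_tight = LogisticRegression(C=10, class_weight='auto', dual=False,
                               fit_intercept=True, intercept_scaling=1,
                               penalty='l2', tol=1e-8)
clf_tight.fit(data_test.data, y)
np.sort(clf_tight.coef_)

Thanks.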