
I'm using scikit-learn's LogisticRegression for a multiclass problem.

logit = LogisticRegression(penalty='l1')
logit = logit.fit(X, y)

I'm interested in which features are driving this decision.

logit.coef_

The above gives me a beautiful array of shape (n_classes, n_features), but all the class and feature names are gone. With features, that's okay, because it seems safe to assume they're indexed the same way I passed them in...

But with classes, it's a problem, since I never explicitly passed the classes in any particular order. So which class do coefficient sets (rows in the array) 0, 1, 2, and 3 belong to?

  • It will simply be ordered from index 0 to `n_classes-1`. Did you pass numeric or strings in `y`? If strings, then LabelEncoder will be used on it to convert it to numeric form. Can you show your `y` here? – Vivek Kumar Apr 26 '17 at 00:51
  • Strings. labels are: `array(['GR3', 'GR4', 'SHH', 'GR3', 'GR4', 'SHH', 'GR4', 'SHH', 'GR4', 'WNT', 'GR3', 'GR4', 'GR3', 'SHH', 'SHH', 'GR3', 'GR4', 'SHH', 'GR4', 'GR3', 'SHH', 'GR3', 'SHH', 'GR4', 'SHH', 'GR3', 'GR4', 'GR4', 'SHH', 'GR4', 'SHH', 'GR4', 'GR3', 'GR3', 'WNT', 'SHH', 'GR4', 'SHH', 'SHH', 'GR3', 'WNT', 'GR3', 'GR4', 'GR3', 'SHH'], dtype=object)` and I get classes `0, 1, 2, 3`. Which corresponds to which? – Alex Lenail Apr 26 '17 at 01:51
  • Is there some way to access the LabelEncoder object inside the LogisticRegression object? – Alex Lenail Apr 26 '17 at 02:01

1 Answer


The order is the same as that of logit.classes_ (classes_ is an attribute of the fitted model holding the unique classes found in y). For string labels, they are sorted alphabetically.
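As a quick sanity check of the alphabetical ordering: scikit-learn derives the class order with np.unique, which sorts string labels lexicographically, so you can reproduce the order directly (a minimal sketch with made-up labels):

```python
import numpy as np

# np.unique returns the sorted unique values -- the same order
# scikit-learn uses for classes_ when y contains strings.
labels = np.array(['SHH', 'GR3', 'WNT', 'GR4', 'GR3'], dtype=object)
print(np.unique(labels))  # ['GR3' 'GR4' 'SHH' 'WNT']
```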

To illustrate, let's fit LogisticRegression on a random dataset with the labels y mentioned above:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(45,5)
y = np.array(['GR3', 'GR4', 'SHH', 'GR3', 'GR4', 'SHH', 'GR4', 'SHH',
              'GR4', 'WNT', 'GR3', 'GR4', 'GR3', 'SHH', 'SHH', 'GR3', 
              'GR4', 'SHH', 'GR4', 'GR3', 'SHH', 'GR3', 'SHH', 'GR4', 
              'SHH', 'GR3', 'GR4', 'GR4', 'SHH', 'GR4', 'SHH', 'GR4', 
              'GR3', 'GR3', 'WNT', 'SHH', 'GR4', 'SHH', 'SHH', 'GR3',
              'WNT', 'GR3', 'GR4', 'GR3', 'SHH'], dtype=object)

lr = LogisticRegression()
lr.fit(X, y)

# This is what you want
lr.classes_

#Out:
#    array(['GR3', 'GR4', 'SHH', 'WNT'], dtype=object)

lr.coef_
#Out:
#    array of shape [n_classes, n_features]

So in the coef_ matrix, row index 0 corresponds to 'GR3' (the first class in the classes_ array), row 1 to 'GR4', and so on.
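If you want the labeled table the question asks about, one sketch is to pair classes_ with the rows and your feature names with the columns in a pandas DataFrame (the feature_names list here is a placeholder -- substitute your own column names):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

X = np.random.rand(45, 5)
y = np.array(['GR3', 'GR4', 'SHH'] * 15, dtype=object)  # placeholder labels

lr = LogisticRegression()
lr.fit(X, y)

# Hypothetical feature names -- replace with the columns you passed in.
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']

# Rows are labeled by classes_, so the mapping is explicit.
coef_df = pd.DataFrame(lr.coef_, index=lr.classes_, columns=feature_names)
print(coef_df)
```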

Hope it helps.

  • Thanks a ton Vivek! – Alex Lenail Apr 26 '17 at 23:39
  • @Vivek can you also please explain how to get coefficients separately for the positive and negative class in a binary problem, i.e., clf.classes_ = [0, 1] and clf.coef_ = [[ some values]] – Aman Mar 10 '19 at 15:56