
When using the linear_model.LogisticRegression() class to obtain 3 regression coefficients for 3 predictor variables (X) against a dependent variable (y), I inspect the logreg.coef_ attribute to see each coefficient, and instead far more values appear than I expected. How did it manage to register many more coefficients than the features I originally input? I expect only 3 values, one for each predictor variable.

Image 1 shows the output of: dataframe.head()

Image 2 shows the output of: print(logreg.coef_)

import pandas as pd
import numpy as np
from scipy import stats
from sklearn import linear_model

data = pd.read_excel('DATASET')

dataframe = data[['GNIpc', 'Marriage female', 'waged male', 'waged  female']].replace('..', np.nan).dropna()

X = dataframe[['Marriage female', 'waged male','waged  female']]
y = dataframe[['GNIpc']]
logreg = linear_model.LogisticRegression()
logreg.fit(X, y)

print(logreg.coef_)
>>> [[-0.0532999   0.0282386  -0.36440672]
    [-0.03349039 -0.09097408  0.0516077 ]
    [ 0.02133783 -0.10573915  0.03944377]
    [-0.02723709 -0.09365962  0.0625376 ]
    [-0.02377661  0.10073943 -0.6386778 ]
    [-0.0162161  -0.05130708 -0.21533241]
    [-0.09565614  0.03214048 -0.12573514]
    [-0.11774399  0.04124659 -0.08295302]
    [ 0.01697128 -0.3196196   0.18449796]
    [-0.03153424 -0.09193552  0.02516725]
    [ 0.00496581 -0.297038    0.17636911]
    [ 0.02503764 -0.13152531 -0.36763286]
    [-0.52025686  0.3663963  -0.46018477]
    [ 0.12337318 -0.41343403 -0.83253983]
    [-0.01623575 -0.02691109 -0.06407165]
    [-0.01307591 -0.10721795  0.06188949]
    [-0.08106017  0.02097464 -0.06847169]
    [-0.03246505 -0.12340276  0.03465779]
    [-0.03058392 -0.17116052  0.13834497]
    [-0.04529128 -0.08847383  0.06050442]
    [ 0.00324746 -0.70348851  0.5887903 ]
    [-0.0730169   0.04685963 -0.17306655]
    [-0.20895759  0.21741604 -0.2835841 ]
    [-0.04765593 -0.02911799 -0.04101694]
    [-0.06553731  0.01516212 -0.10556077]
    [-0.17959739  0.39386919 -0.97548649]
    [-0.03869242 -0.12421051  0.0962199 ]
    [-0.02286379 -0.10571808  0.02182333]
    [-0.91660719  0.3343537  -0.31409916]
    [-0.09193558 -0.06053258  0.04748263]
    [-0.10195001  0.07841969 -0.16552518]
    [-0.36625827 -0.46961584  0.43743011]
    [-0.49169925  0.01808853 -0.00918122]
    [-0.30465374  0.09363753 -0.09558291]
    [-0.06388412 -0.05418759  0.0341766 ]
    [-0.10131437 -0.00557687 -0.00839488]]

X.shape
>>> (42, 3)
y.shape
>>> (42, 1)
kennethm
  • Could you please provide X.shape and y.shape before fitting the model? It would also be nice if you put in the output of `logreg.coef_`. – Batuhan B Apr 02 '20 at 18:11
  • Just edited all of that in now, thanks for asking. – kennethm Apr 02 '20 at 18:18
  • Did you try normalizing the data and applying regularization? Or try to eyeball the data manually; there might be some bad data point or value! – Adeel Ijaz Apr 02 '20 at 18:06
  • I will try both now. As for manually eyeballing it, the single-variable and bivariate visualisations look fairly well-shaped and coherent, and the data types for each of the feature values are appropriate. – kennethm Apr 02 '20 at 18:21
  • Is this the only code? Did you do cross validation, fit multiple models, anything else to the data that might be relevant? – G. Anderson Apr 02 '20 at 18:36
  • If the answer solved your question, could you mark it as accepted? – tkja Apr 06 '20 at 22:21
  • I'm afraid I didn't end up solving it via these solutions; I had to use a different statsmodels approach. If that is the solution then I will happily mark it off, but the fact that the code doesn't work as a straight cut-and-paste doesn't make sense given the example I took it from. – kennethm Apr 11 '20 at 10:56

1 Answer

If you take a look at the documentation of the LogisticRegression class that you are using, you will see:

coef_ is of shape (1, n_features) when the given problem is binary.

Your given problem, as it is put into the classifier, is not binary.
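The effect is easy to reproduce with synthetic data (not from the original post): a classifier treats every distinct value of y as its own class, so a continuous-looking target produces one row of coefficients per unique value. A minimal sketch, assuming made-up data in place of the original spreadsheet:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 3))                      # 42 samples, 3 predictors
y = np.round(rng.normal(size=42), 2).astype(str)  # continuous values read in as labels

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One row of coefficients per "class", i.e. per unique value of y,
# not one coefficient per predictor.
print(clf.coef_.shape)        # (number of unique y values, 3)
print(len(np.unique(y)))
```

With roughly as many "classes" as samples, the fitted model is meaningless; the shape of coef_ is the diagnostic here.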

Also, if you inspect your coef_ carefully, you will see that the output is a nested array (a list of rows). Each inner row has three elements, which correspond to the three coefficients of the model for one binary decision problem. This is also explained in the documentation, quoting:

In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’.

I would recommend reading up on the basics of linear/logistic regression, Python lists, and strategies for approaching multiclass problems, for example the One-vs-the-rest (OvR) multiclass/multilabel strategy.
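You can observe how coef_ grows with the number of classes using sklearn.datasets.make_classification, a small experiment sketched here (parameter values are illustrative, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

shapes = {}
for n_classes in (2, 3, 5):
    # Generate a toy classification problem with 4 features.
    X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                               n_redundant=0, n_clusters_per_class=1,
                               n_classes=n_classes, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    shapes[n_classes] = clf.coef_.shape
    print(n_classes, clf.coef_.shape)
```

For the binary case coef_ has shape (1, n_features); for every multiclass case it has one row per class, which is exactly the pattern in the question's output.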

tkja
  • Thank you so much for your answer. My mistake here, I guess, is that I was blindly applying code from a course to a different data set rather than trying to understand what is appropriate for it. Excuse me if I'm wrong, but are you saying that because none of my variables are binary, this function won't return 3 coefficients the way I want it to? – kennethm Apr 03 '20 at 10:13
  • You will have more, see also: https://stackoverflow.com/questions/48508127/how-to-get-coefficients-of-multinomial-logistic-regression?rq=1#comment84011074_48508127 You could also test different classification problems with https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html to observe how the coef_ changes when you change n_classes from 2 to a higher integer. – tkja Apr 03 '20 at 16:53