7

I'm using statsmodels for logistic regression analysis in Python. For example:

import statsmodels.api as sm
import numpy as np
x = np.arange(0, 1, 0.01)
y = np.random.rand(100)
y[y<=x] = 1
y[y!=1] = 0
x = sm.add_constant(x)
lr = sm.Logit(y,x)
result = lr.fit().summary()

But I want to define different weightings for my observations. I'm combining 4 datasets of different sizes, and want to weight the analysis such that the observations from the largest dataset do not dominate the model.

user2448817
  • statsmodels currently supports weights only for the linear regression model. – Josef May 04 '14 at 00:55
  • 1
    GLM with family binomial allows: `Binomial family models accept a 2d array with two columns. If supplied, each observation is expected to be [success, failure].` It might be possible to use this to define sample weights, but I never tried. – Josef May 04 '14 at 01:11
  • **Update:** GLM in statsmodels now has weight options, var_weights and freq_weights (a minimal sketch follows below). – Josef Aug 30 '21 at 14:57
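
For reference, a minimal sketch of the freq_weights option mentioned above, reusing the question's simulated data; the weighting scheme here (counting each y == 0 observation three times) is purely illustrative:

import numpy as np
import statsmodels.api as sm

x = sm.add_constant(np.arange(0, 1, 0.01))
y = np.random.rand(100)
y[y <= x[:, 1]] = 1
y[y != 1] = 0
w = np.where(y == 1, 1, 3)  # frequency weights: count each failure 3 times

result = sm.GLM(y, x, family=sm.families.Binomial(), freq_weights=w).fit()
print(result.summary())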

2 Answers

6

It took me a while to work this out, but it is actually quite easy to create a logit model in statsmodels with weighted rows (multiple observations per row). Here's how it's done:

import statsmodels.api as sm

# Endog is a two-column [Successes, Failures] count matrix, so each row's
# counts act as observation weights; logit is the Binomial default link.
logmodel = sm.GLM(
    trainingdata[['Successes', 'Failures']],
    trainingdata[['const', 'A', 'B', 'C', 'D']],
    family=sm.families.Binomial(),
).fit()
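
For context, here is a sketch of how such a trainingdata frame could be built from row-level 0/1 outcomes; the column names follow the answer, while the raw frame and its outcome column are made up for illustration:

import pandas as pd
import statsmodels.api as sm

# Hypothetical row-level data: one row per observation, outcome is 0 or 1.
raw = pd.DataFrame({
    'A': [0, 0, 1, 1], 'B': [1, 0, 1, 0],
    'C': [0, 1, 0, 1], 'D': [1, 1, 0, 0],
    'outcome': [1, 0, 1, 1],
})

# Collapse duplicate covariate patterns into success/failure counts, so each
# aggregated row is weighted by the number of observations it represents.
trainingdata = (
    raw.groupby(['A', 'B', 'C', 'D'])['outcome']
       .agg(Successes='sum', Failures=lambda s: (s == 0).sum())
       .reset_index()
)
trainingdata = sm.add_constant(trainingdata)  # adds the 'const' column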
user3805082
  • Thanks! Just an FYI (because I got an error that took me a few minutes to catch): whereas patsy within the `glm(...)` function will dummy-fy categorical variables for you, in the `GLM(...)` above you have to do it yourself, e.g. with `pandas.get_dummies(...)`. – DrMisha May 18 '15 at 21:07
  • 17
    This is quite unclear. Can you clarify how you are introducing weighting? – Will Beauchamp Jul 09 '15 at 18:11
  • 6
    `Successes` and `Failures` within `trainingdata` refer to the number of observations of each. This number of observations therefore provides a weighting. – user3805082 Aug 15 '16 at 19:12
1

Not sure about statsmodels, but with scikit-learn it is very easy: you can use an SGDClassifier with sample_weight.

Example:

from sklearn import linear_model

X = [[0., 0.], [1., 1.]]
y = [0, 1]
weight = [0.5, 0.5]  # one weight per observation

# loss="log_loss" selects logistic regression (named "log" in older versions)
clf = linear_model.SGDClassifier(loss="log_loss")
clf.fit(X, y, sample_weight=weight)
print(clf.predict([[-0.8, -1]]))
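
To connect this to the question, the weights need not be uniform. Here is a sketch where each row is down-weighted by the size of the dataset it came from; the data and dataset sizes are made up:

import numpy as np
from sklearn import linear_model

X = np.array([[0., 0.], [0.5, 0.5], [1., 1.], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])

# Hypothetical: the first three rows come from a dataset of 3 observations,
# the last row from a dataset of 1, so weight each row by 1 / dataset size.
dataset_sizes = np.array([3, 3, 3, 1])
weight = 1.0 / dataset_sizes

clf = linear_model.SGDClassifier(loss="log_loss")
clf.fit(X, y, sample_weight=weight)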
kazAnova