7

I'm using statsmodels for logistic regression analysis in Python. For example:

import statsmodels.api as sm
import numpy as np
x = np.arange(0, 1, 0.01)
y = np.random.rand(100)
y[y<=x] = 1
y[y!=1] = 0
x = sm.add_constant(x)
lr = sm.Logit(y,x)
result = lr.fit().summary()

But I want to define different weightings for my observations. I'm combining 4 datasets of different sizes, and want to weight the analysis such that the observations from the largest dataset do not dominate the model.

user2448817
  • statsmodels currently supports weights only for the linear regression model. – Josef May 04 '14 at 00:55
  • 1
    GLM with family binomial allows: `Binomial family models accept a 2d array with two columns. If supplied, each observation is expected to be [success, failure].` It might be possible to use this to define sample weights, but I never tried. – Josef May 04 '14 at 01:11
  • **Update:** GLM in statsmodels now has weight options, var_weights and freq_weights (a minimal sketch follows below). – Josef Aug 30 '21 at 14:57
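
For reference, a minimal sketch of the freq_weights option mentioned above, reusing the question's simulated data; the weighting scheme here (counting each y == 0 observation three times) is purely illustrative:

import numpy as np
import statsmodels.api as sm

x = sm.add_constant(np.arange(0, 1, 0.01))
y = np.random.rand(100)
y[y <= x[:, 1]] = 1
y[y != 1] = 0
w = np.where(y == 1, 1, 3)  # frequency weights: count each failure 3 times

result = sm.GLM(y, x, family=sm.families.Binomial(), freq_weights=w).fit()
print(result.summary())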

2 Answers

6

It took me a while to work this out, but it is actually quite easy to create a logit model in statsmodels with weighted rows (multiple observations per row). Here's how it's done:

import statsmodels.api as sm

# Endog is a two-column [Successes, Failures] count matrix, so each row's
# counts act as observation weights; logit is the Binomial default link.
logmodel = sm.GLM(
    trainingdata[['Successes', 'Failures']],
    trainingdata[['const', 'A', 'B', 'C', 'D']],
    family=sm.families.Binomial(),
).fit()
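
For context, here is a sketch of how such a trainingdata frame could be built from row-level 0/1 outcomes; the column names follow the answer, while the raw frame and its outcome column are made up for illustration:

import pandas as pd
import statsmodels.api as sm

# Hypothetical row-level data: one row per observation, outcome is 0 or 1.
raw = pd.DataFrame({
    'A': [0, 0, 1, 1], 'B': [1, 0, 1, 0],
    'C': [0, 1, 0, 1], 'D': [1, 1, 0, 0],
    'outcome': [1, 0, 1, 1],
})

# Collapse duplicate covariate patterns into success/failure counts, so each
# aggregated row is weighted by the number of observations it represents.
trainingdata = (
    raw.groupby(['A', 'B', 'C', 'D'])['outcome']
       .agg(Successes='sum', Failures=lambda s: (s == 0).sum())
       .reset_index()
)
trainingdata = sm.add_constant(trainingdata)  # adds the 'const' column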
user3805082
  • Thanks! Just an FYI (because I got an error that took me a few minutes to catch): whereas patsy within the `glm(...)` function will dummy-fy categorical variables for you, in the `GLM(...)` above you have to do it yourself, e.g. with `pandas.get_dummies(...)`. – DrMisha May 18 '15 at 21:07
  • 17
    This is quite unclear. Can you clarify how you are introducing weighting? – Will Beauchamp Jul 09 '15 at 18:11
  • 6
    `Successes` and `Failures` within `trainingdata` refer to the number of observations of each. This number of observations therefore provides a weighting. – user3805082 Aug 15 '16 at 19:12
1

Not sure about statsmodels, but with scikit-learn it is very easy: you can use an SGDClassifier with sample_weight.

Example:

from sklearn import linear_model

X = [[0., 0.], [1., 1.]]
y = [0, 1]
weight = [0.5, 0.5]  # one weight per observation

# loss="log_loss" selects logistic regression (named "log" in older versions)
clf = linear_model.SGDClassifier(loss="log_loss")
clf.fit(X, y, sample_weight=weight)
print(clf.predict([[-0.8, -1]]))
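
To connect this to the question, the weights need not be uniform. Here is a sketch where each row is down-weighted by the size of the dataset it came from; the data and dataset sizes are made up:

import numpy as np
from sklearn import linear_model

X = np.array([[0., 0.], [0.5, 0.5], [1., 1.], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])

# Hypothetical: the first three rows come from a dataset of 3 observations,
# the last row from a dataset of 1, so weight each row by 1 / dataset size.
dataset_sizes = np.array([3, 3, 3, 1])
weight = 1.0 / dataset_sizes

clf = linear_model.SGDClassifier(loss="log_loss")
clf.fit(X, y, sample_weight=weight)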
kazAnova