Can I correct the coefficient standard errors after oversampling my data?

Question

I am trying to fit a fixed effects linear regression to my data and interpret the coefficients. My dataset is imbalanced (~97% negative cases), which made it hard to fit the model and estimate a coefficient for every independent variable, so I used SMOTE to oversample the positive cases, roughly doubling the size of my dataset. I care far more about the coefficient values and their standard errors than about the predictive accuracy of the model: the question I am trying to answer is "what is the effect of x on y?" But because my SMOTE dataset is twice as large as my original dataset, my standard errors are artificially small (overconfident). Is there a way to correct for this, so that I keep the SMOTE coefficient estimates while calculating standard errors based on the original data?

score 0 · Answer 1 · answered Mar 16 '23 at 11:47

One way to correct for this is to recalibrate the predicted probabilities.

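Alternatively, you can fit a weighted regression:

import numpy as np
from sklearn.linear_model import LinearRegression

# original_data_flag is assumed to be a boolean NumPy array: True for original rows, False for SMOTE rows
weights = np.where(original_data_flag, 1/np.mean(original_data_flag), 1/np.mean(~original_data_flag))

lm = LinearRegression()
lm.fit(x, y, sample_weight=weights)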
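
To make the effect of that weighting rule concrete, here is a minimal sketch on made-up data. The 6/4 split between original and synthetic rows, the random features, and the variable names other than `original_data_flag` and `weights` are assumptions for illustration only. It uses plain NumPy to solve the weighted least-squares problem directly, with an explicit intercept column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: 6 original rows followed by 4 synthetic (SMOTE-like) rows.
original_data_flag = np.array([True] * 6 + [False] * 4)

# Same rule as above: weight each row by the inverse of its group's
# share of the combined dataset.
weights = np.where(
    original_data_flag,
    1 / np.mean(original_data_flag),
    1 / np.mean(~original_data_flag),
)

# Total weight carried by each group.
original_total = weights[original_data_flag].sum()
synthetic_total = weights[~original_data_flag].sum()

# Made-up features and response, purely for illustration.
x = rng.normal(size=(10, 2))
y = x @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=10)

# Weighted least squares via the normal equations (X'WX) beta = X'Wy.
X = np.column_stack([np.ones(len(x)), x])  # prepend an intercept column
W = np.diag(weights)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(original_total, synthetic_total)
print(beta)
```

With this rule, the original rows and the synthetic rows each carry the same total weight (here, the total number of rows), regardless of how many rows are in each group. Note that scikit-learn's `LinearRegression` does not report coefficient standard errors, so those would have to be computed separately.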

Next Door Engineer


This is definitely along the lines of what I'm trying to do, but these suggestions are about modifying the predicted probabilities, and I want to modify the standard errors of the coefficient estimates. – cbowers Mar 16 '23 at 21:46
