I am trying to fit a fixed-effects linear regression to my data and interpret the coefficients. My dataset is imbalanced (~97% negative cases), which made it hard to fit the model and estimate a coefficient for every independent variable, so I used SMOTE to oversample the positive cases, roughly doubling the size of the dataset. I care far more about the coefficient values and their standard errors than about the model's predictive accuracy: the question I am trying to answer is "what is the effect of x on y?" But because the SMOTE dataset is twice as large as the original, my standard errors are artificially small (overconfident). Is there a way to correct for this, keeping the SMOTE coefficient estimates while calculating standard errors based on the original data?

cbowers

1 Answer

You can correct for this by recalibrating the predicted probabilities.

Alternatively, you can fit a weighted regression:

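import numpy as np
from sklearn.linear_model import LinearRegression

# original_data_flag: boolean array, True for original rows, False for SMOTE-generated rows.
# Each row is weighted by the inverse of its group's share of the data.
weights = np.where(original_data_flag, 1/np.mean(original_data_flag), 1/np.mean(~original_data_flag))

lm = LinearRegression()
lm.fit(x, y, sample_weight=weights)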
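
Note that scikit-learn's `LinearRegression` returns coefficients but no standard errors. As a rough sketch only, the same weighted fit can be written out in NumPy to also get heteroskedasticity-robust (sandwich, HC0) standard errors. The function name `wls_with_robust_se` and the toy data are made up for illustration, and treating the first 120 rows as "original" is an assumption of the example. Sandwich standard errors still treat rows as independent, so with SMOTE-generated rows they should be read with caution.

```python
import numpy as np

def wls_with_robust_se(X, y, weights):
    """Weighted least squares with HC0 (sandwich) standard errors.

    Illustrative helper, not part of scikit-learn. Returns (beta, se),
    where beta[0] is the intercept.
    """
    X = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    w = np.asarray(weights, dtype=float)
    # "Bread": (X' W X)^-1
    bread = np.linalg.inv(X.T @ (w[:, None] * X))
    beta = bread @ (X.T @ (w * y))
    resid = y - X @ beta
    # "Meat": sum_i (w_i * e_i)^2 * x_i x_i'
    meat = X.T @ ((w * resid)[:, None] ** 2 * X)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))

# Toy data standing in for the SMOTE-augmented design matrix (assumption).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Pretend the first 120 rows are original and the rest are synthetic.
original_data_flag = np.arange(n) < 120
weights = np.where(original_data_flag,
                   1 / np.mean(original_data_flag),
                   1 / np.mean(~original_data_flag))

beta, se = wls_with_robust_se(X, y, weights)
print("coefficients:", beta)
print("robust standard errors:", se)
```

One property of this estimator worth knowing: multiplying all weights by a constant changes neither the coefficients nor the sandwich standard errors, so only the relative weights of original and synthetic rows matter here.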
Next Door Engineer
This is definitely along the lines of what I'm trying to do, but these suggestions are about modifying the predicted probabilities, and I want to modify the standard errors of the coefficient estimates. – cbowers Mar 16 '23 at 21:46