I am trying to fit a fixed effects linear regression to my data and interpret the coefficients. I have an imbalanced dataset (~97% negative cases), which made it hard to fit the model and estimate a coefficient for every independent variable, so I used SMOTE to oversample the positive cases, which roughly doubled the size of my dataset. I care much more about the coefficient values and standard errors than about the predictive accuracy of the model: the question I am trying to answer is "what is the effect of x on y?" But because my SMOTE dataset is twice as large as my original dataset, my standard errors are artificially small (overconfident). Is there a way to correct for this, keeping the SMOTE coefficient estimates while calculating standard errors based on the original data?
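A minimal sketch of the concern, using NumPy only. It assumes exact duplication of every row as a crude stand-in for SMOTE (SMOTE interpolates new points rather than copying them, so the real effect will differ in detail), a continuous toy outcome, and classical OLS standard errors. The helper `ols_with_se` and all data here are hypothetical, not taken from the dataset in the question.

```python
import numpy as np

def ols_with_se(X, y):
    """Return OLS coefficients and classical (homoskedastic) standard errors."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # standard error of each coefficient
    return beta, se

# Hypothetical toy data: intercept plus one regressor.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

beta_orig, se_orig = ols_with_se(X, y)

# Stack an exact copy of the data: no new information is added.
X_dup = np.vstack([X, X])
y_dup = np.concatenate([y, y])
beta_dup, se_dup = ols_with_se(X_dup, y_dup)

# Coefficients are unchanged, but every standard error shrinks by the same factor.
se_ratio = se_dup / se_orig
expected_ratio = np.sqrt((n - X.shape[1]) / (2 * n - X.shape[1]))  # close to 1/sqrt(2)
print(beta_orig, beta_dup)
print(se_ratio, expected_ratio)
```

In this toy case the point estimates are identical before and after duplication, while the reported standard errors drop by roughly 30%, purely because the row count doubled.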
1 Answer
One way to correct for this is to recalibrate the predicted probabilities after fitting on the resampled data.
Alternatively, you can fit a weighted regression:
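    import numpy as np
    from sklearn.linear_model import LinearRegression
    # original_data_flag: boolean array, True for original rows, False for SMOTE rows
    weights = np.where(original_data_flag, 1/np.mean(original_data_flag), 1/np.mean(~original_data_flag))
    lm = LinearRegression()
    lm.fit(x, y, sample_weight=weights)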
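To see what this weighting does, here is a small sketch on made-up data. The group sizes, `x`, and `y` are hypothetical, and the fit is done with plain NumPy weighted least squares instead of scikit-learn, which should give the same point estimates as `LinearRegression` with `sample_weight`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_orig, n_synth = 300, 200  # hypothetical counts of original and synthetic rows
n = n_orig + n_synth

# True for original rows, False for synthetic (SMOTE-style) rows.
original_data_flag = np.concatenate([np.ones(n_orig, dtype=bool),
                                     np.zeros(n_synth, dtype=bool)])
x = rng.normal(size=(n, 2))
y = x @ np.array([0.5, -0.2]) + rng.normal(size=n)

# Same weighting rule as in the answer: inverse of each group's share of the rows.
weights = np.where(original_data_flag,
                   1 / np.mean(original_data_flag),
                   1 / np.mean(~original_data_flag))

# Each group ends up with the same total weight, regardless of its size.
total_original = weights[original_data_flag].sum()
total_synthetic = weights[~original_data_flag].sum()
print(total_original, total_synthetic)

# Weighted least squares: scale rows by sqrt(weight), then solve ordinary least squares.
X = np.column_stack([np.ones(n), x])  # intercept column plus regressors
sw = np.sqrt(weights)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
print(coef)  # intercept followed by the two slopes
```

The weights equalize the total influence of the original and synthetic rows on the fit. Note that this produces coefficient estimates only; it does not report standard errors.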

Next Door Engineer
This is definitely along the lines of what I'm trying to do, but these suggestions are about modifying the predicted probabilities, and I want to modify the standard errors of the coefficient estimates. – cbowers Mar 16 '23 at 21:46