I'm fitting a linear regression, using the recipes package for preprocessing, to predict salary from rank (assistant professor, associate professor, or full professor), sex, discipline (applied or theoretical), years of service, and years since PhD. The data set is Salaries from the carData package.
I've created dummy variables and log-transformed the outcome variable to bring its distribution closer to normal. I've also rescaled years of service and years since PhD to values between 0 and 1.
salary.split <- initial_split(salary.df)
sal.train <- training(salary.split)
sal.test <- testing(salary.split)
sal.recipe <- recipe(salary ~ ., data = salary.df) %>%
  step_log(salary) %>%
  step_dummy(all_nominal()) %>%
  step_range(yrs.since.phd) %>%
  step_range(yrs.service)
sal.rec <- prep(sal.recipe, training = sal.train) %>% bake(new_data = sal.train)
sal.lm <- lm(sal.rec)
summary(sal.lm)
The results of the summary:
Call:
lm(formula = sal.rec)

Residuals:
     Min       1Q   Median       3Q      Max
-0.17727 -0.05780 -0.01406  0.04221  0.34499

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.3052564  0.3240025  -0.942  0.34690
yrs.service    0.8054404  0.0292577  27.529  < 2e-16 ***
salary         0.0375859  0.0285323   1.317  0.18877
rank_AsstProf -0.0528260  0.0184926  -2.857  0.00459 **
rank_Prof      0.0740925  0.0174977   4.234 3.08e-05 ***
discipline_B  -0.0438070  0.0107863  -4.061 6.28e-05 ***
sex_Male       0.0006626  0.0165779   0.040  0.96815
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08639 on 291 degrees of freedom
Multiple R-squared:  0.8656,	Adjusted R-squared:  0.8628
F-statistic: 312.2 on 6 and 291 DF,  p-value: < 2.2e-16
When I look at the variable information (sal.recipe$var_info), I get:
# A tibble: 6 x 4
  variable      type    role      source
  <chr>         <chr>   <chr>     <chr>
1 rank          nominal predictor original
2 discipline    nominal predictor original
3 yrs.since.phd numeric predictor original
4 yrs.service   numeric predictor original
5 sex           nominal predictor original
6 salary        numeric outcome   original
which shows salary as an outcome, not a predictor. Why is salary showing up as a coefficient when I look at the summary information for the linear model?
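One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: when lm() receives a data frame instead of a formula, R builds the formula from the column order and treats the first column as the response. The summary above lists yrs.service and salary among the coefficients but not yrs.since.phd, which would be consistent with yrs.since.phd being the first column of the baked data. The snippet below assumes salary.df is the Salaries data from carData; the name sal.baked and the seed are introduced only for illustration.

```r
library(recipes)
library(rsample)

# Assumption: salary.df is the Salaries data set shipped with carData.
data("Salaries", package = "carData")
salary.df <- Salaries

set.seed(1)  # illustrative seed, not from the original post
salary.split <- initial_split(salary.df)
sal.train <- training(salary.split)

sal.recipe <- recipe(salary ~ ., data = sal.train) %>%
  step_log(salary) %>%
  step_dummy(all_nominal()) %>%
  step_range(yrs.since.phd) %>%
  step_range(yrs.service)

# Hypothetical name for the baked training data.
sal.baked <- prep(sal.recipe, training = sal.train) %>%
  bake(new_data = sal.train)

# Inspect the column order; the first column is what lm(sal.baked) would
# use as the response.
names(sal.baked)

# The formula R derives from a bare data frame (first column ~ the rest).
formula(sal.baked)

# Stating the outcome explicitly removes the dependence on column order.
sal.lm <- lm(salary ~ ., data = sal.baked)
summary(sal.lm)
```

If names(sal.baked) does not start with salary, that would explain the output: the recipe's outcome role is not carried over to lm(), so the model formula has to name the outcome itself.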