I'm trying to understand how to use categorical variables in a linear regression in R. I have some insurance data with a categorical variable for coverage type (Basic, Extended, and Premium). When I run a simple linear regression:
summary(lm(Customer.Lifetime.Value ~ Coverage, data = ins)) #basic sig but might not include
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)          8498       2829   3.004   0.0198 *
CoverageExtended    -1314       5658  -0.232   0.8230
CoveragePremium      1554       5658   0.275   0.7915
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The intercept is statistically significant, which I take to mean that the level not shown as a coefficient, "Basic", is significant.
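To see which level R is treating as the baseline here, I also looked at the factor levels and the design matrix that lm() builds. This is just a sketch using my column names; relevel() is something I found while reading about factors:

# Coverage should be a factor; its first level becomes the reference level
ins$Coverage <- factor(ins$Coverage)
levels(ins$Coverage)  # "Basic" "Extended" "Premium" in my data

# The design matrix shows the dummy coding lm() uses: Basic rows get 0 in
# both dummy columns, so the intercept is the mean for the Basic group
head(model.matrix(~ Coverage, data = ins))

# relevel() would change which group the intercept describes, e.g.:
# ins$Coverage <- relevel(ins$Coverage, ref = "Premium")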
However, when I try to isolate that level with a dummy variable and rerun the model:
ins$isBasic <- ifelse(ins$Coverage == "Basic", 1, 0)
summary(lm(Customer.Lifetime.Value ~ isBasic, data = ins))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8617.8     3280.5   2.627   0.0303 *
isBasic       -119.8     4235.1  -0.028   0.9781
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now Basic coverage is no longer significant, but the new intercept is, which I read as meaning the other coverage levels combined are significant.
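To check what that intercept actually corresponds to, I compared it against the raw group means (same column names as above; I believe the intercept in this dummy model should match the mean for the isBasic == 0 group):

# Mean CLV for non-Basic customers: should match the intercept (8617.8)
mean(ins$Customer.Lifetime.Value[ins$isBasic == 0])
# Mean CLV for Basic customers: intercept + isBasic coefficient (about 8498)
mean(ins$Customer.Lifetime.Value[ins$isBasic == 1])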
I'm new to this type of regression and am not sure whether I'm coding the categorical variable or selecting the features correctly.
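For what it's worth, this is how I was planning to test whether Coverage matters overall, rather than level by level, if I understand anova() correctly:

fit <- lm(Customer.Lifetime.Value ~ Coverage, data = ins)
anova(fit)  # one F test for the Coverage factor as a whole
# Same comparison, written as full model vs. intercept-only model
anova(lm(Customer.Lifetime.Value ~ 1, data = ins), fit)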