
I'm trying to understand how to use categorical variables in a linear regression in R. I have some insurance data with a categorical coverage-type variable (Basic, Extended and Premium). When I run a simple linear regression:

summary(lm(Customer.Lifetime.Value ~ Coverage, data = ins)) #basic sig but might not include

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)  
(Intercept)          8498       2829   3.004   0.0198 *
CoverageExtended    -1314       5658  -0.232   0.8230  
CoveragePremium      1554       5658   0.275   0.7915  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The intercept is statistically significant, which means the level not represented in the output, "Basic", is significant.

However, when I try to isolate that level as a dummy variable and rerun the model:

ins$isBasic <- ifelse(ins$Coverage == "Basic", 1, 0)

summary(lm(Customer.Lifetime.Value ~ isBasic, data = ins))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   8617.8     3280.5   2.627   0.0303 *
isBasic       -119.8     4235.1  -0.028   0.9781  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now the Basic policy is no longer significant, but the new intercept is, meaning the other levels combined are significant.

I'm new to this type of regression and not sure if I'm coding or selecting the features correctly.

Josh Ortega
  • Can you please check your code? Your second regression formula contains the variable "Gender", but it is not shown in the output. So I guess you made a mistake when copying the results over to this post. – deschen Nov 30 '20 at 23:00
  • @deschen my apologies. I meant to just run it with isBasic. I know all 3 levels of Gender are significant, but when I add Gender, it renders everything insignificant. I just want to make sure I'm coding things correctly and interpreting significance correctly. – Josh Ortega Nov 30 '20 at 23:03
  • The intercept significance is compared to zero. The factor levels' significance is compared to the reference category (which, as you properly said, is equal to the intercept if you have only one factor as predictor). So, in the first model you are comparing basic to zero. In the second, you are comparing basic to premium and extended combined. There is no right way. It depends on your research hypothesis. That said, this is not a code question, and would be more suitable for Cross Validated. – LuizZ Nov 30 '20 at 23:09

1 Answer


Your assumptions about what the model means aren't quite right. The intercept's p value only tells you that the intercept is statistically different from 0.

Suppose I take 30 numbers drawn from a normal distribution with a mean of 1:

set.seed(1)
y <- rnorm(30, 1)

Now suppose each of these is associated with a factor level - "A", "B" or "C":

df <- data.frame(x = factor(rep(c("A", "B", "C"), 10)), y = y)

Now let's do the regression of y on x and examine the model summary:

summary(lm(y ~ x, data = df))
#> 
#> Call:
#> lm(formula = y ~ x, data = df)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.2814 -0.5159  0.1899  0.6256  1.4716 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  1.12371    0.30270   3.712 0.000943 ***
#> xB          -0.05706    0.42809  -0.133 0.894958    
#> xC          -0.06671    0.42809  -0.156 0.877332    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.9572 on 27 degrees of freedom
#> Multiple R-squared:  0.00105,    Adjusted R-squared:  -0.07295 
#> F-statistic: 0.01418 on 2 and 27 DF,  p-value: 0.9859
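
With a single factor as the only predictor, these coefficients are just group means in disguise: the intercept is the mean of the reference group "A", and each remaining coefficient is the difference between a group's mean and that reference. A quick check, using nothing beyond the df built above:

group_means <- with(df, tapply(y, x, mean))
group_means["A"]                     # ~1.124  - matches the intercept
group_means["B"] - group_means["A"]  # ~-0.057 - matches the xB estimate
group_means["C"] - group_means["A"]  # ~-0.067 - matches the xC estimate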

We can see that the mean of the group of y values associated with the factor level "A" is significantly different from 0 - that's all the intercept's p value tells us. The coefficients for the factor levels "B" and "C" are estimated as differences from that intercept, and neither difference is significantly different from zero, meaning that the y values associated with "B" and "C" are not significantly different from the y values associated with "A". In other words, there is no significant difference between your factor levels.

If you want to know whether the groups within x are associated with significantly different values of y overall, you can try anova:

anova(lm(y ~ x, data = df))
#> Analysis of Variance Table
#> 
#> Response: y
#>           Df Sum Sq Mean Sq F value Pr(>F)
#> x          2  0.026  0.0130  0.0142 0.9859
#> Residuals 27 24.740  0.9163

where we see that the proportion of the overall variance in y accounted for by the variance between the means of the three groups in x is tiny, and gives us no reason to reject the null hypothesis that there is no relationship between x and y - which is what we expect, given how this data set was generated.
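
Applied back to your insurance data, the equivalent overall test would look something like this (assuming the ins data frame from your question, with Coverage stored as a factor):

anova(lm(Customer.Lifetime.Value ~ Coverage, data = ins))

If you instead want the pairwise comparisons in summary() measured against a different reference level, relevel the factor first, e.g. ins$Coverage <- relevel(factor(ins$Coverage), ref = "Premium"). As noted in the comments, there is no single right reference; it depends on your research hypothesis.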

Allan Cameron