How does R handle NA values vs. deleted values in regressions?

Question

Say I have a table, remove all the inapplicable values, and run a regression. If I ran the exact same regression on the same table, but this time turned the inapplicable values into NA instead of removing them, would the regression still give me the same coefficients?

Yes. The regression would omit any NA values anyway (i.e. deleting them before doing the analysis). You can check this by comparing the degrees of freedom for both models. — deschen, Feb 07 '21 at 23:40
To be more precise: any row containing at least one NA in any of the predictor or outcome variables will be dropped prior to the analysis. — deschen, Feb 07 '21 at 23:54

score 2 · Accepted Answer · answered Feb 08 '21 at 07:46

With R's default settings (na.action = na.omit), the regression omits missing values before doing the analysis, i.e. it drops any row that contains an NA in any of the predictor variables or the outcome variable. You can check this by comparing the degrees of freedom and other statistics for both models.
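As a minimal sketch of that check, assuming a default R session: the snippet below works on a fresh copy of the built-in data, and the object names (d, fit) and the positions of the missing values are made up for illustration. nobs(), df.residual() and na.action() report how many rows were used, the residual degrees of freedom, and which rows were dropped.

```r
# Work on a fresh copy so the built-in data set is left untouched
d <- datasets::mtcars

# Hypothetical missing values: blank out hp in rows 3 and 10
d$hp[c(3, 10)] <- NA

# The setting lm() falls back on when no na.action argument is given
# ("na.omit" unless the session options have been changed)
getOption("na.action")

fit <- lm(mpg ~ cyl + hp, data = d)

nobs(fit)         # rows actually used in the fit: 30 of 32
df.residual(fit)  # rows used minus number of coefficients: 27
na.action(fit)    # the dropped rows, labelled by row name
```

If the dropped rows should still show up (as NA) in residuals and fitted values, lm() also accepts na.action = na.exclude; the coefficients are the same either way.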

Here's a toy example:

head(mtcars)

# check the data set size (all non-missings)
dim(mtcars) # has 32 rows

# Introduce some missings
set.seed(5)
mtcars[sample(1:nrow(mtcars), 5), sample(1:ncol(mtcars), 5)] <- NA

head(mtcars)

# Create an alternative where all missings are omitted
mtcars_NA_omit <- na.omit(mtcars)

# Check the data set size again
dim(mtcars_NA_omit) # Now only has 27 rows

# Now compare some simple linear regressions
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars))
summary(lm(mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit))

Comparing the two summaries, you can see that they are identical, with one exception: the output for the first model includes a note that 5 observations were deleted due to missingness, which is exactly what we did manually when creating mtcars_NA_omit.

# First, original model

Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0835 -1.7594 -0.2023  1.4313  5.6948 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.64284    7.02359   4.220 0.000352 ***
cyl         -1.04494    0.83565  -1.250 0.224275    
hp          -0.03913    0.01918  -2.040 0.053525 .  
am           4.02895    1.90342   2.117 0.045832 *  
gear         0.31413    1.48881   0.211 0.834833    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.947 on 22 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.7998,    Adjusted R-squared:  0.7635 
F-statistic: 21.98 on 4 and 22 DF,  p-value: 2.023e-07

# Second model where we dropped missings manually    

Call:
lm(formula = mpg ~ cyl + hp + am + gear, data = mtcars_NA_omit)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0835 -1.7594 -0.2023  1.4313  5.6948 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 29.64284    7.02359   4.220 0.000352 ***
cyl         -1.04494    0.83565  -1.250 0.224275    
hp          -0.03913    0.01918  -2.040 0.053525 .  
am           4.02895    1.90342   2.117 0.045832 *  
gear         0.31413    1.48881   0.211 0.834833    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.947 on 22 degrees of freedom
Multiple R-squared:  0.7998,    Adjusted R-squared:  0.7635 
F-statistic: 21.98 on 4 and 22 DF,  p-value: 2.023e-07
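One caveat worth keeping in mind: lm() only drops rows with an NA in the variables named in the formula, whereas na.omit() on a whole data frame drops rows with an NA in any column. The two approaches match in the example above because the same 5 rows are affected either way. The following is a small sketch of a case where they would differ; the object names and the NA positions are hypothetical and chosen only for illustration.

```r
d <- datasets::mtcars

# Hypothetical missing values: one in a model variable, one in an unused column
d$hp[3]   <- NA   # hp appears in the formula below
d$qsec[7] <- NA   # qsec does not

# lm() only looks at the variables named in the formula
fit_na <- lm(mpg ~ cyl + hp, data = d)

# na.omit() on the whole data frame also removes the row with the qsec NA
fit_all <- lm(mpg ~ cyl + hp, data = na.omit(d))

# Restricting na.omit() to the model variables reproduces what lm() does
fit_sub <- lm(mpg ~ cyl + hp, data = na.omit(d[, c("mpg", "cyl", "hp")]))

c(nobs(fit_na), nobs(fit_all), nobs(fit_sub))  # 31 30 31
all.equal(coef(fit_na), coef(fit_sub))         # TRUE
```

So the answer to the question is "yes" as long as the manually deleted rows are exactly the rows that have an NA in one of the model variables.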
