
I was building a logistic regression model in R, but when I checked the coefficients with summary(model), the output showed NA in all four columns (Estimate, Std. Error, z value and Pr(>|z|)) for one of my independent variables. My other three variables were estimated fine.

I also checked for missing values, but there were none. I tried switching the variable between continuous and discrete representations with as.numeric and as.integer, but it still comes out as NA in the output. The variable itself measures the total volume of blood donated.

I can't figure this out and it is bothering me. Thanks

Poly
  • Relevant posts: [Logistic regression in R returning NA values](https://stats.stackexchange.com/questions/25839/logistic-regression-in-r-returning-na-values) and [NA in glm model](https://stats.stackexchange.com/questions/212903/na-in-glm-model) – Maurits Evers Apr 01 '18 at 12:39
  • 1
  • tl;dr: Collinearity between predictor variables will result in `NA` "estimates" for (some of the) predictor variables that are linearly dependent. – Maurits Evers Apr 01 '18 at 12:44

1 Answer


Here is an example elaborating on the comment I made above. I'm using a simple linear model, but the same principle applies to your logistic regression model.

  1. Let's generate some sample data from the model y = x1 + x2 + epsilon, where the two predictor variables x1 and x2 are linearly dependent: x2 = 2.5 * x1.

    # Generate sample data
    set.seed(2017);
    x1 <- seq(1, 100);
    x2 <- 2.5 * x1;
    y <- x1 + x2 + rnorm(100);
    
  2. We fit the model.

    df <- cbind.data.frame(x1 = x1, x2 = x2, y = y);
    fit <- lm(y ~ x1 + x2, df);
    
  3. Look at parameter estimates.

    summary(fit);
    #
    #Call:
    #lm(formula = y ~ x1 + x2, data = df)
    #
    #Residuals:
    #     Min       1Q   Median       3Q      Max
    #-2.50288 -0.75360 -0.01388  0.67935  3.08515
    #
    #Coefficients: (1 not defined because of singularities)
    #            Estimate Std. Error t value Pr(>|t|)
    #(Intercept) 0.166567   0.215534   0.773    0.441
    #x1          3.496831   0.003705 943.719   <2e-16 ***
    #x2                NA         NA      NA       NA
    #---
    #Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    #Residual standard error: 1.07 on 98 degrees of freedom
    #Multiple R-squared:  0.9999,   Adjusted R-squared:  0.9999
    #F-statistic: 8.906e+05 on 1 and 98 DF,  p-value: < 2.2e-16
    

You can see that the estimates for x2 are NA. This is a direct consequence of x1 and x2 being linearly dependent: x2 is redundant, and the data can be described by the estimated linear model y = 3.4968 * x1 + epsilon. This agrees well with the theoretical model y = x1 + 2.5 * x1 + epsilon = 3.5 * x1 + epsilon.
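As a quick follow-up (a minimal sketch, not part of the original fit; the names y_bin and fit_glm are only for illustration), you can verify the linear dependence directly and confirm that glm() treats it the same way:

    # The two predictors are perfectly correlated
    cor(df$x1, df$x2);
    #[1] 1
    
    # alias() reports the terms that lm() dropped because of
    # singularities, expressed in terms of the retained predictors
    # (here: x2 = 2.5 * x1)
    alias(fit);
    
    # The same NA pattern appears with a logistic regression:
    # simulate a binary outcome driven by x1 and refit with glm()
    set.seed(2018);
    df$y_bin <- rbinom(100, 1, plogis(0.05 * x1 - 2.5));
    fit_glm <- glm(y_bin ~ x1 + x2, family = binomial, data = df);
    summary(fit_glm);   # the x2 row again shows only NAs

Dropping the redundant predictor, e.g. lm(y ~ x1, df), gives the same fit without the NA row.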

Maurits Evers
  • 49,617
  • 4
  • 47
  • 68
  • Wow, thank you! You're right: I checked again and it is definitely redundant, because the variable had a perfect correlation of 1 with another variable, which I must have missed because I thought it was correlated against itself. That's embarrassing... I should probably remove this thread. Thanks! – Poly Apr 01 '18 at 14:36
  • No problem @Poly; general SO practice is to *not* remove posts. They might be useful (and get referenced) in future questions. You should however accept the solution by setting the check-mark next to the solution to mark the question as closed. – Maurits Evers Apr 01 '18 at 14:38