
I am trying to fit a zero-inflated negative binomial model to my data, but one of my three independent variables (Exposure) seems to be causing NaNs to be produced when the standard errors are calculated in the summary function (see the warning at the very end of the output below). I have also tried a negative binomial hurdle model and ran into a similar issue.

> library(pscl)
> str(eggTreat)
'data.frame':   455 obs. of  4 variables:
 $ Exposure : Factor w/ 2 levels "C","E": 2 2 2 2 2 2 2 2 2 2 ...
 $ hi_lo    : Factor w/ 2 levels "hi","lo": 2 2 2 2 2 2 2 2 2 2 ...
 $ Egg_count: int  0 0 0 0 0 0 0 0 0 0 ...
 $ Food     : Factor w/ 2 levels "1.5A5YS","5ASMQ": 2 2 2 2 2 2 2 2 2 2 ...
> mod.zeroinfl <- zeroinfl(Egg_count ~ Food + Exposure + hi_lo | Food + Exposure + hi_lo,
+                          data = eggTreat, dist = "negbin")
> summary(mod.zeroinfl)

Call:
zeroinfl(formula = Egg_count ~ Food + Exposure + hi_lo | Food + Exposure + hi_lo, data = eggTreat, dist = "negbin")

Pearson residuals:
     Min       1Q   Median       3Q      Max 
-0.65632 -0.47163 -0.28588  0.02976  9.00804 

Count model coefficients (negbin with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.04435    0.14393  -0.308   0.7580    
Food        -1.12486    0.22267  -5.052 4.38e-07 ***
Exposure    -2.34990    0.38684  -6.075 1.24e-09 ***
hi_lo       -0.44893    0.19524  -2.299   0.0215 *  
Log(theta)  -0.24387    0.22639  -1.077   0.2814    

Zero-inflation model coefficients (binomial with logit link):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.830e+01         NA      NA       NA
Food        -5.768e+00  5.628e+04       0        1
Exposure     4.612e-01         NA      NA       NA
hi_lo       -7.477e+00  9.963e+05       0        1
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Theta = 0.7836 
Number of iterations in BFGS optimization: 21 
Log-likelihood: -350.2 on 9 Df
Warning message:
In sqrt(diag(object$vcov)) : NaNs produced
For reference, this is where the warning originates inside the summary method (pscl:::summary.zeroinfl):

function (object, ...) 
{
  object$residuals <- residuals(object, type = "pearson")
  kc <- length(object$coefficients$count)
  kz <- length(object$coefficients$zero)
  se <- sqrt(diag(object$vcov))  # <- NaNs produced here
  ...
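A quick check (not a fix): the NaNs mean some diagonal entries of the stored covariance matrix are negative or otherwise degenerate, so sqrt() fails on them. The matrix can be inspected directly:

> diag(mod.zeroinfl$vcov)  # negative entries line up with the NA standard errors
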
alk20

2 Answers


This problem is typically caused by complete separation. Searching for that term, or for the Hauck-Donner effect, will show you that some linear combination of your predictor values perfectly separates the zeros from the non-zeros. Since the predictor variables in your zero-inflation model are all categorical, this means there is some combination of categories for which all of the responses are zero (or all are non-zero).

I would take a look at with(eggTreat, table(Egg_count > 0, Food, Exposure, hi_lo)) (arrange the arguments in whatever order makes the table easiest to read).

Typical symptoms include:

  • large values of the parameters (e.g. |beta|>10); in this case your intercept is -18.3, which gives a predicted zero-inflation probability of 1e-8 in the baseline category (two of the other values are also large, although not nearly as extreme as the intercept)
  • extremely large standard errors (Food, hi_lo), leading to z-values that are effectively zero and p-values of effectively 1
  • or ... the NA values you're seeing
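
A minimal toy illustration of those symptoms (made-up data, not the data from the question): when the zeros occur only in one category, a logistic fit of zero vs. non-zero shows exactly these huge coefficients and standard errors.

> set.seed(1)
> toy <- data.frame(Exposure = factor(rep(c("C", "E"), each = 30)),
+                   y = c(rpois(30, 3) + 1, rep(0, 30)))  # zeros only when Exposure == "E"
> summary(glm(I(y == 0) ~ Exposure, family = binomial, data = toy))
# expect |Estimate| > 10, an enormous Std. Error, and a warning about
# fitted probabilities numerically 0 or 1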

There are various solutions to this problem:

  • different forms of regularization or Bayesian priors
  • compute the p-values using model comparison / likelihood-ratio tests (see the sketch below)
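
For the likelihood-ratio route, a sketch using the model from the question (this assumes the lmtest package; the idea is to drop the term of interest from the zero-inflation part and compare fits, which avoids the unreliable Wald standard errors entirely):

> library(lmtest)
> full <- zeroinfl(Egg_count ~ Food + Exposure + hi_lo | Food + Exposure + hi_lo,
+                  data = eggTreat, dist = "negbin")
> reduced <- zeroinfl(Egg_count ~ Food + Exposure + hi_lo | Food + hi_lo,
+                     data = eggTreat, dist = "negbin")
> lrtest(reduced, full)  # LR test for Exposure in the zero-inflation part
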
Ben Bolker

An answer to a similar problem has been posted on Cross Validated (CrossValidated-NA-ZINB), but what I found useful was to re-scale my variables. For example, I had the number of hectares of forest in a village, num_hec, ranging over [0, 100000]; I built a new variable on the scale of hundreds of square kilometres (dividing by 10,000), num_hec_100sqkm, ranging over [0, 10]. When I used the latter, the NaNs shown for Std. Error, z value and Pr(>|z|) turned into actual numbers.

Here I quote Achim Zeileis, author of zeroinfl (post on Narkive):

"The simple solution is to scale the problematic regressor so that its coefficient is reasonably sized (e.g., divide by 1000 or 10000 or so)."
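
A sketch of that rescaling (illustrative only; dat, y and the forest variables stand in for your own data):

> dat$num_hec_100sqkm <- dat$num_hec / 10000  # hectares -> hundreds of sq km
> mod <- zeroinfl(y ~ num_hec_100sqkm | num_hec_100sqkm,
+                 data = dat, dist = "negbin")
> summary(mod)  # Std. Error, z value, Pr(>|z|) should now be finite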

terraviva