perfect multicollinearity in glm

Question

I would like to know how to solve the problem of perfect multicollinearity in a GLM that I fit in R. I want to see whether morphological measures can predict a bird's arrival day in its territory, so I have tarsus, wing, and tail measurements. I also want to see the difference between males and females.

So, I'm using the code:

myggod <- glm(day_territory ~ sex * (Right_tarsus + Right_wing + 
                  Tail_length), data = territory, family = "poisson")

which gives the following output:

                         Estimate Std. Error z value Pr(>|z|)   
(Intercept)             19.626581  17.831173   1.101  0.27103   
sexfemale              -14.645707  17.852832  -0.820  0.41201   
sexmale                -12.343274  17.835662  -0.692  0.48890   
Right_tarsus            -0.920874   1.233841  -0.746  0.45546   
Right_wing              -0.007466   0.016571  -0.451  0.65233   
Tail_length             -0.043216   0.013195  -3.275  0.00106 **
sexfemale:Right_tarsus   0.883152   1.234115   0.716  0.47423   
sexmale:Right_tarsus     0.846497   1.233209   0.686  0.49245   
sexfemale:Right_wing     0.018863   0.020855   0.904  0.36574   
sexmale:Right_wing             NA         NA      NA       NA   
sexfemale:Tail_length    0.021428   0.015584   1.375  0.16911   
sexmale:Tail_length            NA         NA      NA       NA

So I seem to have perfect multicollinearity in the males' tail and wing terms.

I have already tried scaling and centering (scale with center = TRUE), subtracting the mean from each measure, log-transforming the measures, and using a PC1 from a PCA of wing and tail.

Nothing worked: I get the same issue with all of these methods. Even when both measures are replaced by the PC1, the same NAs appear.

So, how can I solve it?

I doubt this is “perfect collinearity”. I suspect it’s more a case of overparameterisation in your model, in which case the only thing you need to do is understand your model; it’s not a problem. But without seeing your data it’s impossible to be sure. — Limey, Apr 03 '23 at 22:25
Hello Limey, thank you so much. Here is a simplified version of the data, containing just the variables that I'm checking: https://drive.google.com/file/d/1OMaVfeUipRsa1njydYTAgls9pPCVFCdD/view?usp=sharing — Tarso Ciolete, Apr 04 '23 at 00:27

Len Greski · Answer 1 · 2023-04-05T00:00:00.407

We can eliminate the overparameterization problem by removing the interaction effects from the model.

# Download the CSV shared in the question's comments into ./data
if (!dir.exists("./data")) dir.create("./data")
download.file("https://drive.google.com/uc?export=download&id=1OMaVfeUipRsa1njydYTAgls9pPCVFCdD",
              "./data/bird_stats.csv", mode = "w")

# The file is semicolon-delimited
df <- read.csv("./data/bird_stats.csv", sep = ";")

# Main effects only: no sex-by-measurement interaction terms
aModel <- glm(day_territory ~ sex + Right_tarsus + Right_wing + Tail_length,
              data = df, family = "poisson")
summary(aModel)

...and the output:

Call:
glm(formula = day_territory ~ sex + Right_tarsus + Right_wing + 
    Tail_length, family = "poisson", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.4278  -1.5146  -0.4210   0.9837   6.6771  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   6.010191   0.670813   8.960  < 2e-16 ***
sexfemale     0.163442   0.098671   1.656  0.09763 .  
sexmale      -0.167495   0.102225  -1.638  0.10132    
Right_tarsus -0.056499   0.019188  -2.944  0.00323 ** 
Right_wing    0.002091   0.009921   0.211  0.83311    
Tail_length  -0.030275   0.006944  -4.360  1.3e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 712.81  on 101  degrees of freedom
Residual deviance: 493.95  on  96  degrees of freedom
  (83 observations deleted due to missingness)
AIC: 1085.5

Number of Fisher Scoring iterations: 4

The AIC of the overparameterized model is 1087.8, versus 1085.5 for this one, so by that criterion the model with fewer parameters is slightly better than the overparameterized one.
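As for why the interaction model returns NA rows: the coefficient table in the question lists both sexfemale and sexmale, which suggests that sex has a third level acting as the reference. If that level has only a handful of complete rows, its own intercept and slopes cannot all be estimated, and glm() reports the aliased terms as NA. The following is a minimal sketch on invented data, not the bird data; the level name "unknown" and the short column names are assumptions made purely for illustration.

```r
set.seed(42)

# Invented toy data: "unknown" is a hypothetical third sex level with 2 rows
n_per_group <- c(unknown = 2, female = 40, male = 40)
toy <- data.frame(
  sex = factor(rep(names(n_per_group), times = n_per_group),
               levels = names(n_per_group))  # "unknown" becomes the reference
)
n <- nrow(toy)
toy$tarsus <- rnorm(n, mean = 19, sd = 1)
toy$wing   <- rnorm(n, mean = 70, sd = 2)
toy$tail   <- rnorm(n, mean = 55, sd = 3)
toy$day    <- rpois(n, lambda = 20)

# Same structure as the model in the question: sex interacts with every measure
fit <- glm(day ~ sex * (tarsus + wing + tail), data = toy, family = poisson)

# The reference level has 2 rows, but the model asks for 4 parameters
# (an intercept and 3 slopes) for it
print(table(toy$sex))

# Coefficients that could not be estimated are returned as NA
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```

Running table(df$sex) on the real data would show whether something similar is happening there.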

Note that almost half the observations in the data frame (83 of 185) were deleted from the analysis due to missing values. You'll need to review the missing data and decide on a strategy for imputing it, or collect more data, to assess whether the sex variable is meaningful.
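A quick way to start that review is to count the missing values per column and the rows that survive listwise deletion. This is a minimal sketch on an invented data frame; the column names follow the question, but every value is made up for illustration.

```r
# Hypothetical stand-in for the bird data (values invented)
toy <- data.frame(
  sex           = c("male", "female", "male", NA, "female", "male"),
  Right_tarsus  = c(19.1, NA, 18.7, 19.4, 18.9, 19.0),
  Right_wing    = c(71, 69, NA, 70, 68, 72),
  Tail_length   = c(55, 53, 56, NA, 52, 57),
  day_territory = c(3, 10, 5, 8, NA, 2)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows with no missing value in any column: these are the rows glm() keeps
# under the default na.action (na.omit) when all columns are in the model
n_complete <- sum(complete.cases(toy))
print(n_complete)

# Group sizes, with NA shown as its own category
sex_counts <- table(toy$sex, useNA = "ifany")
print(sex_counts)
```

The same three calls can be applied to df from the code above.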

Also, the dependent variable in a Poisson model is typically a count, and from the original question it's hard to understand why a Poisson model is being used here. That is, if the variables Right_tarsus, Right_wing, and Tail_length are size measurements of birds, why would size measurements predict counts?

If the dependent variable is the day of arrival in a specific location, a Poisson model probably isn't the right model.
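One rough check that bears on this uses only the numbers in the summary output above: when the Poisson assumption holds, the residual deviance is expected to be roughly equal to its degrees of freedom. A minimal sketch of that arithmetic follows; the variable names are just for illustration.

```r
# Values copied from the summary() output above
residual_deviance <- 493.95
residual_df       <- 96

# A ratio well above 1 hints at overdispersion relative to the Poisson assumption
dispersion_ratio <- residual_deviance / residual_df
print(round(dispersion_ratio, 2))
```

The ratio here is about 5. When it is that far above 1, family = quasipoisson in glm() is one commonly used alternative to look at, although whether it suits arrival-day data is a separate question.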

I used Poisson because the variable day_territory is a count: it is the day on which each bird arrives in the territory, relative to the first day of the season. It is impossible for me to collect more data. My idea is to see whether there is any morphological pattern in these arrival dates; however, I need to separate males from females, since the ecological pressures acting on them are different. — Tarso Ciolete, Apr 04 '23 at 20:27
@TarsoCiolete - Days since the season began isn't a "count" unless aggregated into 5 birds arrived on day 1, 20 birds arrived on day 2, etc. Poisson variables are used to model rates, and your dependent variable isn't a rate given how you described it in the comment. — Len Greski, Apr 04 '23 at 23:58
