
My GLM output shows coefficients for both "Yes" and "No", which is wrong. The GLM function normally dummy-codes binary factors automatically, so that only one of the levels has a coefficient.

So in this case it should provide the coefficient for "Yes", while "No" should not have a coefficient, as it is the reference level.

I have not had this issue with any other similarly coded independent variables, so there seems to be something about this specific sequence of Yes, No and NA. Why is it doing this?

# glm.nb() comes from the MASS package
library(MASS)

# Generate a specific sequence of Yes and No
c <- replicate(5, "No")
d <- c("Yes", "No", "Yes", "No", "NA", "Yes")

# Concatenate into a data frame and generate the dependent variable f
df <- data.frame(e = c(c, d),
                 f = sample(c(0, 1, 2, 3, 4), 11, replace = TRUE, prob = NULL))

# Convert e to a factor
df$e <- as.factor(df$e)

nbd_attend <- glm.nb(f ~ e, data = df)
summary(nbd_attend)

CK7

1 Answer


You've included "NA" as a string in your data, not the special missing value NA. If you instead used the unquoted NA, as in the second line here:

d <- c("Yes", "No", "Yes", "No", "NA", "Yes")  # bad
d <- c("Yes", "No", "Yes", "No", NA, "Yes")    # good

Then it would work.
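If the data frame already exists with the literal string in it, the same fix can be applied after the fact. This is a minimal sketch, assuming a column shaped like the one in the question; the data frame here is rebuilt by hand for illustration, and `stringsAsFactors = FALSE` is set explicitly so the column starts out as character on any R version.

```r
# Illustrative data frame mirroring the question's `e` column
df <- data.frame(
  e = c(rep("No", 5), "Yes", "No", "Yes", "No", "NA", "Yes"),
  stringsAsFactors = FALSE
)

# Replace the literal string "NA" with a true missing value,
# then build the factor from the cleaned character vector
df$e[df$e == "NA"] <- NA
df$e <- factor(df$e)

levels(df$e)
# [1] "No"  "Yes"
sum(is.na(df$e))
# [1] 1
```

The replacement is done before the column becomes a factor; once a factor has an "NA" level, setting the values to NA leaves the unused level behind until `droplevels()` is called.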

Basically, you made a factor with three levels, and "NA" sorts first alphabetically, so it became the reference level. That is why both "No" and "Yes" get their own coefficients.

levels(df$e)
# [1] "NA"  "No"  "Yes"
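To see the effect on the model itself, the two codings can be fitted side by side. This is only a sketch under stated assumptions: the counts in `f` are made up, and a Poisson `glm()` from base R stands in for `glm.nb()` so the example needs no extra package; the way factor levels turn into coefficients is the same for both.

```r
# Predictor values as in the question, with "NA" as a literal string
e_chr <- c(rep("No", 5), "Yes", "No", "Yes", "No", "NA", "Yes")
f <- c(1, 0, 2, 3, 1, 2, 0, 4, 1, 3, 2)  # made-up counts for illustration

# Three-level factor: "NA" is an ordinary level and becomes the reference
bad <- data.frame(f = f, e = factor(e_chr))

# Two-level factor: the string is converted to a real missing value first
good <- data.frame(f = f, e = factor(replace(e_chr, e_chr == "NA", NA)))

bad_names <- names(coef(glm(f ~ e, family = poisson, data = bad)))
good_names <- names(coef(glm(f ~ e, family = poisson, data = good)))

bad_names
# [1] "(Intercept)" "eNo"         "eYes"
good_names
# [1] "(Intercept)" "eYes"
```

In the second fit the row with the missing value is dropped by the default `na.action`, and "No" is the reference level as expected.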
MrFlick