I fit a lot of GLMs in R. Usually I used revoScaleR::rxGlm()
for this because I work with large data sets and use quite complex model formulae - and glm()
just won't cope.
In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.
Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using revoScaleR::rxLogit()
although revoScaleR::rxGlm()
produces the same output - and has the same problem.
Consider this reprex:
df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
y = c(0, 1, 0, 1)) # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x # success rate
# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number
glm_1 <- glm(p ~ 1,
family = binomial,
data = df_reprex,
weights = x)
exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct
glm_2 <- rxLogit(p ~ 1,
data = df_reprex,
pweights = "x")
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect
The first call to glm()
produces the correct answer. The second call to rxLogit()
does not. Reading the docs for rxLogit()
: https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit it states that "Dependent variable must be binary".
So it looks like rxLogit()
needs me to use y
as the dependent variable rather than p
. However if I run
glm_2 <- rxLogit(y ~ 1,
data = df_reprex,
pweights = "x")
I get an overall average
exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))
of 0.5 instead, which also isn't the correct answer.
Does anyone know how I can fix this? Do I need to use an offset()
term in the model formula, or change the weights, or...
(by using the revoScaleR
package I occasionally painting myself into a corner like this, because not many other seem to use it)