Logistic Regression in R: glm() vs rxGlm()

Question

I fit a lot of GLMs in R. Usually I use RevoScaleR::rxGlm() for this because I work with large data sets and quite complex model formulae, and glm() just won't cope.

In the past these have all been based on Poisson or gamma error structures and log link functions. It all works well.

Today I'm trying to build a logistic regression model, which I haven't done before in R, and I have stumbled across a problem. I'm using RevoScaleR::rxLogit(), although RevoScaleR::rxGlm() produces the same output and has the same problem.

Consider this reprex:

df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
                        y = c(0, 1, 0, 1)) # number of successes

df_reprex$p <- df_reprex$y / df_reprex$x # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct

glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect

The call to glm() produces the correct answer. The call to rxLogit() does not. The docs for rxLogit() (https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/rxlogit) state that "Dependent variable must be binary".

So it looks like rxLogit() needs me to use y as the dependent variable rather than p. However, if I run

glm_2 <- rxLogit(y ~ 1,
                 data = df_reprex,
                 pweights = "x")

I get an overall average

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1]))

of 0.5 instead, which also isn't the correct answer.

Does anyone know how I can fix this? Do I need to use an offset() term in the model formula, or change the weights, or...

(By using the RevoScaleR package I occasionally paint myself into a corner like this, because not many others seem to use it.)

I'm not 100% sure here (I do not use revoScaleR) but can you try using fweights instead of pweights? Some discussion around pweights and fweights can be found here: https://www.statalist.org/forums/forum/general-stata-discussion/general/1413514-stata-fweights-versus-pweight. In this case, fweights seems more appropriate. — jav, Apr 21 '20 at 07:49

swihart · Answer 1 · 2020-04-24T18:44:29.820

I'm flying blind here because I can't verify these in RevoScaleR myself, but would you try running the code below and leave a comment with the results? I can then edit or delete this post accordingly.

Two things to try:

1. Expand the data and drop the weights argument.
2. Use cbind(y, x - y) ~ 1 in either rxLogit() or rxGlm(), without weights and without expanding the data.

If the dependent variable is required to be binary, then the data has to be expanded so that each row corresponds to a single 1 or 0 response, and this expanded data is then run in a glm() call without a weights argument.
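The expansion does not have to be typed by hand. A minimal base R sketch, assuming the same x (trials) and y (successes) columns as in the question; the names df_reprex_expanded_auto and row_id are made up here for illustration:

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

# For each original row, emit y ones followed by (x - y) zeros.
y_expanded <- unlist(Map(function(s, n) rep(c(1, 0), times = c(s, n - s)),
                         df_reprex$y, df_reprex$x))

# row_id (hypothetical name) records which original row each trial came from.
df_reprex_expanded_auto <- data.frame(
  y      = y_expanded,
  row_id = rep(seq_len(nrow(df_reprex)), times = df_reprex$x)
)

# One row per trial, so the plain mean of y is the overall success rate.
mean(df_reprex_expanded_auto$y)
```

The result has sum(x) rows, which is exactly why this route gets expensive on large data.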

I tried to demonstrate this with your example by applying labels to df_reprex and then making a corresponding df_reprex_expanded -- I know this is unfortunate, because you said the data you were working with was already large.
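A possible middle ground, offered as an untested idea as far as RevoScaleR goes: keep a strictly binary response but use at most two rows per original row (one for the successes, one for the failures), with the counts as frequency weights. The sketch below only checks this with base R glm(); the names df_long, outcome and n are made up for illustration, and the rxLogit() call in the final comment is an assumption that fweights behaves as a frequency weight.

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes

# Two rows per original row: successes (outcome = 1) and failures (outcome = 0).
df_long <- data.frame(
  outcome = rep(c(1, 0), each = nrow(df_reprex)),
  n       = c(df_reprex$y, df_reprex$x - df_reprex$y)  # trials with that outcome
)
df_long <- df_long[df_long$n > 0, ]  # drop rows that represent zero trials

# Binary response, counts supplied as weights.
glm_1c <- glm(outcome ~ 1, family = binomial, data = df_long, weights = n)
p_hat_long <- unname(plogis(coef(glm_1c)[1]))  # plogis(b) is exp(b) / (1 + exp(b))
p_hat_long

# Untested RevoScaleR analogue (assumption, not verified):
# rxLogit(outcome ~ 1, data = df_long, fweights = "n")
```

This at most doubles the number of rows instead of multiplying it by the number of trials.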

Does rxLogit allow a cbind() representation, like glm() does (I put an example as glm_1b)? That would allow the data to stay the same size. From the rxLogit page, I'm guessing not for rxLogit, but rxGlm might allow it, given the following note in the formula page:

A formula typically consists of a response, which in most RevoScaleR functions can be a single variable or multiple variables combined using cbind, the "~" operator, and one or more predictors, typically separated by the "+" operator. The rxSummary function typically requires a formula with no response.

Does glm_2b or glm_2c in the example below work?



df_reprex <- data.frame(x = c(1, 1, 2, 2), # number of trials
                        y = c(0, 1, 0, 1), # number of successes
                        trial=c("first", "second", "third", "fourth")) # trial label

df_reprex$p <- df_reprex$y / df_reprex$x # success rate

# overall average success rate is 2/6 = 0.333, so I hope the model outputs will give this number

glm_1 <- glm(p ~ 1,
             family = binomial,
             data = df_reprex,
             weights = x)

exp(glm_1$coefficients[1]) / (1 + exp(glm_1$coefficients[1])) # overall fitted average 0.333 - correct


df_reprex_expanded <- data.frame(y=c(0,1,0,0,1,0),
                                trial=c("first","second","third", "third", "fourth", "fourth"))

## binary dependent variable
## expanded data
## no weights
glm_1a <- glm(y ~ 1,
              family = binomial,
              data = df_reprex_expanded)


exp(glm_1a$coefficients[1]) / (1 + exp(glm_1a$coefficients[1])) # overall fitted average 0.333 - correct

## cbind(success, failures) dependent variable
## compressed data
## no weights
glm_1b <- glm(cbind(y,x-y)~1,
              family=binomial,
              data=df_reprex)

exp(glm_1b$coefficients[1]) / (1 + exp(glm_1b$coefficients[1])) # overall fitted average 0.333 - correct


glm_2 <- rxLogit(p ~ 1,
                 data = df_reprex,
                 pweights = "x")

exp(glm_2$coefficients[1]) / (1 + exp(glm_2$coefficients[1])) # overall fitted average 0.167 - incorrect

glm_2a <- rxLogit(y ~ 1,
                  data = df_reprex_expanded)

exp(glm_2a$coefficients[1]) / (1 + exp(glm_2a$coefficients[1])) # overall fitted average ???

# try cbind() in rxLogit.  If no, then try rxGlm below
glm_2b <- rxLogit(cbind(y, x - y) ~ 1,
                  data = df_reprex)

exp(glm_2b$coefficients[1]) / (1 + exp(glm_2b$coefficients[1])) # overall fitted average ???

# cbind() + rxGlm + family=binomial FTW(?)
glm_2c <- rxGlm(cbind(y, x - y) ~ 1,
                family = binomial,
                data = df_reprex)

exp(glm_2c$coefficients[1]) / (1 + exp(glm_2c$coefficients[1])) # overall fitted average ???
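
As a sanity check on the numbers in the question: for an intercept-only logistic model, the fitted probability is just the weighted mean of the response, so the target values can be worked out with plain arithmetic in base R. The variable names below (target, wmean_p, wmean_y, fitted_p) are only for illustration.

```r
df_reprex <- data.frame(x = c(1, 1, 2, 2),  # number of trials
                        y = c(0, 1, 0, 1))  # number of successes
df_reprex$p <- df_reprex$y / df_reprex$x    # success rate

target  <- sum(df_reprex$y) / sum(df_reprex$x)      # total successes / total trials
wmean_p <- weighted.mean(df_reprex$p, df_reprex$x)  # p weighted by trials
wmean_y <- weighted.mean(df_reprex$y, df_reprex$x)  # y weighted by trials

# glm() with the proportion as response and trials as weights lands on wmean_p.
glm_1 <- glm(p ~ 1, family = binomial, data = df_reprex, weights = x)
fitted_p <- unname(plogis(coef(glm_1)[1]))

c(target = target, wmean_p = wmean_p, wmean_y = wmean_y, fitted_p = fitted_p)
```

Here wmean_y comes out at 0.5, the same value the question reports for rxLogit(y ~ 1, pweights = "x"). That would be consistent with each row's y being treated as a single 0/1 outcome weighted by x rather than as a count of successes out of x trials, although this has not been verified in RevoScaleR.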
