8

I'm attempting to test a logistic regression model (e.g. 3 coefficients for 3 predictor variables, X1,X2,X3), on a dataset. I'm aware of how to test a model after i created the model object using, for example,

mymodel <- glm( Outcome ~  X1 + X2 + X3 , family = binomial,data=trainDat)

and then test the data

prob <- predict(mymodel,type="response",newdata=test)

But i want to, now, create a logistic model using coefficients and intercept that I have, and then test this model on data.

Basically I'm not clear on how to create "mymodel" without running glm.

Context for the question: I've run a logistic regression using offsets eg:

mymodel <- glm(Outcome ~ offset(C1 * X1) + offset(C2 * X2) + X3, 
               family = binomial, data = trainDat)

Thus, the mymodel object generates a model with only an intercept (I) and C3 coefficients (for feature X3).
I now need to test the full model (i.e. I + C1*X1 + C2*X2 + C3*X3), on a test dataset, but I don't know how to get the full model, since the output of mymodel has only intercept and C3. So I thought my more general question was: "how do you manually build a logisitic regression model object?"

Thank you for your patience.

utapyngo
  • 6,946
  • 3
  • 44
  • 65
nerdlyfe
  • 487
  • 7
  • 21
  • So do you just want to save the `mymodel` object and reload it later so you don't have to run `glm` again? Does it take a long time to run or something? – MrFlick Jun 09 '14 at 02:21
  • And they also included the variance/covariance matrix? As well as all the factor level encodings? – MrFlick Jun 09 '14 at 02:24
  • no, no covariance matrices. yes, I know the factor level encoding. Shouldn't the covariance matrices be unnecessary?? I know you can manually make a logistic function in excel using only the coefficients and factor level rules. Here, i'll add something to my question to allow for multiple answer types. – nerdlyfe Jun 09 '14 at 02:28
  • 2
    Questions that have test data included will generate more interest. – IRTFM Jun 09 '14 at 04:34

1 Answers1

14

I could not find a simple function to do this. There is some code in the predict function that depends on having a fitted model (like determining the rank of the model). However, we can create a function to make a fake glm object that you can use with predict. Here's my first attempt at such a function

makeglm <- function(formula, family, data=NULL, ...) {
    dots <- list(...)
    out<-list()
    tt <- terms(formula, data=data)
    if(!is.null(data)) {
        mf <- model.frame(tt, data)
        vn <- sapply(attr(tt, "variables")[-1], deparse)

        if((yvar <- attr(tt, "response"))>0)
            vn <- vn[-yvar]
            xlvl <- lapply(data[vn], function(x) if (is.factor(x))
           levels(x)
        else if (is.character(x))
           levels(as.factor(x))
        else
            NULL)
        attr(out, "xlevels") <- xlvl[!vapply(xlvl,is.null,NA)]
        attr(tt, "dataClasses") <- sapply(data[vn], stats:::.MFclass)
    }
    out$terms <- tt
    coef <- numeric(0)
    stopifnot(length(dots)>1 & !is.null(names(dots)))
    for(i in seq_along(dots)) {
        if((n<-names(dots)[i]) != "") {
            v <- dots[[i]]
            if(!is.null(names(v))) {
                coef[paste0(n, names(v))] <- v
            } else {
                stopifnot(length(v)==1)
                coef[n] <- v
            }
        } else {
            coef["(Intercept)"] <- dots[[i]]
        }   
    }
    out$coefficients <- coef
    out$rank <- length(coef)
    out$qr <- list(pivot=seq_len(out$rank))
    out$family <- if (class(family) == "family") {
        family
    } else if (class(family) == "function") {
        family()
    } else {
        stop(paste("invalid family class:", class(family)))
    }
    out$deviance <- 1
    out$null.deviance <- 1
    out$aic <- 1
    class(out) <- c("glm","lm")
    out
}

So this function creates an object and passes all the values that predict and print expect to find on such an object. Now we can test it out. First, here's some test data

set.seed(15)
dd <- data.frame(
    X1=runif(50),
    X2=factor(sample(letters[1:4], 50, replace=T)),
    X3=rpois(50, 5),
    Outcome = sample(0:1, 50, replace=T)
)

And we can fit a standard binomial model with

mymodel<-glm(Outcome~X1+X2+X3, data=dd, family=binomial)

Which gives

Call:  glm(formula = Outcome ~ X1 + X2 + X3, family = binomial, data = dd)

Coefficients:
(Intercept)           X1          X2b          X2c          X2d           X3  
    -0.4411       0.8853       1.8384       0.9455       1.5059      -0.1818  

Degrees of Freedom: 49 Total (i.e. Null);  44 Residual
Null Deviance:      68.03 
Residual Deviance: 62.67    AIC: 74.67

Now let's say we wanted to try out model that we read in a publication on the same data. Here's how we use the makeglm function

newmodel <- makeglm(Outcome~X1+X2+X3, binomial, data=dd, 
    -.5, X1=1, X2=c(b=1.5, c=1, d=1.5), X3=-.15)

The first parameter is the formula of the model. This defines the response and the covariates just like you would when running glm. Next you specify the family like you would with glm(). And you need to pass a data frame so R can sniff the correct data types for each of the variables involved. This will also identify all of the factor variables and their levels using the data.frame. So this can be new data that's coded just like the fitted data.frame or it can be the original one.

Now we start to specify the coefficients to use in our model. The coefficients will be filled using the names of the parameters. The unnamed parameter will be used as the intercept. If you have a factor, you need to give coefficients to all the levels via named parameters. Here I just decided to round off the fitted estimates to "nice" numbers.

Now I can use our newmodel with predict.

predict(mymodel, type="response")
#         1         2         3         4         5
# 0.4866398 0.3553439 0.6564668 0.7819917 0.3008108

predict(newmodel, newdata=dd, type="response")

#         1         2         3         4         5
# 0.5503572 0.4121811 0.7143200 0.7942776 0.3245525

Here I call predict on the original model and on the new model using the old data with my specified coefficients. We can see the estimate of probability have changed a bit.

Now I haven't thoroughly tested this function so use at your own risk. I don't do as much error checking as I probably should. Maybe someone else does know of a better way.

MrFlick
  • 195,160
  • 17
  • 277
  • 295
  • That was an amazing effort, and I am in your debt. I greatly appreciate this and know others will too. I'll use this and review it and try to add some further insights below. – nerdlyfe Jun 09 '14 at 16:44
  • 1
    I've also the posted the code as a [gist](https://gist.github.com/MrFlick/ae299d8f3760f02de6bf). I'll post any updates I make to the code later there. – MrFlick Jun 09 '14 at 19:54