0

I am trying to model a logistic regression with a couple of variables. I see that one of my variables has a quadratic trend, by plotting response by that variable and fitting a loess curve on it. So, I want to add a quadratic term to my logistic regression model, to model this variable with a quadratic trend. I'm having some trouble figuring out how to do this in the best / most accurate way.

Ex below:

Create df:

set.seed(1)
df <- data.frame(response = c(rep(0,times=30),rep(1,times=20)),
                 var1 = runif(50,min=12,max=30),
                 var2 = c(runif(20,min=0,max=25),runif(10,min=30,max=50),runif(20,min=15,max=40)),
                 var3 = var2^2) # note that this is just var2 squared

Plot by the second variable to view quadratic trend

ggplot(df,aes(x=var2,y=response)) +
  geom_point() +
  geom_smooth(method="loess")+
  coord_cartesian(ylim = c(0,1))

test a few different model formulas

formulas <- list(response ~ var1 + var2,              # both vars linear
                 response ~ var1 + var2 + I(var2^2),  # add quad term for var2
                 response ~ var1 + I(var2^2),         # only quad term for var2
                 response ~ var1 + var2 + var3,       # add var3, which is var2^2
                 response ~ var1 + var3)              # only var1 and var3

# build a df of some model selection criteria:
selection <-  purrr::map_df(formulas, ~{   
  mod <- glm(.x, data= df, family="binomial")
  data.frame(formula = format(.x), 
             AIC = round(AIC(mod),2), 
             BIC = round(BIC(mod),2),
             R2adj = round(DescTools::PseudoR2(mod,which=c("McFaddenAdj")),4)
  )
}) %>% arrange(desc(AIC))

view selection criteria:

> selection
                             formula   AIC   BIC  R2adj
1        response ~ var1 + I(var2^2) 65.88 71.62 0.0211
2             response ~ var1 + var2 65.26 70.99 0.0304
3      response ~ var1 + var2 + var3 64.69 72.33 0.0389
4             response ~ var1 + var3 63.18 68.91 0.0613
5 response ~ var1 + var2 + I(var2^2) 45.09 52.74 0.3300

Basically I'm wondering- can someone explain to me why these are all different? What should I be using to use one term with a quadratic pattern? Why am I getting such different results?

Uwe Keim
  • 39,551
  • 56
  • 175
  • 291
Jake L
  • 987
  • 9
  • 21
  • could not reproduce your problem – Onyambu Feb 11 '20 at 00:09
  • Do you mean why do models 1 and 4 give different fit statistics from each other? As well as models 3 and 5? Btw, your model ordering in the output doesnt seem to match the model ordering when you assign the formulas list and you can't assign var3 in df like that. – Edward Feb 11 '20 at 00:15

1 Answers1

4

I get different results to you:

> selection
                             formula  AIC   BIC   R2adj
1 response ~ var1 + var2 + I(var2^2) 40.4 48.05  0.3997
2      response ~ var1 + var2 + var3 40.4 48.05  0.3997
3             response ~ var1 + var2 70.5 76.23 -0.0475
4        response ~ var1 + I(var2^2) 72.6 78.34 -0.0788
5             response ~ var1 + var3 72.6 78.34 -0.0788

Which makes sense to me. So I don't know what you did. Maybe you changed the data?

Edit: I'm thinking that you've got a floating var3 vector outside the df which is not the same as the one you think. I mean, it's not var2^2. Creating a data frame in base R is not the same as using third party packages such as dplyr, which allows you to create new variables from other variables "promised" to be created in the data frame. You should probably use the tibble function:

set.seed(1)
df <- tibble(response = c(rep(0,times=30), rep(1,times=20)),
             var1 = runif(50,min=12,max=30),
             var2 = c(runif(20,min=0,max=25), runif(10,min=30,max=50), runif(20,min=15,max=40)),
             var3 = var2^2)
Edward
  • 10,360
  • 2
  • 11
  • 26
  • 1
    OP could/should try starting from a clean R session (no reading existing `.RData` file: `rm(list=ls())` should also work for this) – Ben Bolker Feb 11 '20 at 00:44