1

I'm trying to calculate both the predicted probability values and marginal effects values (with p-values) for a categorical variable over time in a logistic regression model in R. Basically, I want to know 1) the predicted probability of the response variable (an event occurring) in each year for sample sites in one of 2 categories and 2) the average marginal effect of a site being in 1 category vs. the other in each year. I can get predicted probability values using the ggeffects package and marginal effects values from the margins package, but I haven't figured out a way to get both sets of values from a single package.

So my questions are 1) is there a package/method to get both of these sets of values, and 2) if I get the predicted probability values from ggeffects and the marginal effects values from margins, are these values compatible? Or are there differences in the ways that the packages treat the models that mean I can't assume the marginal effects from one correspond to the predicted probabilities of the other? 3) In the margins package, how can I get the average marginal effect of the interaction of two factor variables over time? And 4) how can I get margins() to work with a large dataset?

Here is some sample data:

### Make dataset
df <- data.frame(year = rep(2001:2010, each = 100), 
                 state = rep(c("montana", "idaho", 
                               "colorado", "wyoming", "utah"),
                             times = 10, each = 20), 
                 site_id = as.factor(rep(1:100, times = 10)),
                 cat_variable = as.factor(rep(0:1, times = 5, each = 10)),
                 ind_cont_variable = rnorm(100, mean = 20, sd = 5),
                 event_occurred = as.factor(sample(c(0, 1), 
                                                    replace = TRUE, 
                                                    size = 1000)))

### Add dummy columns for states
library(fastDummies)
df <- dummy_cols(df, 
                 select_columns = "state",
                 remove_first_dummy = TRUE)


I'm interested in the effects of the state and the categorical variable on the probability that the event occurred, and in how the effect of the state and categorical variable changed over time. Here's the model:

library(lme4)
fit_state <- glmer(event_occurred ~ ind_cont_variable +
                  cat_variable*year*state +
                  (1|site_id),
                data = df, 
                family = binomial(link = "logit"),
                nAGQ = 0,
                control = glmerControl(optimizer = "nloptwrap"))

I can use ggeffects to get the predicted probability values for each state and category combination over time:

library(ggeffects)
fit_pp_state <- data.frame(ggpredict(fit_state, 
                          terms = c("year [all]",
                                    "cat_variable",
                                    "state")))

head(fit_pp_state)
### x = year, predicted = predicted probability, group = categorical variable level, facet = state
#    x predicted std.error  conf.low conf.high group    facet
# 2001 0.2835665 0.3981910 0.1535170 0.4634655     0 colorado
# 2001 0.5911911 0.3762090 0.4089121 0.7514289     0    idaho
# 2001 0.5038673 0.3719418 0.3288209 0.6779708     0  montana
# 4 2001 0.7101610 0.3964843 0.5297327 0.8420101     0     utah
# 5 2001 0.5714579 0.3747205 0.3901606 0.7354088     0  wyoming
# 6 2001 0.6788503 0.3892568 0.4963910 0.8192719     1 colorado

This is really great for visualizing the changes in predicted probability over time in the 5 states. But I can't figure out how to go from these values to estimates of marginal effects using ggeffects. Using the margins package, I can get the marginal effect of the categorical variable over time, but I'm not sure how to interpret the outputs of the two different packages together or if that's even appropriate (my first two questions). In addition, I'm not sure how to get margins to give me the marginal effect of a sample site being in each combination of categorical variable level/state over time (bringing me to my third question):

library(margins)
fit_state_me <- summary(margins(fit_state, 
                                at = list(year = 2001:2010),
                                variables = "cat_variable"))
head(fit_state_me)
#       factor      year     AME     SE       z      p   lower
# cat_variable1 2001.0000  0.0224 0.0567  0.3953 0.6926 -0.0887
# cat_variable1 2002.0000  0.0146 0.0490  0.2978 0.7659 -0.0814
# cat_variable1 2003.0000  0.0062 0.0418  0.1478 0.8825 -0.0757
# cat_variable1 2004.0000 -0.0026 0.0359 -0.0737 0.9413 -0.0731
# cat_variable1 2005.0000 -0.0117 0.0325 -0.3604 0.7186 -0.0754
# cat_variable1 2006.0000 -0.0208 0.0325 -0.6400 0.5222 -0.0845

The actual dataset I'm using is fairly large (the csv of raw data is 1.51 GB and the regression model object is 1.29 GB when I save it as a .rds file). When I try to use margins() on my data, I get an error message:

Error: cannot allocate vector of size 369.5 Gb

Any advice for getting around this issue so that I can use this function on my data?

I'd be grateful for any tips-- packages I should check out, mistakes I'm making in my code or my conceptual understanding, etc. Thank you!

kseagull
  • 33
  • 4

0 Answers0