R: Is it possible to plot data around a fitted model?

Question

I work as part of a sales organization that is implementing a new initiative. Essentially, we are testing whether sending an email to a potential customer makes them more likely to show up and view a 'demo' of our product. My data is comprised of ~26,000 observations of interactions with a potential customer, some of which have had a 'demo reminder' (the term for the email) sent and some which have not. Each row of data also has columns breaking down the data further (how long was the call? how many calls did the salesperson make? did the call result in a demo being successfully held? etc.).

I've generated a generalized linear model in R using the data and it appears to be a good fit. Then, using the data that had no reminders sent, I plotted a predicted graph about what would have hypothetically happened had they sent one.

Here's what my code looks like:

library(car)
library(ggplot2)

#data
demo.reminder.data <- read.csv("demo mo mixed aggregate raw.csv")

#model
demo.glm.final <- glm(Demos_Held ~ Rep_Channel + Demo_sent + Contacts + Opportunities + Vertical + Total_calls_bucket + Rep_Location, data = demo.reminder.data, family = binomial(link = "logit"))

#null model and goodness of fit
demo.null <- glm(Demos_Held ~ 1, data = demo.reminder.data, family = 'binomial')
AIC(demo.null)
AIC(demo.glm.final)

#data with no demo reminders
demo.reminder.data.none.sent <- demo.reminder.data
demo.reminder.data.none.sent$Demo_sent <- "No Demo Reminder"

#data with demo reminders
demo.reminder.data.all.sent <- demo.reminder.data
demo.reminder.data.all.sent$Demo_sent <- "Demo Reminder"


#predict probability of hold with no reminder
demo.reminder.data$none.sent.pred <- predict(demo.glm.final, newdata=demo.reminder.data.none.sent, type="response")

#predict probability of hold with reminder
demo.reminder.data$all.sent.pred <- predict(demo.glm.final, newdata=demo.reminder.data.all.sent, type="response")


demo.reminder.data$abs.lift.pred <- demo.reminder.data$all.sent.pred - demo.reminder.data$none.sent.pred

#plot 1
qplot(none.sent.pred, abs.lift.pred, data=demo.reminder.data) + xlab("Probability of Hold - No Reminder") + ylab("Increase in Probability With Reminder") + ggtitle("Effect of Demo Reminders")

#plot 2
qplot(none.sent.pred, all.sent.pred, data = demo.reminder.data) + xlab("Probability of Hold - No Reminder") + ylab("Probability of Hold - With Reminder") + ggtitle("Effect of Demo Reminders")

Question/Problem: When I plot this data it looks way too perfect. It essentially shows something like a 65% increase in likelihood to show up to a demo for anything under 25% of initial likelihood and my gut tells me that one email does not have this kind of power. I suspect the problem is that I'm just plotting points to the fitted model and that's why I'm seeing this perfect log curve (would attach picture but given that this is my first post, my reputation isn't high enough). I imagine that the actual data would be more diffuse with a lot more of the points being under the curve (and some above the curve).

Is there a way for me to plot around the model to show what things would actually look like?

And more importantly, I suppose, is this methodology correct? I believe it is, but I could be missing something very obvious.

Thanks in advance!

Edit: got enough points to post a picture of the plot. [Image: Plot 1]

a good answer would take a while, but you would want to incorporate both uncertainty in the parameters and process (sampling) error. The `simulate()` method will be handy; you can do parametric bootstrapping by using `update(mymodel,data=transform(origdata,y=simulate(mymodel)))` and then `simulate()` from these models ... — Ben Bolker, Jan 27 '14 at 18:30
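
A minimal sketch of the parametric bootstrap suggested in the comment above, assuming a much simpler model than the one in the question. The data frame `toy`, its `Contacts` effect, and the number of replicates `n_boot` are made up for illustration; only `Demo_sent`, `Demos_Held`, and the two level names come from the question. It refits the model with `glm()` directly rather than through `update()`, which should be equivalent here.

```r
# Hypothetical toy data standing in for demo.reminder.data
set.seed(1)
n <- 500
lev <- c("No Demo Reminder", "Demo Reminder")
toy <- data.frame(
  Demo_sent = factor(sample(lev, n, replace = TRUE), levels = lev),
  Contacts  = rpois(n, 3)
)
eta <- -1 + 0.8 * (toy$Demo_sent == "Demo Reminder") + 0.2 * toy$Contacts
toy$Demos_Held <- rbinom(n, 1, plogis(eta))

fit <- glm(Demos_Held ~ Demo_sent + Contacts, data = toy, family = binomial)

# Counterfactual copies of the data: nobody / everybody gets a reminder
new_none <- toy
new_none$Demo_sent <- factor("No Demo Reminder", levels = lev)
new_all <- toy
new_all$Demo_sent <- factor("Demo Reminder", levels = lev)

# Parametric bootstrap: simulate a response from the fitted model,
# refit on the simulated response, and recompute the predicted lift
n_boot <- 200
lift_boot <- replicate(n_boot, {
  sim_data <- toy
  sim_data$Demos_Held <- simulate(fit)[[1]]
  refit <- glm(formula(fit), data = sim_data, family = binomial)
  predict(refit, newdata = new_all, type = "response") -
    predict(refit, newdata = new_none, type = "response")
})

# One row per observation: 2.5% and 97.5% quantiles of the bootstrapped lift
lift_ci <- t(apply(lift_boot, 1, quantile, probs = c(0.025, 0.975)))
colnames(lift_ci) <- c("lo", "hi")
head(lift_ci)
```

The `lo` and `hi` columns could be added to the plot as error bars or a ribbon around the fitted lift. They reflect parameter uncertainty only; drawing 0/1 outcomes from each refit with `rbinom()` would add the sampling error as well.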

jlhoward · Answer 1 · 2014-01-30T04:39:29.397

Unless you provide at least a (representative) sample of your data, you are bound to get vague responses. Like this one...

First, you are throwing everything plus the kitchen sink into your model, so it may be over-specified. Did you run summary(demo.glm.final) to see which of the parameters have p < 0.05? Have you looked at the diagnostics of the fit? In particular, using:

plot(demo.glm.final)

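This will show you the standard diagnostic plots (residuals vs. fitted values, a normal Q-Q plot of the residuals, and residuals vs. leverage), including whether there are cases with very high leverage. Keep in mind that for a binomial model the residuals are not expected to look normally distributed.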

Second, did you run stepAIC(...) (from the MASS package) on demo.glm.final? This will drop terms whose removal lowers the AIC.
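
As a rough illustration of the first two points, here is a sketch on made-up data. The data frame `toy` and its columns (`reminder`, `calls`, `noise`, `held`) are hypothetical, not the asker's variables; the only real calls are `summary()`, `AIC()`, and `stepAIC()` from MASS.

```r
library(MASS)  # provides stepAIC()

# Hypothetical toy data; column names are illustrative only
set.seed(123)
n <- 400
toy <- data.frame(
  reminder = rbinom(n, 1, 0.5),
  calls    = rpois(n, 4),
  noise    = rnorm(n)  # unrelated to the outcome by construction
)
toy$held <- rbinom(n, 1, plogis(-1.5 + 1.0 * toy$reminder + 0.25 * toy$calls))

fit_full <- glm(held ~ reminder + calls + noise, data = toy, family = binomial)

# Coefficient table: estimate, standard error, z value, p-value
coef_table <- coef(summary(fit_full))
print(round(coef_table, 3))

# Stepwise selection by AIC (backward when no scope is supplied)
fit_step <- stepAIC(fit_full, trace = FALSE)
print(formula(fit_step))
print(c(full = AIC(fit_full), stepped = AIC(fit_step)))
```

Stepwise selection is only a screening aid: the p-values of the model it returns do not account for the selection step, so they are best read with some caution.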

Third, you are comparing the full dataset with Demo_sent artificially set to "Demo Reminder", to the full dataset with Demo_sent artificially set to "No Demo Reminder". A better comparison might be to look at only those records where there was no reminder sent, and predict the effect of sending a reminder in those cases only:

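## not tested...
library(ggplot2)

# keep only the records where no reminder was actually sent
test.data           <- subset(demo.reminder.data, Demo_sent == "No Demo Reminder")
# predicted probability of a held demo without a reminder
test.data$pred.no   <- predict(demo.glm.final, newdata = test.data, type = "response")
# flip the flag and predict again for the same records
test.data$Demo_sent <- "Demo Reminder"
test.data$pred.yes  <- predict(demo.glm.final, newdata = test.data, type = "response")
ggplot(test.data) + geom_line(aes(x = pred.no, y = pred.yes))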
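
Either way, a plot of one set of model predictions against another will trace a smooth curve. In a logit model with no interaction terms involving the reminder flag, the with-reminder probability is an exact function of the no-reminder probability: the same value shifted by the reminder coefficient on the log-odds scale. A small sketch on made-up data (the data frame `toy` and its columns are hypothetical) checks this:

```r
# Hypothetical toy data; not the asker's variables
set.seed(42)
n <- 1000
toy <- data.frame(reminder = rbinom(n, 1, 0.5), calls = rpois(n, 4))
toy$held <- rbinom(n, 1, plogis(-2 + 1.5 * toy$reminder + 0.3 * toy$calls))

fit <- glm(held ~ reminder + calls, data = toy, family = binomial)

# Counterfactual predictions for every record
p_no  <- predict(fit, newdata = transform(toy, reminder = 0), type = "response")
p_yes <- predict(fit, newdata = transform(toy, reminder = 1), type = "response")

# The same with-reminder probabilities, rebuilt from p_no alone:
# go to the log-odds scale, add the reminder coefficient, go back
beta <- coef(fit)[["reminder"]]
p_yes_from_p_no <- plogis(qlogis(p_no) + beta)

all.equal(unname(p_yes), unname(p_yes_from_p_no))
```

So the smooth curve says nothing about how well the model describes the observed outcomes; it only restates the fitted reminder coefficient.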
