How to predict with a regression model with many missing values?

Question

I want to build a regression model with a binary (dummy) dependent variable. I fit the model with the glm function, but I can't get predictions from it. I don't want to exclude the missing values. What is the best way to make good predictions when the data set has many missing values?

library(caret)     # createDataPartition()
library(magrittr)  # %>%

n$status <- as.factor(n$status)

set.seed(900)
training.samples <- n$status %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- n[training.samples, ]
test.data <- n[-training.samples, ]

model <- glm(status ~ ., data = train.data, family = binomial(link = "logit"))

m <- data.frame(x1 = mean(n$x1, na.rm = TRUE), x2 = mean(n$x2, na.rm = TRUE))
m$predictprob <- predict(model, newdata = m, type = "response")

Error in eval(predvars, data, env) : object 'x1' not found

This error appears when I try to predict. I think it must be caused by the missing values.

str(n) 
'data.frame': 4371 obs. of 8 variables: 
$ status: Factor w/ 2 levels "Active","Inactive": 1 1 1 1 1 1 1 1 1 1 ... 
$ x1 : num 12.2 12.4 13.1 10.9 22.7 ... 
$ x2 : num 4.27 2.17 5.91 5.81 7.44 ... 
$ x3 : num 8.3 7.71 12.41 9.34 19.57 ... 
$ x4 : num 2.91 1.34 5.61 4.99 6.43 ... 
$ x5 : num 4.51 1.83 9.11 10.68 14.23 ... 
$ x6 : num 3.7 4.94 12.27 11.29 15.13 ... 
$ x7 : num 2.22 3.4 1.12 0.84 1.11 4.07 8.15 0.79 8.16 8.86 ...

dput(train.data[1:10,])
structure(list(Status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("Active", "Inactive"), class = "factor"), 
    x1 = c(12.17, 12.41, 13.07, 10.88, 22.66, 43.54, 64.75, 
    255.43, 10.05, 1.84), x2 = c(4.27, 2.17, 5.91, 5.81, 7.44, 
    17.17, 22.51, 9.29, 0.78, 0.42), x3 = c(8.3, 7.71, 12.41, 
    9.34, 19.57, 33.7, 48.1, 252.75, 6.89, 2.24), x4 = c(2.91, 
    1.34, 5.61, 4.99, 6.43, 13.29, 16.72, 9.19, 0.53, 0.51), 
    x5 = c(4.51, 1.83, 9.11, 10.68, 14.23, 8.99, 7.94, 19.73, 
    1.09, 0.2), x6 = c(3.7, 4.94, 12.27, 11.29, 15.13, 9.07, 
    7.94, 21.21, 0.96, 0.02), x7 = c(2.22, 3.4, 1.12, 0.84, 
    1.11, 4.07, 8.15, 0.79, 8.16, 8.86)), row.names = c(NA, 
10L), class = "data.frame")

Do you only have the two predictors in the model? Does the `glm` model evaluate without errors? Are the variables definitely called `x1` and `x2`? (ps `n$status <-as.factor$status` should probably be `n$status <-as.factor(n$status)`) — user20650, Jul 28 '20 at 14:45
after you get this sorted you may want to look at imputation (if it is suitable): https://stefvanbuuren.name/mice/ — user20650, Jul 28 '20 at 14:48
I have 5 more variables but I put it this way to make an example. The glm runs perfectly. But I can't predict, the name of this variable always appears. — Luís Vasconcelos, Jul 28 '20 at 15:22
As suggested by comment above, multiple imputation is the only way to go. But before doing one of the many types of multiple imputation I suggest investigating the pattern of missingness. There are good packages for this like: naniar, VIM, mice. Then do multiple imputation accordingly (most used packages are Amelia and mice, but there are other great ones). — Claudiu Papasteri, Jul 28 '20 at 15:27
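To illustrate the "investigate the pattern of missingness" step, here is a minimal base R sketch. The data frame `sim` is simulated and only stands in for the asker's `n`; the packages named in the comment (naniar, VIM, mice) offer richer summaries and plots, for example `mice::md.pattern()`.

```r
# Simulated stand-in for the real data (assumption, not the asker's `n`)
set.seed(900)
sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
sim$x1[sample(50, 10)] <- NA
sim$x2[sample(50, 5)] <- NA

# Proportion of missing values in each column
na_share <- colMeans(is.na(sim))

# Proportion of rows with no missing value at all;
# glm() keeps only these rows under the default na.action
complete_share <- mean(complete.cases(sim))

# Distinct missingness patterns and how often each occurs.
# Each pattern is a string of 0/1 flags, one per column (1 = missing).
patterns <- table(apply(is.na(sim), 1, function(r) paste(as.integer(r), collapse = "")))

na_share
complete_share
patterns
```

If most rows are incomplete, or if missingness clusters in a few columns, that is worth knowing before choosing an imputation method.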
re "I have 5 more variables ... ". You need to add a value for all predictors into the predict statement. But for more help you will likely need to share an example of your data. As a start could you add the results of `str(n)` to your question please — user20650, Jul 28 '20 at 15:27
I tried mice but I don't know how to predict with a factor variable, I only know with a numeric one. — Luís Vasconcelos, Jul 28 '20 at 15:29
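One possible way to predict a factor outcome after imputing with mice is sketched below. It is not the only approach: it fits a logistic `glm` on each completed data set and averages the predicted probabilities. The data frame `sim`, the new case `new_obs`, the choice of `m = 5` imputations, and the 0.5 cutoff are all assumptions made for illustration.

```r
# Simulated stand-in for the real data (assumption, not the asker's `n`)
set.seed(900)
n_obs <- 200
sim <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs))
sim$status <- factor(ifelse(runif(n_obs) < plogis(sim$x1 - sim$x2),
                            "Inactive", "Active"))
sim$x1[sample(n_obs, 30)] <- NA
sim$x2[sample(n_obs, 30)] <- NA

# Hypothetical new case with a value for every predictor
new_obs <- data.frame(x1 = 0.5, x2 = -0.2)

if (requireNamespace("mice", quietly = TRUE)) {
  # Create 5 imputed versions of the data
  imp <- mice::mice(sim, m = 5, printFlag = FALSE, seed = 900)

  # Fit the model on each completed data set and predict the new case
  probs <- sapply(seq_len(imp$m), function(i) {
    completed <- mice::complete(imp, i)
    fit <- glm(status ~ x1 + x2, data = completed,
               family = binomial(link = "logit"))
    predict(fit, newdata = new_obs, type = "response")
  })

  # With a factor response, glm() models the probability of the
  # second level, here "Inactive"
  pred_prob <- mean(probs)
  pred_class <- levels(sim$status)[1 + (pred_prob > 0.5)]
}
```

The factor outcome needs no special handling in `glm` itself; `type = "response"` returns a probability, which is then mapped back to a level of `status`.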
How can I add a value for all predictors into the predict statement? — Luís Vasconcelos, Jul 28 '20 at 15:33
Can you add the details to your questions please by clicking on edit -- it makes it a lot easier to read. Thanks — user20650, Jul 28 '20 at 15:34
re "How can I add a value f..." ; you add it the same way as you added it for x1 and x2 — user20650, Jul 28 '20 at 15:35
str(n) 'data.frame': 4371 obs. of 8 variables: $ status: Factor w/ 2 levels "Active","Inactive": 1 1 1 1 1 1 1 1 1 1 ... $ x1 : num 12.2 12.4 13.1 10.9 22.7 ... $ x2 : num 4.27 2.17 5.91 5.81 7.44 ... $ x3 : num 8.3 7.71 12.41 9.34 19.57 ... $ x4 : num 2.91 1.34 5.61 4.99 6.43 ... $ x5 : num 4.51 1.83 9.11 10.68 14.23 ... $ x6 : num 3.7 4.94 12.27 11.29 15.13 ... $ x7 : num 2.22 3.4 1.12 0.84 1.11 4.07 8.15 0.79 8.16 8.86 ... — Luís Vasconcelos, Jul 28 '20 at 15:40
Thanks for the details. Given the code you have shared, your regression model, `model`, should have seven predictors. When you create the data frame `m` to pass as `newdata` to the `predict` function, you need to supply a value, or values, for each predictor in the original model. So you will need values for `x1` through `x7`. — user20650, Jul 28 '20 at 15:45
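The point about `newdata` can be reproduced on a small scale. The sketch below uses simulated data (`sim`, with three predictors rather than seven) as a stand-in for `n`. It first triggers the "object not found" error with an incomplete `newdata`, then builds a complete one from the predictors the fitted model actually uses.

```r
# Simulated stand-in for the real data (assumption, not the asker's `n`)
set.seed(900)
n_obs <- 100
sim <- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs), x3 = rnorm(n_obs))
sim$status <- factor(ifelse(runif(n_obs) < plogis(sim$x1), "Inactive", "Active"))
sim$x2[1:10] <- NA

# `status ~ .` uses every other column as a predictor
model <- glm(status ~ ., data = sim, family = binomial(link = "logit"))

# Incomplete newdata: x3 is absent, so predict() fails.
# tryCatch() captures the error message instead of stopping.
partial <- data.frame(x1 = mean(sim$x1, na.rm = TRUE),
                      x2 = mean(sim$x2, na.rm = TRUE))
err <- tryCatch(predict(model, newdata = partial, type = "response"),
                error = function(e) conditionMessage(e))
err

# Complete newdata: one column per predictor in the fitted model
predictors <- all.vars(delete.response(terms(model)))
m <- as.data.frame(lapply(sim[predictors], mean, na.rm = TRUE))
m$predictprob <- predict(model, newdata = m, type = "response")
m
```

The error comes from the missing column, not from the `NA` values: `glm` drops incomplete rows when fitting, and column names in `newdata` must match the model's predictor names exactly, including case.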
Thanks a lot. Like this: m <- data.frame(x1=mean(n$x1,na.rm=T),x2=mean(n$x2,na.rm=T),x3=mean(n$x2,na.rm=T),x4=mean(n$x2,na.rm=T),x5=mean(n$x2,na.rm=T),x6=mean(n$x2,na.rm=T),x7=mean(n$x2,na.rm=T))? — Luís Vasconcelos, Jul 28 '20 at 16:11
Did you look at `m` -- are there values for each variable? Can you edit your question (there is a button that says **edit** bottom left of your question) with the results of `dput(train.data[1:10,])` please. — user20650, Jul 28 '20 at 16:33
Thanks. If I use your dput `n` as the data in your model and run `model = glm(Status ~. , data = n, family = binomial(link = "logit")); m = as.data.frame(lapply(n[-1], mean, na.rm=TRUE)); m$predictprob = predict(model, newdata=m, type="response")` it executes as expected. Things to note: your data has `Status` but the code in your question has `status`. Also, your dput got a bit mangled, so I had to tweak the format to get it to run. — user20650, Jul 28 '20 at 20:17
Thank you very much. It no longer gives the error. I am doing this with a colleague so he has Status and I have status, this is a little confusing because we are both responding, I am sorry. Thank you for the help! — Luís Vasconcelos, Jul 29 '20 at 13:30
