I want to build a regression model with a binary (dummy) variable as the dependent variable.
I'm using the glm function, but I can't get predictions from the fitted model. I don't want to exclude the rows with missing values. What is the best way to make good predictions when the data set has many missing values?
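One simple way to keep rows with missing values is mean imputation. Below is a minimal sketch on made-up data, shown only to make the question concrete; `toy` and `impute_mean` are hypothetical names, and whether this gives good predictions for the real data is exactly what is unclear.

```r
# Hypothetical toy data with missing values (not the real database)
toy <- data.frame(
  status = factor(c("Active", "Inactive", "Active",
                    "Inactive", "Active", "Inactive")),
  x1 = c(12.2, NA, 13.1, 10.9, NA, 43.5),
  x2 = c(4.27, 2.17, NA, 5.81, 7.44, 17.2)
)

# Count the missing values in each column
colSums(is.na(toy))

# Replace each NA in a numeric vector with the mean of the observed values
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

# Apply the imputation to the numeric columns only, leaving status untouched
num_cols <- sapply(toy, is.numeric)
toy_imputed <- toy
toy_imputed[num_cols] <- lapply(toy[num_cols], impute_mean)
```

In practice the means would probably be computed on the training split only and then reused for the test split.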
library(caret)  # createDataPartition()
library(dplyr)  # %>% (assumed; any package that exports the pipe works)

n$status <- as.factor(n$status)
set.seed(900)
training.samples <- n$status %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- n[training.samples, ]
test.data <- n[-training.samples, ]
model <- glm(status ~ ., data = train.data, family = binomial(link = "logit"))
m <- data.frame(x1 = mean(n$x1, na.rm = TRUE), x2 = mean(n$x2, na.rm = TRUE))
m$predictprob <- predict(model, newdata = m, type = "response")
Error in eval(predvars, data, env) : object 'x1' not found
This error appears when I try to make the prediction. I suspect it is caused by the missing values.
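One thing that may be worth checking first: predict() looks up every predictor used in the model formula in newdata, so with status ~ . the one-row data frame needs a column for each of x1 to x7, not only x1 and x2. The sketch below illustrates this on simulated data; `n_sim`, `model_sim`, `m_sim`, and `predictor_cols` are hypothetical names, and the simulated values are not the real database.

```r
# Simulated stand-in for n: seven numeric predictors and a two-level factor
set.seed(900)
x_sim <- as.data.frame(matrix(rexp(200 * 7, rate = 0.1), ncol = 7,
                              dimnames = list(NULL, paste0("x", 1:7))))

# Introduce some missing values in two of the predictors
x_sim$x1[sample(200, 20)] <- NA
x_sim$x2[sample(200, 20)] <- NA

n_sim <- data.frame(
  status = factor(sample(c("Active", "Inactive"), 200, replace = TRUE)),
  x_sim
)

# With the default na.action (na.omit), glm() drops rows containing NA
model_sim <- glm(status ~ ., data = n_sim,
                 family = binomial(link = "logit"))

# Build newdata with one column per predictor (column means, ignoring NA)
predictor_cols <- setdiff(names(n_sim), "status")
m_sim <- as.data.frame(lapply(n_sim[predictor_cols], mean, na.rm = TRUE))

# Predicted probability of the second factor level ("Inactive")
m_sim$predictprob <- predict(model_sim, newdata = m_sim, type = "response")
```

Because the column names are taken from the data frame itself, newdata stays in sync with the formula if predictors are added or removed.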
str(n)
'data.frame': 4371 obs. of 8 variables:
$ status: Factor w/ 2 levels "Active","Inactive": 1 1 1 1 1 1 1 1 1 1 ...
$ x1 : num 12.2 12.4 13.1 10.9 22.7 ...
$ x2 : num 4.27 2.17 5.91 5.81 7.44 ...
$ x3 : num 8.3 7.71 12.41 9.34 19.57 ...
$ x4 : num 2.91 1.34 5.61 4.99 6.43 ...
$ x5 : num 4.51 1.83 9.11 10.68 14.23 ...
$ x6 : num 3.7 4.94 12.27 11.29 15.13 ...
$ x7 : num 2.22 3.4 1.12 0.84 1.11 ...
dput(train.data[1:10,])
structure(list(Status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("Active", "Inactive"), class = "factor"),
x1 = c(12.17, 12.41, 13.07, 10.88, 22.66, 43.54, 64.75,
255.43, 10.05, 1.84), x2 = c(4.27, 2.17, 5.91, 5.81, 7.44,
17.17, 22.51, 9.29, 0.78, 0.42), x3 = c(8.3, 7.71, 12.41,
9.34, 19.57, 33.7, 48.1, 252.75, 6.89, 2.24), x4 = c(2.91,
1.34, 5.61, 4.99, 6.43, 13.29, 16.72, 9.19, 0.53, 0.51),
x5 = c(4.51, 1.83, 9.11, 10.68, 14.23, 8.99, 7.94, 19.73,
1.09, 0.2), x6 = c(3.7, 4.94, 12.27, 11.29, 15.13, 9.07,
7.94, 21.21, 0.96, 0.02), x7 = c(2.22, 3.4, 1.12, 0.84,
1.11, 4.07, 8.15, 0.79, 8.16, 8.86)), row.names = c(NA, 10L), class =
"data.frame")