I am new to R and am trying to understand a solution to a logistic regression problem. So far, the code removes unused variables and splits the data into training and test sets. I am trying to understand the part that uses model.matrix. I am just getting into R and statistics, and I am not sure what model.matrix is or what contrasts are. Here is the code:

## create design matrix; indicators for categorical variables (factors)
Xdel <- model.matrix(delay~.,data=DataFD_new)[,-1]
xtrain <- Xdel[train,]
xnew <- Xdel[-train,]
ytrain <- del$delay[train]
ynew <- del$delay[-train]
m1=glm(delay~.,family=binomial,data=data.frame(delay=ytrain,xtrain))
summary(m1)

Can someone please explain what model.matrix is used for? Why can't we directly create dummy variables for the categorical variables and pass them to glm? I am confused.

lakshru
  • You can create dummy variables for categoricals if you want, but you usually don't need to. As long as your categorical variables are correctly coded as factors, calling `glm(y ~ catvar1 + catvar2)` will automatically use dummy-coded coefficients for each level of `catvar1` and `catvar2`, with no need to directly use `model.matrix`. – Marius Aug 18 '17 at 05:01
  • Thank you. Then what is the use of model.matrix as in the above case? – lakshru Aug 18 '17 at 05:02
  • As the documentation states, it creates a design matrix. To understand what this is, you may need to dig into the maths of modeling. You will need it if you'll be doing it by hand, but you really shouldn't. `glm` takes care of everything, plus, there's a bunch of accessory functions to make your life easier. – Roman Luštrik Aug 18 '17 at 05:04
  • Thank you. I am so new to this and I find model.matrix hard to understand. Could you suggest another way to get this done without model.matrix? Could you give me the syntax for it? – lakshru Aug 18 '17 at 17:20
  • @Marius' comment tells you how to do this. I'll post some code in an answer to show you how (just discovered that I can't post code within this comment...) – jruf003 Aug 21 '17 at 00:37

1 Answer

Marius' comment explains how to do this; the code below just gives an example (which I felt was helpful since the poster was still confused).

# Create example dataset. 'catvar' represents a categorical variable despite being coded with numbers.
X = data.frame("catvar" = sample(c(1, 2, 3), 100, replace = TRUE),
               "numvar" = rnorm(100),
               "y" = sample(c(0, 1), 100, replace = TRUE))

# Check whether your categorical variables are coded correctly. (They'll say 'factor' if so)
sapply(X, class) # catvar is coded as 'numeric', which is wrong.

# Tell 'R' that catvar is categorical. If your categorical variables are already classed as factors, you can skip this step
X$catvar = factor(X$catvar)
sapply(X, class) # check all variables are coded correctly

# Fit model to dataframe (i.e. without needing to convert X to a model matrix)
fit = glm(y ~ numvar + catvar, data = X, family = "binomial")
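
To see what model.matrix actually does, and why glm usually makes it unnecessary, here is a minimal sketch. It uses a hypothetical toy data frame (`toy`, with made-up levels "a", "b", "c"), not the data from the question. It builds the design matrix by hand, drops the intercept column the way `[,-1]` does in the question, and checks that both routes give the same coefficient estimates.

```r
# Hypothetical toy data (assumption: not the asker's dataset).
set.seed(1)
toy <- data.frame(
  catvar = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  numvar = rnorm(100),
  y      = sample(c(0, 1), 100, replace = TRUE)
)

# model.matrix() expands the formula into a numeric design matrix:
# an intercept column, numvar, and one 0/1 indicator column for each
# non-reference level of catvar (the first level, "a", is the reference).
mm <- model.matrix(y ~ numvar + catvar, data = toy)
print(colnames(mm))
print(head(mm))

# Dropping column 1 removes the intercept, as [,-1] does in the question.
x <- mm[, -1]

# Fit once with the formula interface and once with the hand-built matrix.
fit_formula <- glm(y ~ numvar + catvar, data = toy, family = binomial)
fit_matrix  <- glm(y ~ ., data = data.frame(y = toy$y, x), family = binomial)

# Compare the coefficient estimates from the two fits.
same_coefs <- all.equal(unname(coef(fit_formula)), unname(coef(fit_matrix)))
print(same_coefs)
```

In other words, glm calls the same machinery internally when you hand it a formula and a data frame with factors, so building the matrix yourself is mostly useful for functions that require a numeric matrix as input.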
jruf003