
I have a data file (1 million rows) with one binary outcome variable, status (yes/no), plus three continuous variables and five nominal variables (five categories each). I want to predict the outcome, i.e. status. Which type of analysis is suitable for building the model? I have seen logit, probit, and logistic regression mentioned, and I am confused about where to start and how to identify the variables that are most likely to be useful.

Data file:

gender,region,age,company,speciality,jobrole,diag,labs,orders,status

M,west,41,PA,FPC, Assistant,code18,27,3,yes

M,Southwest,65,CV,FPC,Worker,code18,69,11,no

M,South,27,DV,IMC,Assistant,invalid,62,13,no

M,Southwest,18,CV,IMC,Worker,code8,6,1,yes

PS: I am using R. Any help would be greatly appreciated. Thanks!

Malay Revanth
  • If you need help with model selection, you should ask over at [stats.se] where statistical questions are on topic (it doesn't matter that you want to do this "in R"). Once you know what model to use, then you should be able to search how to do it in R. – MrFlick Aug 05 '16 at 04:19
  • Try searching for multiple regression with dummy variables; this question is better suited to Cross Validated. – Waqas Aug 05 '16 at 04:31
  • Decision tree algorithms like [C5.0](https://cran.r-project.org/web/packages/C50/index.html) can be quite powerful in binary classification tasks involving a combination of continuous and nominal variables. – RHertel Aug 05 '16 at 05:37

1 Answer


Of the three, most people start their analysis with logistic regression.

Note that logistic regression and the logit model are the same thing.

When deciding between logistic and probit, go for logistic.

Probit is sometimes said to return results faster, but the two usually lead to very similar conclusions, and logistic has the edge for interpretation because its coefficients can be read as log odds ratios.
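To see how close the two links tend to be in practice, here is a minimal sketch on simulated data. The data frame, the effect sizes, and the object names fit_logit, fit_probit and agreement are all made up for illustration; only the column names come from the question.

```r
# Simulated stand-in for the real file (assumption: not the asker's data).
set.seed(1)
n <- 2000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 40),
  orders = rpois(n, 6),
  region = factor(sample(c("west", "South", "Southwest"), n, replace = TRUE))
)

# Assumed relationship, used only to generate example outcomes.
p <- plogis(-1 + 0.03 * (df$age - 40) + 0.15 * (df$orders - 6))
df$status <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# Same formula, two link functions.
fit_logit  <- glm(status ~ ., data = df, family = binomial(link = "logit"))
fit_probit <- glm(status ~ ., data = df, family = binomial(link = "probit"))

# Compare fit quality and how closely the predicted probabilities agree.
AIC(fit_logit, fit_probit)
agreement <- cor(fitted(fit_logit), fitted(fit_probit))
agreement

# Only the logit coefficients exponentiate to odds ratios.
exp(coef(fit_logit))
```

If the AIC values and fitted probabilities are nearly the same on your own data too, the choice between the links comes down to which coefficients you prefer to interpret.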

Now, to settle on the variables: you can vary which variables you include in the model. Start with all of them (note that glm() needs status to be a factor or coded 0/1):

model1 <- glm(status ~., data = df, family = binomial(link = 'logit'))

Next, check the model summary with summary(model1) and look at which predictor variables are significant.
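A minimal sketch of what that check could look like, again on simulated data (the data frame and effect size are assumptions; the names coefs and lrt are hypothetical). Since the nominal variables are split into one dummy per level in the summary, drop1() is one way to test each variable as a whole.

```r
# Simulated example data (assumption: only age truly affects status here).
set.seed(42)
n <- 1000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 40),
  region = factor(sample(c("west", "South", "Southwest"), n, replace = TRUE))
)
df$status <- factor(
  ifelse(runif(n) < plogis(0.05 * (df$age - 40)), "yes", "no"),
  levels = c("no", "yes")
)

model1 <- glm(status ~ ., data = df, family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value and p-value per term,
# sorted so the smallest p-values come first.
coefs <- summary(model1)$coefficients
coefs[order(coefs[, "Pr(>|z|)"]), ]

# Likelihood-ratio test for dropping each whole variable
# (all levels of a factor are tested together).
lrt <- drop1(model1, test = "Chisq")
lrt
```

With 1 million rows almost everything may come out statistically significant, so it is worth looking at the size of the estimates as well as the p-values.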

model2 <- glm(status ~ gender + region + age + company + speciality + jobrole + diag + labs, data = df, family = binomial(link = 'logit'))

By reducing the number of variables (model2 leaves out orders) and comparing the fits, you will be better able to identify which variables are important.
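One way to compare a full and a reduced model is a likelihood-ratio test plus AIC. The sketch below uses simulated data in which orders has no real effect (an assumption made for the example); cmp and aic are hypothetical names.

```r
# Simulated example data (assumption: status depends on age only).
set.seed(7)
n <- 1000
df <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 40),
  orders = rpois(n, 6)
)
df$status <- factor(
  ifelse(runif(n) < plogis(0.05 * (df$age - 40)), "yes", "no"),
  levels = c("no", "yes")
)

# Full model, and a reduced model that leaves out orders.
model1 <- glm(status ~ ., data = df, family = binomial(link = "logit"))
model2 <- glm(status ~ age + labs, data = df, family = binomial(link = "logit"))

# Likelihood-ratio test: does adding orders improve the fit?
cmp <- anova(model2, model1, test = "Chisq")
cmp

# Lower AIC suggests the better trade-off between fit and complexity.
aic <- AIC(model1, model2)
aic
```

A large p-value in the anova() table suggests the dropped variable was not contributing much; step() can automate this kind of search, though it should be used with some care.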

Also, make sure you have cleaned the data before fitting any model.
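As a rough illustration of the kind of cleaning that may be needed, here is a sketch using the four sample rows from the question. Treating "invalid" in diag as a missing value is an assumption on my part; adjust it to whatever that code really means in your data.

```r
# The sample rows from the question, read as text for a self-contained example.
raw <- "gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC, Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes"
df <- read.csv(text = raw, stringsAsFactors = FALSE)

# Trim stray whitespace so " Assistant" and "Assistant" become one category.
is_text <- vapply(df, is.character, logical(1))
df[is_text] <- lapply(df[is_text], trimws)

# Assumption: "invalid" marks a missing diagnosis code.
df$diag[df$diag == "invalid"] <- NA

# Nominal predictors become factors; the response becomes a factor whose
# second level ("yes") is the event that glm() will model.
predictors <- setdiff(names(df)[is_text], "status")
df[predictors] <- lapply(df[predictors], factor)
df$status <- factor(df$status, levels = c("no", "yes"))

# Inspect column types and count missing values per column.
str(df)
colSums(is.na(df))
```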

Avoid including highly correlated variables; you can check the numeric ones using cor().
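A small sketch of that check on simulated data, where orders is deliberately built to track labs. The 0.8 cutoff and the names cm and high are illustrative assumptions, not a fixed rule. Note that cor() only accepts numeric columns, so the nominal variables have to be left out.

```r
# Simulated numeric predictors (assumption: orders is constructed from labs).
set.seed(3)
n <- 500
df <- data.frame(
  age  = round(runif(n, 18, 70)),
  labs = rpois(n, 40)
)
df$orders <- round(df$labs / 5 + rnorm(n, sd = 0.5))

# Correlation matrix of the numeric columns only.
num_cols <- vapply(df, is.numeric, logical(1))
cm <- cor(df[, num_cols])
round(cm, 2)

# List each pair once (upper triangle) whose absolute correlation
# exceeds the illustrative cutoff of 0.8.
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
```

If a pair is flagged, keeping only one of the two variables (or combining them) usually gives more stable coefficient estimates.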

Pj_