Risk assessment models in R: getting the probability for specific levels of a factor

Question

I am working as a risk analyst, and my boss has assigned me a task that I don't know how to do.

I want to get the probability of an outcome under some specific conditions. For example, the data would look like this:

sex      hair_color Credit_Score Loan_Status
"Male"    "Red"      "256"        "bad"        
"Female"  "black"    "133"        "bad"        
"Female"  "brown"    "33"         "bad"        
"Male"    "yellow"   "123"        "good"

We want to predict the Loan_Status for each customer. What I can do is treat "sex", "hair_color" and "credit_score" as factors and put them into glm() in R.

But my boss wants to know "if a new customer who is male, red hair, what's the probability his loan status will be 'good'?"

or "What's the probability of a male customer's loan status being 'good'?"

What kind of method should I use? How do I get the probability? I'm thinking about marginal distributions, but I don't know whether that would work or how to compute it.

I hope I have made this question easy to understand. To anyone who helps me, thank you very much for your time.

If you are "working as a risk analyst" you should know how to do something related to risk estimation. What do you know how to do? — IRTFM, Nov 09 '17 at 16:59

You-leee · Accepted Answer · 2017-11-09T18:55:56.410

I think this tutorial fits your problem perfectly: http://www.theanalysisfactor.com/r-tutorial-glm1/

If you use it on your data, it would look something like this:

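# Toy data from the question (four rows only, so the fit is illustrative at best)
sex <- factor(c("m", "f", "f", "m"))
hair_color <- factor(c("red", "black", "brown", "yellow"))
credit_score <- c(256, 133, 33, 123)
# Levels sort alphabetically: "b" (bad) is the reference, "g" (good) is modelled
loan_status <- factor(c("b", "b", "b", "g"))

data <- data.frame(sex, hair_color, credit_score, loan_status)

# Logistic regression: models the probability that loan_status is "g"
model <- glm(formula = loan_status ~ sex + hair_color + credit_score,
             data = data,
             family = "binomial")

# Predicted probability of "g" for a new customer
predict(object = model,
        newdata = data.frame(sex = "f", hair_color = "yellow", credit_score = 100),
        type = "response")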
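
With only four rows and a different hair color for every customer, the toy data cannot support all the coefficients in the formula, so R will fit it with warnings and the resulting probability should not be taken seriously. As a rough sketch, here are the same steps on simulated data, applied to the "male, red hair" customer from the question. The data frame `sim`, the seed, the sample size and the effect sizes are all made up for illustration; only `glm()` and `predict()` are taken from the code above. Because credit_score is in the model, a value for it has to be supplied as well.

```r
# Hypothetical simulated data: column names follow the example above,
# but every value and effect size here is an assumption for illustration.
set.seed(42)
n <- 500
sim <- data.frame(
  sex          = factor(sample(c("f", "m"), n, replace = TRUE)),
  hair_color   = factor(sample(c("black", "brown", "red", "yellow"), n, replace = TRUE)),
  credit_score = round(runif(n, 0, 300))
)

# Assumed "true" relationship used only to generate the outcome
p_true <- plogis(-2 + 0.015 * sim$credit_score + 0.3 * (sim$sex == "f"))
sim$loan_status <- factor(ifelse(runif(n) < p_true, "g", "b"), levels = c("b", "g"))

# Same model as above, fitted on enough rows to be estimable
fit <- glm(loan_status ~ sex + hair_color + credit_score,
           data = sim, family = binomial)

# Probability of "g" (the second factor level) for a new male, red-haired customer
new_customer <- data.frame(sex = "m", hair_color = "red", credit_score = 256)
p_new <- predict(fit, newdata = new_customer, type = "response")
p_new
```

`type = "response"` returns the probability itself; without it, `predict()` returns the value on the log-odds scale.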

edited Nov 09 '17 at 18:55

answered Nov 09 '17 at 17:41


Thank you so much for your help! But what if I only need the case where sex is "f"? Should I use something like marginal distributions? – DIoo Nov 09 '17 at 19:45
I don't really get the question. The model above is trained on both male and female examples, since sex is one of the factors that determine the loan status. If you want to predict for females only, you simply pass only female inputs to the predict function. If you don't want the model to be influenced by sex, or you train it on female examples only, you won't need the sex variable. I would suggest doing some research on how the glm model/function works; that will make things clearer. Hope this helps! – You-leee Nov 09 '17 at 20:20
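
One possible reading of "only when sex is f" is a single probability for female customers as a group. A minimal sketch of that idea, assuming simulated data (the data frame `sim`, the seed and the effect sizes are made up): predict for every female row and average the predictions, then compare with the raw share of "g" among females.

```r
# Hypothetical simulated data; all values are assumptions for illustration.
set.seed(1)
n <- 500
sim <- data.frame(
  sex          = factor(sample(c("f", "m"), n, replace = TRUE)),
  hair_color   = factor(sample(c("black", "brown", "red", "yellow"), n, replace = TRUE)),
  credit_score = round(runif(n, 0, 300))
)
p_true <- plogis(-2 + 0.015 * sim$credit_score + 0.3 * (sim$sex == "f"))
sim$loan_status <- factor(ifelse(runif(n) < p_true, "g", "b"), levels = c("b", "g"))

fit <- glm(loan_status ~ sex + hair_color + credit_score,
           data = sim, family = binomial)

# Keep only the female customers
females <- subset(sim, sex == "f")

# Model-based: average the predicted probability of "g" over all female rows
p_female_model <- mean(predict(fit, newdata = females, type = "response"))

# Model-free: observed share of "g" among female customers
p_female_raw <- mean(females$loan_status == "g")

c(model = p_female_model, raw = p_female_raw)
```

On the data used for fitting, the two numbers should agree closely, because sex enters the logistic model as a factor alongside an intercept. The model-based average becomes more interesting when `females` is replaced by a different set of customers, for example new applicants.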
Thank you very much again! Let me change the question: based on the data, how could we find the best combination for getting a "good"? For example, maybe a female with yellow hair and a credit score of 100 has the highest probability of getting a "good". If we have a lot of categorical variables, how can we decide on the best combination? – DIoo Nov 09 '17 at 21:29
You have to maximize the inverse logit (look up the `binomial` family function) of the linear equation whose weights glm has estimated. Just call `summary(model)` and you will see the estimated values for the intercept and the weights (coefficients). So you have to maximize a function like this: `probability = 1/(1 + exp(-(w0 + w1*x1 + w2*x2 + ... + wn*xn)))`, where w1..wn are the estimated weights and w0 is the intercept. – You-leee Nov 10 '17 at 00:32
This may be helpful for you: https://stats.stackexchange.com/questions/20835/find-the-equation-from-generalized-linear-model-output – You-leee Nov 10 '17 at 00:34
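
The formula in the comment above can be checked by hand, and the "best combination" can be found by brute force over a grid of candidate customers. This is only a sketch on simulated data: `sim`, the seed, the effect sizes and the candidate credit scores (50, 150, 250) are assumptions, not part of the original question.

```r
# Hypothetical simulated data; all values are assumptions for illustration.
set.seed(7)
n <- 500
sim <- data.frame(
  sex          = factor(sample(c("f", "m"), n, replace = TRUE)),
  hair_color   = factor(sample(c("black", "brown", "red", "yellow"), n, replace = TRUE)),
  credit_score = round(runif(n, 0, 300))
)
p_true <- plogis(-2 + 0.015 * sim$credit_score + 0.3 * (sim$sex == "f"))
sim$loan_status <- factor(ifelse(runif(n) < p_true, "g", "b"), levels = c("b", "g"))

fit <- glm(loan_status ~ sex + hair_color + credit_score,
           data = sim, family = binomial)

# 1) The formula by hand: probability = 1 / (1 + exp(-(w0 + w1*x1 + ...)))
new_customer <- data.frame(
  sex          = factor("f", levels = levels(sim$sex)),
  hair_color   = factor("yellow", levels = levels(sim$hair_color)),
  credit_score = 100
)
# Dummy-coded row (x1..xn, plus a 1 for the intercept w0)
X <- model.matrix(~ sex + hair_color + credit_score, data = new_customer)
eta <- as.numeric(X %*% coef(fit))        # w0 + w1*x1 + ... + wn*xn
p_manual  <- 1 / (1 + exp(-eta))          # same as plogis(eta)
p_predict <- as.numeric(predict(fit, newdata = new_customer, type = "response"))

# 2) Best combination: score every candidate and keep the highest probability
grid <- expand.grid(
  sex          = levels(sim$sex),
  hair_color   = levels(sim$hair_color),
  credit_score = c(50, 150, 250)          # assumed candidate scores
)
grid$p_good <- predict(fit, newdata = grid, type = "response")
best <- grid[which.max(grid$p_good), ]
best
```

In a model without interaction terms, the best combination is simply the best level of each variable taken separately, so the grid mainly pays off once interactions such as `sex:hair_color` are added to the formula.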
