
I have created a machine learning model using Python. By default, a Random Forest classifier uses 0.5 as the threshold to assign "YES" or "NO" (i.e., if the predicted probability for a record is above 50% it is labelled "YES", otherwise "NO").
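For context, this 0.5 cutoff is what scikit-learn's `predict` effectively applies for a binary problem, while the probabilities themselves come from `predict_proba`. A minimal sketch (the generated data and all names below are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data; substitute your own features and labels.
X, y = make_classification(n_samples=1000, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

proba_yes = rf.predict_proba(X)[:, 1]                     # P(positive class) for each row
labels_default = np.where(proba_yes >= 0.5, "YES", "NO")  # essentially what predict() does at the default cutoff
```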

Therefore, I want to know how to determine the optimum threshold for the trained model (i.e., at what cutoff value we get the maximum number of correct "YES" predictions) so that I can improve the model's performance.

In R, people use a loop to determine the optimum threshold, so I wanted to know how we can do the same in Python.

Below is the R code that does this:

# Requires the caret package for confusionMatrix()
library(caret)

# Sensitivity, specificity and accuracy at a given probability cutoff
perform_fn_rf <- function(cutoff) {
  predicted_response <- as.factor(ifelse(rf_pred[, 2] >= cutoff, "YES", "NO"))
  conf <- confusionMatrix(predicted_response, train_validation$Outcome.Status, positive = "YES")
  acc  <- conf$overall[1]
  sens <- conf$byClass[1]
  spec <- conf$byClass[2]
  OUT_rf <- t(as.matrix(c(sens, spec, acc)))
  colnames(OUT_rf) <- c("sensitivity", "specificity", "accuracy")
  return(OUT_rf)
}
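For reference, a rough Python sketch of the same loop with scikit-learn (the generated data and the names `rf`, `X_val`, `y_val`, `proba_yes` are illustrative, not from the project described above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def metrics_at_cutoff(y_true, proba_yes, cutoff):
    """Return (sensitivity, specificity, accuracy) at a probability cutoff."""
    pred = (proba_yes >= cutoff).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / len(y_true)
    return sens, spec, acc

# Illustrative data and model; replace with your own.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba_yes = rf.predict_proba(X_val)[:, 1]

# Evaluate a grid of cutoffs, mirroring the R loop above.
cutoffs = np.arange(0.01, 1.0, 0.01)
results = np.array([metrics_at_cutoff(y_val, proba_yes, c) for c in cutoffs])

# One common rule: pick the cutoff where sensitivity and specificity are closest;
# any other business-cost criterion can be substituted here.
best = cutoffs[np.argmin(np.abs(results[:, 0] - results[:, 1]))]
print("chosen cutoff:", best)
```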
Rajan Sharma
  • Is there anything that prevents you from doing simple cross validation of your results? – user2653663 Jul 30 '19 at 09:42
  • @akhetos I have tried my best but did not find anything in Python that helps find the threshold in the same way as the R example above. – Sagar Jaiswal Jul 30 '19 at 10:52
  • @user2653663 I performed cross-validation using a train/test split approach. I have about 700,000 rows, and there are thousands of rows whose predicted probability is 40-45%, which the model labels "LOSS" (because the default random-forest threshold is 50%) but which in reality turned into "WIN". So I want to know the optimum cutoff value at which my model returns the maximum number of correct "WIN" predictions, like people have done in R using the loop. – Sagar Jaiswal Jul 30 '19 at 11:14
  • So you want a metric that takes in a probability of a True/False response and scores the accuracy? Changing the binary selection doesn't seem to be an optimal way of doing this and cross entropy is the usual choice. See [wikipedia](https://en.wikipedia.org/wiki/Cross_entropy) or [this post](https://stackoverflow.com/questions/52134869/why-softmax-cross-entropy-with-logits-v2-return-cost-even-same-value/52135066#52135066). – user2653663 Jul 30 '19 at 12:48
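(The cross entropy mentioned in the last comment is available in scikit-learn as `log_loss`; a minimal illustration with made-up labels and probabilities:)

```python
from sklearn.metrics import log_loss

# Made-up true labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 1, 0]
proba_yes = [0.9, 0.2, 0.45, 0.7, 0.1]

# Cross entropy scores the probabilities directly, with no cutoff involved.
print(log_loss(y_true, proba_yes))
```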

1 Answer


Each observation truly belongs to either the positive or the negative class; as you change the threshold, the result shifts between sensitivity and specificity, so the ROC curve helps you identify a suitable cutoff. Which threshold you choose depends on the use case. For example, if you want to predict a virus outbreak, the threshold should be chosen so that false negatives are minimal, to avoid missing an outbreak. Please refer to ROC and AUC for this (video).
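To make this concrete, scikit-learn's `roc_curve` returns the candidate thresholds directly, so you can pick one according to your costs. A minimal sketch (the generated data and model are placeholders for your own):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative data and model; substitute your own.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
proba_yes = rf.predict_proba(X_val)[:, 1]

fpr, tpr, thresholds = roc_curve(y_val, proba_yes)
print("AUC:", roc_auc_score(y_val, proba_yes))

# Youden's J statistic (tpr - fpr) is one common rule for picking a single cutoff.
# If false negatives are costly (e.g. outbreak prediction), pick instead the
# lowest threshold whose tpr meets the sensitivity you need.
best_threshold = thresholds[np.argmax(tpr - fpr)]
print("threshold maximising Youden's J:", best_threshold)
```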

LOrD_ARaGOrN