
I have the following conceptual problem which I can't get my head around.

Below is an example of survey data with a time column that indicates how long someone needed to respond to a certain question.

The idea is to clean (i.e. remove) cases whose response time falls below some cutoff. Now I'm interested in how the amount of cleaning would change based on this threshold, i.e. what would happen if I increased it, and what would happen if I decreased it.

So my idea was to just create a ROC curve (or use other model metrics) to get a visual cue about a potential threshold. The problem is that I don't have a machine-learning-like model that would give me class probabilities. So I was wondering whether there's any way to create a ROC curve with this type of data nonetheless. I had the idea of looping through my data at maybe 100 different thresholds, calculating false and true positive rates at each threshold, and then making a simple line plot, but I was hoping for a more elegant solution that doesn't require a loop.

Any ideas?

example data:

  • time column indicates the time needed per case
  • truth column indicates the current decision I want to compare against
  • predicted column indicates the cleaning decision if I cut at a time threshold of 2.5s. This is what I need to change/loop through.

library(dplyr)  # for %>%, mutate() and if_else()

set.seed(3)
df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth     = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))
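For concreteness, the brute-force loop I have in mind would look roughly like this (a sketch over 100 evenly spaced candidate thresholds, using the same example data):

```r
# sketch of the brute-force approach: TPR/FPR at 100 candidate thresholds
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

thresholds <- seq(min(df$time), max(df$time), length.out = 100)
tpr <- fpr <- numeric(length(thresholds))
for (i in seq_along(thresholds)) {
  pred   <- ifelse(df$time < thresholds[i], "cleaned", "final")
  tpr[i] <- mean(pred[df$truth == "cleaned"] == "cleaned")  # true positive rate
  fpr[i] <- mean(pred[df$truth == "final"]   == "cleaned")  # false positive rate
}
plot(fpr, tpr, type = "l")
```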
deschen
  • As you need to calculate the ROC point for each threshold, I don't see an alternative to some variant of looping. – CIAndrews Sep 14 '21 at 06:54

2 Answers


So my idea was to just create a ROC curve

Creating a ROC curve is as easy as

library(pROC)
set.seed(3)
data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
           truth     = rep(c("cleaned", "final"), each = 5)) |>
    roc(truth, time) |>
    plot()

[ROC curve plot]

The problem is that I don't have a machine-learning-like model that would give me class probabilities.

Sorry, I do not understand what is machine-learning-like about the question.

I had the idea of just looping through my data at maybe 100 different thresholds

There is no point in looping over 100 possible thresholds if you only have 10 observations. The sensible cutoffs are the nine that sit between your sorted time values. You can get those from roc:

set.seed(3)
df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth     = rep(c("cleaned", "final"), each = 5))

thresholds <- roc(df, truth, time)$thresholds
print(thresholds)

or

> print(thresholds)
 [1]     -Inf 1.195612 1.739608 1.968531 2.155908 2.329745 2.561073
 [8] 3.093424 3.969994 4.586341      Inf

What exactly the term "looping" implies, and whether you want to exclude just for and while loops or whatever else you consider a loop, needs a precise definition. Is c(1, 2, 3, 4) * 5 a loop? Either way, there will be a loop running under the hood.
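If the goal is merely to avoid an explicit R-level loop, the midpoint thresholds and their TPR/FPR pairs can also be computed with vectorized base R (a sketch; as noted, this is still iteration under the hood):

```r
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

# candidate cutoffs: -Inf, the nine midpoints between sorted times, Inf
t_sorted   <- sort(df$time)
thresholds <- c(-Inf, head(t_sorted, -1) + diff(t_sorted) / 2, Inf)

# 10 x 11 logical matrix: is observation i below threshold j?
below <- outer(df$time, thresholds, "<")
tpr   <- colMeans(below[df$truth == "cleaned", ])  # true positive rate per cutoff
fpr   <- colMeans(below[df$truth == "final", ])    # false positive rate per cutoff
plot(fpr, tpr, type = "s")
```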

Bernhard
  • Interesting. However, I'm a bit surprised how the function determines the cleaned/final class at different thresholds, i.e. the predicted class. The function doesn't have any information on what to do e.g. at threshold 1.968531 (4th value from your thresholds example). – deschen Sep 14 '21 at 08:09
  • As for the data itself, it's just an example with 10 cases. It could be 100k in real life. And with "machine-learning-like" I just meant that I need to compare the truth to some "predicted" class, as is usually done in train/test settings in machine learning. – deschen Sep 14 '21 at 08:11
  • The prediction method is implicit to the ROC: if we assume values below a threshold to be `cleaned` and those above it to be `final`, what are sensitivity and specificity within the given sample? Draw a point for every possible threshold value. The function used by @Shibaprasadb even color-codes the thresholds along the line. Dividing data into training and test sets is not inherent to ROC. Is that what you originally wanted to do? – Bernhard Sep 14 '21 at 08:31
  • "How the function determines the cleaned/final class at different thresholds" is entirely defined by the ROC algorithm. – Calimo Sep 14 '21 at 08:36
  • Thanks for this response (although I accepted the other one with the ROCR package). I also see my conceptual misunderstanding of how the package determines the predicted class. So I guess it just uses the numeric time column, cuts at different positions, and, assuming the "truth" column is set up so that only "cleaned" appears below a certain threshold and everything else is "final", counts what happens as the threshold is shifted. – deschen Sep 14 '21 at 12:36
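That counting logic can be checked by hand without any package: pick one of the thresholds pROC reports and tabulate the decisions it implies (a sketch on the example data):

```r
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

th   <- 2.561073                         # one of pROC's reported thresholds
pred <- ifelse(df$time < th, "cleaned", "final")

# sensitivity/specificity implied by cutting at this threshold
sens <- mean(pred[df$truth == "cleaned"] == "cleaned")
spec <- mean(pred[df$truth == "final"]   == "final")
c(sensitivity = sens, specificity = spec)
```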

You can use ROCR for this, too:

library(ROCR)
library(dplyr)  # for %>% and mutate()

set.seed(3)
df <- data.frame(time      = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth     = rep(c("cleaned", "final"), each = 5)) %>%
  mutate(predicted = if_else(time < 2.5, "cleaned", "final"))

pred <- prediction(df$time, df$truth)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE)

[ROC curve plot]

You can also check the AUC value:

auc <- performance(pred, measure = "auc")
auc@y.values[[1]]

[1] 0.92
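As a package-free sanity check, the AUC equals the Mann-Whitney statistic: the proportion of (cleaned, final) pairs in which the final time exceeds the cleaned time, with ties counting one half. A base-R sketch on the same data:

```r
set.seed(3)
df <- data.frame(time  = c(2.5 + rnorm(5), 3.5 + rnorm(5)),
                 truth = rep(c("cleaned", "final"), each = 5))

cleaned <- df$time[df$truth == "cleaned"]
final   <- df$time[df$truth == "final"]

# pairwise comparisons of final vs. cleaned response times (ties count 0.5)
auc_hand <- mean(outer(final, cleaned, ">") + 0.5 * outer(final, cleaned, "=="))
auc_hand
# [1] 0.92  (matches the value reported by ROCR and pROC)
```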

Cross-checking the AUC value with pROC:

library(pROC)

roc(df$truth, df$time)

Call:
roc.default(response = df$truth, predictor = df$time)

Data: df$time in 5 controls (df$truth cleaned) < 5 cases (df$truth final).
Area under the curve: 0.92

It is the same in both cases!

Shibaprasadb
  • Thanks for this answer. I like the plot version of the ROCR package, so I will accept this as my preferred answer, although I usually prefer when package functions can be called in a tidyverse pipe (and I haven't yet managed to get the pred/perf calculation into my pipe). – deschen Sep 14 '21 at 12:32