
Hi, I am pretty new to programming in R and I am having trouble plotting a ROC curve without using any package.

I generated my data using:

library(PearsonDS)  # rpearsonIII comes from the PearsonDS package
d=rpearsonIII(100,0.5,360,20)
nd=rnorm(100,450,25)

I need a vector with values < 400 for d and > 400 for nd, so I did:

spec = (cumsum(nd[nd>400])/sum(nd))*100
sens = (cumsum(d[d<400])/sum(nd))*100

and then I plotted like this:

plot(1-spec,sens)

but the plot was nothing like I expected it to be.

Edit: Thanks to the advice given, my code now looks like this:

sc2 = c(rnorm(50,450,25),rpearsonIII(50,0.5,360,20))
scF = sc2 < 395

thresholds <- sort(sc2)

pos <- sum(scF);pos
neg <- sum(!scF);neg

tn <- cumsum(!scF);tn
spec <- tn/neg;spec

tp <- pos - cumsum(scF);tp
sens <- tp/pos;sens

plot(1 - spec, sens, type = "l", col = "red", 
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(c(0,0),c(1,1))

The plotted ROC curve looks like this: [ROC curve plot]

My problem now is that if I change the order of the generated data (rnorm and rpearsonIII), the curve is reversed.

xpotion
  • Your lengths for spec and sens are different. They have to be equal to use plot() like you are. – Rahul Jan 20 '17 at 00:46
  • What are the scores for your data? To make an ROC curve, you need actual classes (which seems to be your `scF` value) as well as a score you gave each element to predict the class. – Barker Jan 20 '17 at 18:50
  • The scores for my data are my `sc2` values. Maybe I am misunderstanding what the scores are... – xpotion Jan 20 '17 at 21:36
  • You may want to visit [this page](http://stats.stackexchange.com/questions/126048/) for more of an explanation, but for an ROC curve, you need two things: the "truth value" (`actuals`) and some kind of numeric predictor you are using to predict the truth value (`score`), usually the output of a classifier. ROC is a visualization of how well the predictor corresponds with the truth. For example, you might use height as a predictor for "is adult" (age >= 18); see the small sketch after these comments. In your code, since `scF` is derived directly from `sc2`, it would be like using age itself to predict "is adult". Your ROC will be perfect. – Barker Jan 20 '17 at 21:56
  • Thank you very much for your advice @Barker! One last question, could you give a suggestion on how to calculate the area under the curve (AUC), after obtaining the ROC curve using the method you described here? – xpotion Jan 21 '17 at 16:56
  • Added AUC to the answer below – Barker Jan 23 '17 at 17:34
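To make the actuals-versus-scores distinction from the comments concrete, here is a tiny made-up sketch of the height/age example (the data and names are invented purely for illustration):

# Made-up illustration: "actuals" is the truth, "scores" is the numeric predictor
set.seed(42)
age    <- sample(5:60, 30, replace = TRUE)            # used only to define the truth
height <- 100 + 4 * pmin(age, 18) + rnorm(30, 0, 8)   # taller cases tend to be adults

actuals <- age >= 18   # the "truth value": is adult
scores  <- height      # the predictor whose ranking the ROC curve evaluates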

1 Answer


I don't know what rpearsonIII is, so I am just going to make some sample random data with the actual classes `actuals` as well as the scores for the predictions, `scores`.

set.seed(100)
actuals <- sample(c(TRUE,FALSE), 100, replace = TRUE)
scores <- runif(100,-1,1)

The long version with explanation

If in your data the actuals are strings or factors rather than logicals, you will need to convert them to logicals using:

actuals <- actuals == "positiveClass"

Next we want to order the instances based on their scores. We can do this using:

actuals <- actuals[order(scores)]

If you want to keep track of the thresholds for the sensitivities and specificities, you can keep them aligned using:

thresholds <- sort(scores)

Now we need to get our sensitivities and specificities. Sensitivity is TP/P and specificity is TN/N. Getting the total number of positives P is easy: since our actuals are logical, we can just use `sum(actuals)`. Similarly, we can get our negatives N using `sum(!actuals)`.

pos <- sum(actuals)
neg <- sum(!actuals)

First let's get our true negatives at each threshold. That is pretty easy: it is just the number of FALSE values at or below each threshold. Since our data are in order by threshold, we can calculate that (and the specificity) using:

tn <- cumsum(!actuals)
spec <- tn/neg

The number of true positives is slightly harder because we are looking for the number of positives greater than the threshold, so cumsum alone won't work. However, since the number above the threshold is equal to the total minus the number at or below the threshold, we can get our true positives using:

tp <- pos - cumsum(actuals)
sens <- tp/pos

Now all we need to do is plot the two.

plot(1 - spec, sens, type = "l", col = "red", 
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(0, 1)

[ROC plot]

To get the AUC of the curve, we simply need to sum the height of the curve multiplied by the width of each step. The code below computes the area along the other axis: the height is the specificity and the width is the difference in adjacent values of 1 - sensitivity, which gives the same area as summing sensitivity against steps in 1 - specificity. The `diff` function will give us the difference in adjacent values, but we need to put a 0 at the beginning so the first column gets its full width.

width <- diff(c(0, 1 - sens))
auc <- sum(spec*width)
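
As a quick sanity check (assuming there are no tied scores), this rectangle sum should agree with the rank-based (Mann-Whitney) estimate of the AUC, i.e. the proportion of positive/negative score pairs in which the positive case has the larger score:

# Rank-based check: proportion of positive/negative pairs where the positive score is larger.
# sort(scores) is aligned with the re-ordered actuals; tied scores would need a 0.5 adjustment.
sorted_scores <- sort(scores)
auc_rank <- mean(outer(sorted_scores[actuals], sorted_scores[!actuals], ">"))
all.equal(auc, auc_rank)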

The minimal code version

actuals <- actuals[order(scores)]

sens <- (sum(actuals) - cumsum(actuals))/sum(actuals)
spec <- cumsum(!actuals)/sum(!actuals)

plot(1 - spec, sens, type = "l", col = "red", 
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(0, 1)

(auc <- sum(spec*diff(c(0, 1 - sens))))
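
To connect this back to the data in the question, here is a rough sketch under one assumption about the setup: the true class is which distribution each value was drawn from, with the higher-valued normal group treated as the positive class. Because the labels are re-ordered by score, the result no longer depends on which group is generated first, which is likely why swapping rnorm and rpearsonIII reversed your curve (your edited code never puts the labels in score order, so the cumulative sums follow the generation order instead).

library(PearsonDS)  # provides rpearsonIII, per the question

set.seed(1)
scores  <- c(rnorm(50, 450, 25), rpearsonIII(50, 0.5, 360, 20))
actuals <- rep(c(TRUE, FALSE), each = 50)  # assumption: TRUE = drawn from the normal ("nd") group

actuals <- actuals[order(scores)]          # sort the labels by score

sens <- (sum(actuals) - cumsum(actuals)) / sum(actuals)
spec <- cumsum(!actuals) / sum(!actuals)

plot(1 - spec, sens, type = "l", col = "red",
     ylab = "Sensitivity", xlab = "1 - Specificity")
abline(0, 1)

sum(spec * diff(c(0, 1 - sens)))           # AUC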
Barker
  • I used the package PearsonDS to get rpearsonIII because I need to generate data that has a gamma distribution with a location parameter. So the variable `d` that I used has data with a gamma distribution and `nd` has a normal distribution. My goal is to plot a ROC curve using data generated from those distributions. – xpotion Jan 20 '17 at 13:13