
I'm trying to assess the performance of a simple prediction model in R by binning the predictions into defined intervals and then comparing them with the corresponding (binned) actual values.

I have two vectors actual and predicted as shown:

> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

I need to perform binning here. First, the values of 'actual' are discretized into levels, say: 0-5: Level 1, 6-10: Level 2, ..., 41-45: Level 9.

Now I have to bin the values of 'predicted' into the same buckets. I tried to achieve this using the cut() function in R:

binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)

However, the second element of predicted (98.01) is labelled 5, even though it does not actually fall in the desired interval. I don't think using a different binCount for predicted will help. Can anyone suggest a solution?

Sailesh
  • `cut(x=predicted, breaks=binCount)` divides the range of `x` into binCount equally sized intervals (type `summary(predicted)` to see that range). Since `predicted` includes 98.01, by definition it will be in an interval (and since it's the max value, it will be in interval 5). **I think you wanted to constrain the breaks of `binPred` to be the same as `binActual`, i.e. 5 bins from 0.00 up to 41.00.** Then in which bin should `cut(predicted, ...)` put values > 41? Should it give NA? An open interval on the right for >= 41? A separate extra bin `[41,∞)`? You need to decide what result you want. – smci Sep 17 '18 at 02:30
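
To make the choices raised in this comment concrete, here is a minimal sketch. It assumes the 5-wide levels from the question (0-5, 6-10, ..., 41-45) rather than five bins up to 41, and the names `fixedBreaks`, `openBreaks`, `binPredNA` and `binPredOpen` are illustrative only. It first shows that a single number passed as `breaks` yields different intervals for each vector, then two possible ways to treat values above the last breakpoint.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# With a single number as `breaks`, cut() spans each vector's own range,
# so the two sets of intervals are not the same.
levels(cut(actual, breaks = 5))
levels(cut(predicted, breaks = 5))

# Assumed shared breakpoints: [0,5], (5,10], ..., (40,45]
fixedBreaks <- seq(0, 45, by = 5)
binActual <- cut(actual, breaks = fixedBreaks, include.lowest = TRUE, labels = FALSE)

# Option 1: values outside 0-45 become NA
binPredNA <- cut(predicted, breaks = fixedBreaks, include.lowest = TRUE, labels = FALSE)

# Option 2: one extra open-ended bin (level 10) for values above 45
openBreaks <- c(fixedBreaks, Inf)
binPredOpen <- cut(predicted, breaks = openBreaks, include.lowest = TRUE, labels = FALSE)
```

With `labels = FALSE`, `cut()` returns integer bin codes instead of a factor, which makes the two vectors easy to compare once a policy for out-of-range values has been chosen.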

2 Answers


I'm not 100% sure what you want to do.

From what I understand, you want to return, for each element of each vector, the class it falls in, using a single set of classes that covers every value in both vectors, actual and predicted.

If that is the goal, note that your script first creates classes covering only the range of actual and classifies your first vector with them.

It then creates a new set of classes for predicted, so the two classifications are no longer the same.

Assuming I understood correctly, I would write instead:

actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Pool both vectors so the breakpoints cover every value in either one
temporary <- c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
# binCount breakpoints delimit binCount - 1 intervals (4 here)
s <- seq(mini, maxi, length.out = binCount)

binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))

It gives:

> binActual
 [1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4

> binPred
 [1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
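
Note that the output above has four levels although `binCount` is 5, because `binCount` breakpoints delimit only `binCount - 1` intervals. If exactly `binCount` bins are wanted, one possible variant (a sketch, not the answer's original code; `pooled` is an illustrative name) uses `binCount + 1` breakpoints:

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

binCount <- 5
pooled <- c(actual, predicted)

# binCount + 1 breakpoints delimit exactly binCount equal-width intervals
s <- seq(min(pooled), max(pooled), length.out = binCount + 1)

binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:binCount)
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:binCount)

nlevels(binActual)  # 5
```

Because the breakpoints now differ from those used above, the bin assigned to a given value can change; for example, 41 moves from bin 2 to bin 3.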

I'm not sure this is what you're looking for, so let me know and I may be able to help further. Best wishes.

probaPerception

Is this what you want?

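# Column 1 holds the lower bounds (0, 5, ..., 40); column 2 the upper bounds (5, 10, ..., 45)
intervals <- cbind(seq(0, 40, length.out = 9), seq(5, 45, length.out = 9))

# Return the row index of the half-open interval [lower, upper) containing
# each value, or NA for values outside [0, 45)
cutFixed <- function(x, intervals) {
    sapply(x, function(xi) ifelse(xi < min(intervals) | xi >= max(intervals), NA, which(xi >= intervals[,1] & xi < intervals[,2])))
}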

This gives the following result:

> cutFixed(actual, intervals)
 [1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
 [1]  1 NA  1  1  7  1  1  1  1  1  1  3  1  2  1  1  1  2  1
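
For comparison, the same left-closed intervals [0,5), [5,10), ..., [40,45) can probably be obtained from base R's `cut()` alone by passing explicit breakpoints with `right = FALSE`. This is a sketch rather than part of the answer above; `fixedBreaks`, `cutActual` and `cutPred` are illustrative names.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# Ten breakpoints 0, 5, ..., 45 delimit nine intervals
fixedBreaks <- seq(0, 45, by = 5)

# right = FALSE makes the intervals closed on the left, open on the right;
# labels = FALSE returns integer bin codes; out-of-range values become NA
cutActual <- cut(actual, breaks = fixedBreaks, right = FALSE, labels = FALSE)
cutPred <- cut(predicted, breaks = fixedBreaks, right = FALSE, labels = FALSE)
```

With these half-open intervals a value of exactly 5 lands in bin 2. If 5 should belong to the first level, as the question's "0-5: Level 1" suggests, dropping `right = FALSE` and adding `include.lowest = TRUE` gives [0,5], (5,10], ..., (40,45] instead.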
Lars Lau Raket