
In addition to predicting the class labels, is it possible to also return the class probabilities for each observation in the new data when predicting?

library(caret)
knnFit <- train(Species ~ ., data = iris, method = "knn", 
                trControl = trainControl(method = "cv", classProbs = TRUE))

x <- predict(knnFit, newdata = iris)

This returns a factor of the predicted classes:

str(x)
Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

If I want the probabilities:

x <- predict(knnFit, newdata = iris, type = "prob")
head(x)
  setosa versicolor virginica
1      1          0         0
2      1          0         0
3      1          0         0
4      1          0         0
5      1          0         0
6      1          0         0

Is it possible to have caret return both the predicted classes and the probabilities in a single call? I know I can derive the classes by taking max.col of the probability table, but I wondered whether there is a built-in way to get both.

jmuhlenkamp
Doug Fir
  • Just call `predict()` twice as you've already done. If you need a single call, write a helper function. I'm not sure I understand what the problem is here. – MrFlick Sep 06 '17 at 19:57
  • @MrFlick Because I'm actually predicting on over 10M records, time is a factor, so ideally it would be done in one go. Is it possible? – Doug Fir Sep 06 '17 at 20:46
  • Nope. You can view the source at `caret::predict.train`. There is clearly an if/else branch based on type. Is it really too slow to call twice? Did you time it? I mean, you could go through all the source code and hack your own, but I'm not sure it would be that much faster unless you choose to reimplement the R functions in C++ or something. – MrFlick Sep 06 '17 at 20:57
  • Hm, OK. Well, right now I'm training in a Linux screen session and I reckon it might take a day or two, but I will report back here if predictions take a long time. They did when I experimented earlier this morning, so I used a foreach loop with parallel. Still, this answers my question; I just wondered if there was a parameter I could add. Guess not. Cheers – Doug Fir Sep 06 '17 at 21:25
  • I once used this trick to avoid calling predict twice: `predict(knnFit, newdata = iris, type = "prob") %>% mutate(names(.)[apply(., 1, which.max)])`. You can compare its speed against your method. – agenis Sep 06 '17 at 21:28
  • @DougFir Hi Doug, did you find a solution to your problem? – agenis Oct 03 '17 at 08:27
  • @agenis I just called predict twice. I used parallel processing since it was a large dataset to predict on. I looked at your suggestion but did not follow what was happening – Doug Fir Oct 03 '17 at 11:31
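
To check whether calling predict() twice is really the bottleneck, both approaches can be timed. The following is a rough sketch, assuming caret is installed; the enlarged data frame `big` and the variable names are made up for illustration, and the timings will depend on the machine and the model.

```r
library(caret)

set.seed(1)
knnFit <- train(Species ~ ., data = iris, method = "knn",
                trControl = trainControl(method = "cv", classProbs = TRUE))

# Illustrative stand-in for a large prediction set: iris repeated 200 times
big <- iris[rep(seq_len(nrow(iris)), 200), ]

# Approach 1: two separate predict() calls
t_twice <- system.time({
  cls <- predict(knnFit, newdata = big)
  prb <- predict(knnFit, newdata = big, type = "prob")
})

# Approach 2: one predict() call, classes derived from the probabilities.
# ties.method = "first" picks the first column on ties, like which.max()
t_once <- system.time({
  prb2 <- predict(knnFit, newdata = big, type = "prob")
  idx  <- max.col(as.matrix(prb2), ties.method = "first")
  cls2 <- factor(names(prb2)[idx], levels = names(prb2))
})

print(rbind(twice = t_twice, once = t_once))
# Share of rows where both approaches agree (ties may be resolved differently)
print(mean(cls == cls2))
```

If the two timings are close, the simpler two-call version is probably good enough.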

2 Answers


Turning my comment into an answer. Once you have generated the table of predicted probabilities, you don't need to run the prediction function a second time to get the classes. You can add a class column by applying which.max to each row, which is fast in my experience. For each row, this picks the name of the column (one of c("setosa", "versicolor", "virginica")) that holds the highest probability; if several columns tie, which.max takes the first one.

This gives a table with both pieces of information, as requested:

library(dplyr)
predict(knnFit, newdata = iris, type = "prob") %>% 
  mutate('class'=names(.)[apply(., 1, which.max)])
# a random sample of the resulting table:
####     setosa versicolor virginica      class
#### 18       1  0.0000000 0.0000000     setosa
#### 64       0  0.6666667 0.3333333 versicolor
#### 90       0  1.0000000 0.0000000 versicolor
#### 121      0  0.0000000 1.0000000  virginica

P.S. This uses the pipe operator %>% from the magrittr package, which dplyr re-exports. The dot . stands for the result of the previous step in the pipe.
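
For those who prefer to avoid the pipe, the same idea can be sketched in base R with max.col. The helper name `add_pred_class` and the small hand-made probability table are hypothetical; in practice `probs` would be the result of predict(knnFit, newdata = ..., type = "prob"). Note that max.col breaks ties at random by default, so ties.method = "first" is set here to mirror which.max; the model's own class prediction may resolve ties differently.

```r
# Hypothetical helper: append the most probable class and its probability
# to a data frame of class probabilities (one column per class).
add_pred_class <- function(probs) {
  m <- as.matrix(probs)
  # Column index of the row-wise maximum; first column wins on ties
  idx <- max.col(m, ties.method = "first")
  out <- as.data.frame(probs)
  out$class <- factor(colnames(m)[idx], levels = colnames(m))
  # Matrix indexing by (row, column) pairs pulls out each row's maximum
  out$max_prob <- m[cbind(seq_len(nrow(m)), idx)]
  out
}

# Stand-in for predict(knnFit, newdata = iris, type = "prob")
probs <- data.frame(
  setosa     = c(1, 0, 0),
  versicolor = c(0, 2/3, 0),
  virginica  = c(0, 1/3, 1)
)

res <- add_pred_class(probs)
print(res)
```

Because max.col works on the whole matrix at once, this avoids the row-by-row apply call, which may matter on millions of rows.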

agenis
  • Works like a charm. If you also want the value of the maximum probability, use max() instead of which.max() and store the result directly rather than indexing names(.) with it: `predict(knnFit, newdata = iris, type = "prob") %>% mutate(max_prob = apply(., 1, max))` – Agile Bean Oct 17 '18 at 12:25

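Another way to solve this, in Python rather than R. The snippet assumes an already-fitted model and predictor and validation inputs x_val:

import numpy as np

# Generate class probabilities
y_val_probs = model.predict(x_val, return_proba=True)
# Get the list of classes from the predictor
classes = predictor.preproc.get_classes()
# Convert probabilities to classes by picking the most probable class per row
y_val_pred = [classes[np.argmax(pred)] for pred in y_val_probs]
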
Jeremy Matt