I have a very large data set (ds). One of its columns is popularity, a factor with the levels 'High' and 'Low'.

I split the data 70% / 30% to create a training set (ds_tr) and a test set (ds_te).

I fitted the following logistic regression model:

mdl <- glm(formula = popularity ~ . -url , family= "binomial", data = ds_tr )

Then I computed the predicted probabilities (I will do the same for ds_te):

y_hat <- predict(mdl, newdata = ds_tr, type = 'response')

I want the precision and recall values that correspond to a cutoff threshold of 0.5, so I did:

library(ROCR)
pred <- prediction(y_hat, ds_tr$popularity)
perf <- performance(pred, "prec", "rec")

The result is an object that holds many values:

str(perf)

Formal class 'performance' [package "ROCR"] with 6 slots
  ..@ x.name      : chr "Recall"
  ..@ y.name      : chr "Precision"
  ..@ alpha.name  : chr "Cutoff"
  ..@ x.values    :List of 1
  .. ..$ : num [1:27779] 0.00 7.71e-05 7.71e-05 1.54e-04 2.31e-04 ...
  ..@ y.values    :List of 1
  .. ..$ : num [1:27779] NaN 1 0.5 0.667 0.75 ...
  ..@ alpha.values:List of 1
  .. ..$ : num [1:27779] Inf 0.97 0.895 0.89 0.887 ...

How do I find the specific precision and recall values corresponding to a cutoff threshold of 0.5?

user2878881
1 Answer

Access the slots of the performance object (use @ to reach a slot and [[1]] to extract the vector from its list).

Create a data frame with every cutoff and its precision and recall:

probab.cuts <- data.frame(cut=perf@alpha.values[[1]], prec=perf@y.values[[1]], rec=perf@x.values[[1]])

You can view all associated values

probab.cuts

To select the values for a cutoff of 0.5, take the last row whose cutoff is still above 0.5 (the cutoffs in alpha.values are listed in decreasing order):

tail(probab.cuts[probab.cuts$cut > 0.5,], 1)
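
The selection above depends on the rows being sorted by decreasing cutoff. As a minimal sketch, the same lookup can be wrapped in a helper that picks the smallest cutoff above the threshold regardless of row order. The function name pr_at_cutoff and the toy data frame are hypothetical; only the column names cut, prec and rec come from probab.cuts above.

```r
# Hypothetical helper: return the row with the smallest cutoff strictly
# above `threshold`. Assumes `cuts` has the columns cut, prec and rec.
pr_at_cutoff <- function(cuts, threshold = 0.5) {
  above <- cuts[cuts$cut > threshold, , drop = FALSE]
  if (nrow(above) == 0) stop("no cutoff above the threshold")
  above[which.min(above$cut), , drop = FALSE]
}

# Toy stand-in for probab.cuts (made-up numbers, not the question's data)
toy_cuts <- data.frame(
  cut  = c(Inf, 0.9, 0.7, 0.4, 0.2),
  prec = c(NaN, 1.0, 1.0, 2/3, 0.5),
  rec  = c(0.0, 0.5, 1.0, 1.0, 1.0)
)

res <- pr_at_cutoff(toy_cuts, 0.5)
print(res)
```

With the real data the call would be pr_at_cutoff(probab.cuts, 0.5).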

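Manual check:

# Rows are the actual classes, columns are y_hat > 0.5 (FALSE, TRUE)
tab <- table(ds_tr$popularity, y_hat > 0.5)
tab[4]/(tab[4]+tab[2]) # recall: TP / (TP + FN)
tab[4]/(tab[4]+tab[3]) # precision: TP / (TP + FP)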
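
The positional indices in the manual check only work when the table is a full 2 x 2. A sketch of the same check with named indexing follows, assuming the positive class is the second factor level (glm() with family = "binomial" treats the first level as failure and the others as success). The toy labels and scores, and the names actual, y_hat_toy and positive, are made up for illustration.

```r
# Toy labels and predicted probabilities (made up for illustration)
actual <- factor(c("High", "Low", "Low", "High", "Low", "High"),
                 levels = c("High", "Low"))
y_hat_toy <- c(0.2, 0.8, 0.4, 0.7, 0.9, 0.1)

# Assumption: the second level ("Low") is the class the model predicts
positive <- levels(actual)[2]
negative <- levels(actual)[1]

# Keep both levels so the table stays 2 x 2 even if one class is never predicted
predicted <- factor(ifelse(y_hat_toy > 0.5, positive, negative),
                    levels = levels(actual))
tab <- table(actual = actual, predicted = predicted)

tp <- tab[positive, positive]
recall    <- tp / sum(tab[positive, ])   # TP / (TP + FN)
precision <- tp / sum(tab[, positive])   # TP / (TP + FP)
print(c(precision = precision, recall = recall))
```

Indexing by level name makes it explicit which class counts as positive, which the positional form leaves implicit.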
PereG
  • Thanks, but I still have a problem when trying to evaluate the model on the test set (ds_te), since y_hat is different in length than ds_te$popularity. Any thoughts? – user2878881 Jan 04 '16 at 18:07
  • In fact, it is more correct to assess the model on the test data. So compute "y_hat_test <- predict(mdl, newdata = ds_te, type = 'response')" and recalculate "pred" and "perf" from it. Finally, rerun the code in this answer with "ds_te$popularity" and the new "y_hat_test" in the table function. – PereG Jan 04 '16 at 21:00
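
The length mismatch mentioned in the comments can be reproduced on synthetic data. This is only a sketch under stated assumptions: the data frame toy, its columns x1 and x2, and the names tr, te and mdl_toy are invented, and the url column is dropped before fitting rather than in the formula, a choice made here so that predict() never needs that column.

```r
set.seed(1)
n <- 200

# Synthetic stand-in for ds (invented columns, not the question's data)
toy <- data.frame(
  url = paste0("page-", seq_len(n)),
  x1  = rnorm(n),
  x2  = rnorm(n)
)
toy$popularity <- factor(ifelse(toy$x1 + rnorm(n) > 0, "Low", "High"),
                         levels = c("High", "Low"))

# Drop the identifier column up front, then split 70% / 30%
toy <- toy[, names(toy) != "url"]
idx <- sample(n, size = round(0.7 * n))
tr  <- toy[idx, ]
te  <- toy[-idx, ]

mdl_toy <- glm(popularity ~ ., family = "binomial", data = tr)

# Without newdata, predict() returns the fitted values for the training rows
y_hat_tr <- predict(mdl_toy, type = "response")

# With newdata, it returns one probability per row of the test set
y_hat_te <- predict(mdl_toy, newdata = te, type = "response")

print(c(train = length(y_hat_tr), test = length(y_hat_te)))
```

Because y_hat_te has one entry per test row, it can be passed together with te$popularity to prediction() or table() without a length error.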