I would like to ask for help with my project. My goal is to get a ROC curve from an existing logistic regression.

First of all, here is what I'm analyzing.

glm.fit <- glm(Severity_Binary ~ Side + State + Timezone + Temperature.F. + Wind_Chill.F. + Humidity... + Pressure.in. + Visibility.mi. + Wind_Direction + Wind_Speed.mph. + Precipitation.in. + Amenity + Bump + Crossing + Give_Way + Junction + No_Exit + Railway + Station + Stop + Traffic_Calming + Traffic_Signal + Sunrise_Sunset , data = train_data, family = binomial)

glm.probs <- predict(glm.fit,type = "response")

glm.probs = predict(glm.fit, newdata = test_data, type = "response")
glm.pred = ifelse(glm.probs > 0.5, "1", "0")

This part works fine: I am able to show a table of predictions and the mean result. Here is where my problem starts. I'm using the pROC library, but I am open to using anything else you can help me with. My test_data has approximately 975 rows, but the variable proc has only 3 sensitivity/specificity values.

library(pROC)
proc <- roc(test_data$Severity_Binary,glm.probs) 

test_data$sens <- proc$sensitivities[1:975] 
test_data$spec <- proc$specificities[1:975]

ggplot(test_data, aes(x=spec, y=sens)) + geom_line()

Here's what I have as a result:

[image: resulting plot]

With Warning message:

Removed 972 row(s) containing missing values (geom_path).

As I found out, proc has only 3 values as I said.


benson23
  • Can you provide `dput(head(train_data,10))` and `dput(head(test_data,10))`? – langtang Mar 05 '22 at 13:59
  • I posted it as answer, it is too long. – fararmaoholcezoltar Mar 05 '22 at 14:25
  • sorry, I can't work with that.. didn't realize you had all those factors.. You should delete that "answer" yourself, as it is not an answer. If you can convert all your factors to numerics, and then dput the first 6-10 rows **only of the columns in your model!** I could take another look – langtang Mar 05 '22 at 15:20

2 Answers

You can't (and shouldn't) assign the sensitivity and specificity to the data. They are summary statistics, computed once per threshold, so they do not have one value per row of your data.

Specifically, these two lines are wrong and make no sense at all:

test_data$sens <- proc$sensitivities[1:975] 
test_data$spec <- proc$specificities[1:975]

Instead, you must either save them to a new data.frame or use one of the existing functions, such as ggroc:

ggroc(proc)
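
For the "new data.frame" route, here is a minimal sketch. It assumes simulated stand-in data (the names `x`, `y`, `fit`, `probs`, `roc_obj` and `roc_df` are made up for illustration and are not from the question). The idea is only that the ROC coordinates go into their own data frame, with one row per threshold rather than one row per observation.

```r
library(pROC)
library(ggplot2)

set.seed(1)

# Simulated stand-in for the real data (assumption, not the asker's dataset)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))

fit <- glm(y ~ x, family = binomial)
probs <- predict(fit, type = "response")

# ROC object built from the predicted probabilities
roc_obj <- roc(y, probs, quiet = TRUE)

# One row per threshold, kept separate from the original data
roc_df <- data.frame(
  threshold = roc_obj$thresholds,
  sens = roc_obj$sensitivities,
  spec = roc_obj$specificities
)

# geom_path keeps the points in threshold order
p <- ggplot(roc_df, aes(x = 1 - spec, y = sens)) + geom_path()
print(p)
```

The number of rows in `roc_df` follows the number of distinct predicted values, not the number of rows in the data, which is why indexing with `[1:975]` pads the result with missing values.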
Calimo
  • As I found out, even ggroc(proc) couldn't help me; the result is still the same, which showed me that proc has only 3 values for sensitivities/specificities. – fararmaoholcezoltar Mar 06 '22 at 20:24
If you consider what the ROC curve does, there is no reason to expect it to have the same dimensions as your dataframe. It provides summary statistics of your model performance (sensitivity, specificity) evaluated on your dataset for different thresholds in your prediction.
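
To make that concrete, here is a small base R sketch that computes sensitivity and specificity by hand at a few thresholds. The data are simulated and the names (`truth`, `score`, `thresholds`, `roc_points`) are illustrative assumptions, not part of the question.

```r
set.seed(42)

# Simulated labels and scores (assumption: stand-ins for real predictions)
truth <- rbinom(50, 1, 0.5)
score <- ifelse(truth == 1, rbeta(50, 3, 2), rbeta(50, 2, 3))

# A few thresholds chosen by hand for illustration
thresholds <- c(0.25, 0.5, 0.75)

# For each threshold, classify and summarise against the true labels
roc_points <- t(sapply(thresholds, function(th) {
  pred <- as.integer(score > th)
  c(threshold = th,
    sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}))

print(roc_points)
```

There are 50 observations but only 3 rows in `roc_points`, one per threshold, so the result cannot be attached row by row to the original data.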

Usually you would expect some more nuance on the curve (more than the 3 datapoints at thresholds -Inf, 0.5, Inf). You can look at the distribution of your glm.probs: this ROC curve indicates that all predictions are either 0 or 1, with very little in between (hence only one threshold at 0.5 on your curve). [This could also mean that you unintentionally used your binary glm.pred for calculating the ROC curve, and not glm.probs as shown in the question (?)]
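
One way to check both possibilities is sketched below on simulated data (all names here, such as `probs`, `hard_pred`, `roc_soft` and `roc_hard`, are assumptions for illustration). It inspects the spread of the predicted probabilities and compares the ROC built from probabilities with the one built from 0/1 class predictions.

```r
library(pROC)

set.seed(7)

# Simulated stand-in data (assumption, not the asker's dataset)
n <- 300
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))

fit <- glm(y ~ x, family = binomial)
probs <- predict(fit, type = "response")

# Check how spread out the predicted probabilities are
print(summary(probs))
print(length(unique(probs)))

# Hard 0/1 class predictions at a 0.5 cut-off
hard_pred <- as.numeric(probs > 0.5)

roc_soft <- roc(y, probs, quiet = TRUE)      # from probabilities
roc_hard <- roc(y, hard_pred, quiet = TRUE)  # from binary predictions

print(length(roc_soft$thresholds))
print(roc_hard$thresholds)
```

If `length(unique(glm.probs))` on the real data turns out to be 2, or the object passed to `roc()` is really the thresholded prediction, a 3-point curve is the expected outcome.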

This seems to be more of an issue with your model than with your code. Here is an example from a random different dataset, using the same steps you took (glm(..., family = binomial), then predict(..., type = "response")). This produces a ROC curve with 333 steps for ~1300 datapoints.

PS: Ignore the fact that this is evaluated on training data; the point is that the code looks alright up to the point of generating the ROC curve.

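library(pROC)

# dftitanic: a Titanic passenger data frame with columns survived, passengerClass, sex, age
m1 <- glm(survived ~ passengerClass + sex + age, data = dftitanic, family = binomial)
# Build the ROC from predicted probabilities, not from thresholded classes
myroc <- roc(dftitanic$survived, predict(m1, dftitanic, type = "response"))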

plot(myroc)

[image: ROC curve plot]

pholzm