
I come from a predominantly Python + scikit-learn background, and I was wondering how one would obtain the cross-validation accuracy for a logistic regression model in R. I searched and was surprised that there's no easy way to do this. I'm looking for the equivalent of:

import pandas as pd
from sklearn.model_selection import cross_val_score  # `sklearn.cross_validation` in older versions
from sklearn.linear_model import LogisticRegression

# Assume a pandas DataFrame `dataset` and a target vector `target` exist.

scores = cross_val_score(LogisticRegression(), dataset, target, cv=10)
print(scores)

In R, I have:

model <- glm(Y ~ X, family = binomial, data = df)
summary(model)

And now I'm stuck. The deviance for my R model is 1900, implying it's a bad fit, but the Python one gives me 85% 10-fold cross-validation accuracy, which suggests it's good. That seems a bit strange, so I wanted to run cross-validation in R to see if I get the same result.

Any help is appreciated!

Sandipan Dey
John Bennet
  • the deviance on its own is not very informative, so it doesn't imply a bad fit. For running the CV, why not fit it manually, or have a look at the caret pkg – user20650 Sep 17 '16 at 18:16
  • Simply googling leads me immediately to either the caret package or cv.glm from the boot package. – joran Sep 17 '16 at 18:17

2 Answers


R version using the caret package:

library(caret)

# define training control
train_control <- trainControl(method = "cv", number = 10)

# train the model on training set
model <- train(target ~ .,
               data = train,
               trControl = train_control,
               method = "glm",
               family = binomial())

# summarize the fitted model (the CV accuracy itself is stored in model$results)
summary(model)
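To read off the cross-validated accuracy, inspect the resampling results stored on the fitted `train` object rather than `summary()` — a minimal sketch, assuming the `model` object fit above:

```r
# mean Accuracy and Kappa over the 10 folds
model$results

# per-fold metrics
model$resample
```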
Sandipan Dey
  • Just to add on, summary(model) does not show you the accuracy scores; model$results does. – Wboy Sep 18 '16 at 03:28
  • To clarify, you do not need to `createDataPartition` into separate training and testing data sets because the `train_control` and `train()` functions automatically do that in the `caret` package? – coip Feb 17 '18 at 16:30

Below I took an answer from here and made a few changes.

The changes I made were to turn it into a logit (logistic) model, add model fitting and prediction, store the CV results, and make it a fully working example.

Also note that there are many packages and functions you could use, including cv.glm() from boot.
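For example, `cv.glm()` returns a cross-validated prediction-error estimate directly; a minimal sketch, where the cost function counts misclassifications at a 0.5 cutoff, so one minus the delta is an accuracy estimate:

```r
library(boot)

data(ChickWeight)
df   <- ChickWeight
df$Y <- as.integer(df$weight > 100)

fit <- glm(Y ~ Diet, family = binomial, data = df)

# cost = misclassification rate at a 0.5 probability cutoff
cost   <- function(y, p) mean(abs(y - p) > 0.5)
cv_err <- cv.glm(df, fit, cost = cost, K = 10)

cv_err$delta[1]      # CV misclassification rate
1 - cv_err$delta[1]  # CV accuracy
```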

data(ChickWeight)

df                    <- ChickWeight
df$Y                  <- 0
df$Y[df$weight > 100] <- 1
df$X                  <- df$Diet 

set.seed(1)  # for a reproducible shuffle
df     <- df[sample(nrow(df)), ]
folds  <- cut(seq(1, nrow(df)), breaks = 10, labels = FALSE)
result <- list()

for(i in 1:10){
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData    <- df[testIndexes, ]
  trainData   <- df[-testIndexes, ]
  model       <- glm(Y ~ X, family = binomial, data = trainData)
  # predicted probabilities for the held-out fold
  result[[i]] <- predict(model, testData, type = "response")
}
result

You could add a line to calculate accuracy within the loop or just do it after the loop completes.
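For instance, with probabilities stored in `result` (i.e. predictions made with `type = "response"`), a 0.5 cutoff gives the per-fold accuracies — a sketch using the `folds` and `df` objects from above:

```r
# per-fold accuracy: threshold the stored probabilities at 0.5
# (with link-scale predictions, use a cutoff of 0 instead)
acc <- sapply(1:10, function(i) {
  idx  <- which(folds == i)
  pred <- as.integer(result[[i]] > 0.5)
  mean(pred == df$Y[idx])
})
acc        # accuracy per fold
mean(acc)  # overall cross-validated accuracy
```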

Hack-R
  • Is this suitable for logistic models? – Emmanuel Goldstein May 20 '21 at 14:38
  • @EmmanuelGoldstein Yes. I would probably just use a package if I wrote this nowadays, but if you're asking whether CV is suitable for logistic models, then definitely. They are highly suitable per ML stats and best practices as a classifier. – Hack-R May 21 '21 at 16:49
  • Thanks. How would you proceed with the results stored in result[[i]]? – Emmanuel Goldstein May 21 '21 at 17:17
  • I think this loop is lacking some form of parameter update rule, because at the moment you fit a new model in every loop iteration – kikatuso Jul 27 '22 at 14:52
  • @kikatuso That's a way of updating the parameters, no? Perhaps you can explain the distinction you're making a little bit more. In general I expect a CV model to be refit for each partition of the data. – Hack-R Jul 27 '22 at 16:09