
Hi guys, I need help troubleshooting the function below. I am using R.

The dataset I am using is called wages, and it comes from the ISLR package (library(ISLR), data(wages)).

Anyhow, I am trying to write a function that performs k-fold cross-validation on any general linear model.

The function's arguments are function(numberOfFolds, y, x, inputData).

y is the dependent variable, x is all the other variables in the dataset, inputData is the wages dataset, and numberOfFolds is basically k.

I have written the code below, but I am getting NaN values and I am not sure what is going wrong. Could someone please help?

my.k.fold.1<- function(numberOfFolds, y,x,inputData){
  index<-sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index<-index
  
  mse<-vector('numeric', length = numberOfFolds)
  for (n in 1:numberOfFolds) {
    data.train<-inputData[index!=n,]
    data.test<-inputData[index==n,]
    my.equation<-paste(y,paste(x, collapse = '+'),sep='~')
    formula.1<-formula(my.equation)
    model.test<-lm(formula.1, data = data.train)
    predictions<-predict(model.test, newdata=data.test)
    mse[[n]]<-mean((data.test$y-predictions)^2)
  }
  return(mse)
}

my.k.fold.1(numberOfFolds = 5, y='earn', x=c('race', 'sex', 'ed', 'height', 'age'), inputData = wages)

I would like to keep the arguments the same, so that I can pass the column names as y and x.

Tareq

2 Answers

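This happens because y holds a string. data.test$y is equivalent to data.test[["y"]]: it looks for a column literally named y, which does not exist, so it returns NULL, and the mean of the resulting empty vector is NaN. Replace it with data.test[[y]], which is equivalent to data.test$earn when y = "earn":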

my.k.fold.1<- function(numberOfFolds, y,x,inputData){
  index<-sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index<-index
  
  mse<-vector('numeric', length = numberOfFolds)
  for (n in 1:numberOfFolds) {
    data.train<-inputData[index!=n,]
    data.test<-inputData[index==n,]
    my.equation<-paste(y,paste(x, collapse = '+'),sep='~')
    formula.1<-formula(my.equation)
    model.test<-lm(formula.1, data = data.train)
    predictions<-predict(model.test, newdata=data.test)
    mse[[n]]<-mean((data.test[[y]]-predictions)^2)
  }
  return(mse)
}
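
To see the difference between $ and [[ in isolation, here is a minimal sketch on a toy data frame. The data frame df and the predictions vector are made up for illustration and are not part of the wages data.

```r
# Toy data; the column names and values are illustrative only.
df <- data.frame(earn = c(10, 20, 30), age = c(25, 35, 45))
y <- "earn"
predictions <- c(12, 18, 33)

# `$` does not evaluate y: it looks for a column literally named "y".
is.null(df$y)                     # TRUE

# `[[` evaluates y and uses its value, "earn".
df[[y]]                           # 10 20 30

# NULL - numeric is numeric(0), and mean(numeric(0)) is NaN.
bad.mse <- mean((df$y - predictions)^2)

# With `[[` the squared errors are 4, 4 and 9, so the mean is 17/3.
good.mse <- mean((df[[y]] - predictions)^2)

is.nan(bad.mse)                   # TRUE
good.mse
```

The same thing happens inside the loop in the question: every fold computes a mean over an empty vector, so every entry of mse is NaN.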
VitaminB16

Here is a general-purpose function. The argument names are self-descriptive. I have added an argument, verbose, which defaults to FALSE.
It is tested below with the built-in data set mtcars.

my.k.fold.1 <- function(numberOfFolds, inputData, response, regressors, verbose = FALSE){
  fmla <- paste(regressors, collapse = "+")
  fmla <- paste(response, fmla, sep = "~")
  fmla <- as.formula(fmla)
  index <- sample(numberOfFolds, nrow(inputData), replace = TRUE)
  mse.all <- numeric(numberOfFolds)
  for (n in seq_len(numberOfFolds)) {
    inx <- which(index != n)
    data.training <- inputData[inx, ]
    data.test <- inputData[-inx, ]
    if(verbose){
      msg <- paste("fold:", n, "nrow(training):", nrow(data.training), "nrow(test):", nrow(data.test))
      message(msg)
    }
    model <- lm(fmla, data = data.training)
    predicted <- predict(model, newdata = data.test)
    mse <- mean((data.test[[response]] - predicted)^2)
    mse.all[n] <- mse
  }
  return(mse.all)
}

X <- names(mtcars)[-c(1, 3, 5, 7)]
y <- "mpg"

set.seed(2021)
mse.kcv <- my.k.fold.1(5, mtcars, response = y, regressors = X, verbose = TRUE)
mse.kcv
#[1] 14.255583  8.355831  2.765447  7.539299 10.151655
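
A side note on the fold assignment: sample(numberOfFolds, nrow(inputData), replace = TRUE) draws each row's fold independently, so fold sizes vary from run to run, and a fold that ends up empty would again yield NaN. One possible alternative, shown here only as a sketch, is to shuffle a repeated sequence of fold labels so that fold sizes differ by at most one. The helper name balanced_folds is hypothetical and does not appear in either answer.

```r
# Hypothetical helper: assign each of n rows to one of k folds so that
# fold sizes differ by at most one row.
balanced_folds <- function(n, k) {
  # rep_len() cycles 1..k until it has n labels; sample() shuffles them.
  sample(rep_len(seq_len(k), n))
}

set.seed(2021)
folds <- balanced_folds(nrow(mtcars), 5)

# mtcars has 32 rows, so folds 1 and 2 get 7 rows and folds 3 to 5 get 6.
table(folds)
```

The result could replace the index vector inside either version of my.k.fold.1 without changing the rest of the loop.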
Rui Barradas