I am currently taking an online data science course on machine learning, and we are asked to fit a linear model (lm) 100 times and obtain mean(rmse) and sd(rmse) for datasets of different sizes, n = c(100, 500, 1000, 5000, 10000). We are asked to create a function that takes the size n and builds the dataset, then runs the loop that fits the 100 models. After that we set the seed and use map() or sapply() to apply the new function to the different sizes in n.

My code gives the error "Error in dat$y : $ operator is invalid for atomic vectors" when I run f1. This is my code:

    library(MASS)
    library(caret)


    ff=function(n){
      Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
      dat <- MASS::mvrnorm(n, c(69, 69), Sigma)%>%data.frame() %>% setNames(c("x", "y"))

    }
    set.seed(1,sample.kind = "Rounding")
    n=c(100,500,1000,5000,10000)
    f1=map(n,function(dat){
      rmse=replicate(100,{
        y <- dat$y
        test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
        train_set <- dat %>% slice(-test_index)
        test_set <- dat %>% slice(test_index)
        fit <- lm(y ~ x, data = train_set)
        y_hat <- fit$coef[1] + fit$coef[2]*test_set$x
        sqrt(mean((y_hat - test_set$y)^2))
      })
      structure(c(mean(rmse),sd(rmse)))
    })

Thank you for your help!!

1 Answer

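map() passes each element of n (a single number) to your function as dat, so dat$y is applied to an atomic vector, which is what triggers the error. You need to call ff() inside the function to build the dataset first. I think you should use something like: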

    library(caret)
    library(dplyr)

    n <- c(100, 500, 1000, 5000, 10000)

    # ff() is the data-generating function defined in the question
    f1 <- purrr::map(n, function(x) {
      rmse <- replicate(100, {
        dat <- ff(x)  # build a dataset of size x
        y <- 1:nrow(dat)
        # split the rows 50/50 into training and test sets
        test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
        train_set <- dat %>% slice(-test_index)
        test_set <- dat %>% slice(test_index)
        fit <- lm(y ~ x, data = train_set)
        y_hat <- fit$coef[1] + fit$coef[2] * test_set$x
        sqrt(mean((y_hat - test_set$y)^2))
      })
      # one pair per size: mean and sd of the 100 RMSE values
      c(mean(rmse), sd(rmse))
    })
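
To see where the error comes from, here is a minimal base-R sketch. It is only an illustration: make_data() is a hypothetical stand-in for ff() that uses rnorm() instead of MASS::mvrnorm(), and the summary it returns is not the RMSE from the exercise. The first call reproduces the error message, and the second shows the pattern of building the data inside the function that receives the size.

```r
n <- c(100, 500)

# sapply()/map() hand each element of n to the function, so `dat` is a
# plain number here and `$` cannot be used on it.
msg <- tryCatch(
  sapply(n, function(dat) dat$y),
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical stand-in for ff(): builds a data frame with `size` rows.
make_data <- function(size) {
  x <- rnorm(size, mean = 69, sd = 3)
  data.frame(x = x, y = 0.5 * x + rnorm(size, sd = 3))
}

set.seed(1)
# Build the dataset inside the function, then `$` works as expected.
out <- sapply(n, function(size) {
  dat <- make_data(size)
  c(rows = nrow(dat), mean_y = mean(dat$y))
})
print(out)
```

The same idea carries over to the real code: the function given to map() takes a size, calls ff() on it, and only then works with dat$y.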
Ronak Shah