I'm trying to split a data frame into training and test sets using createDataPartition in R, with the training set holding 60% of the data. When I ran this code and looked at the resulting objects, SF.training_2 had all of the observations and SF.test_2 had none. Help? I was also getting an error message that the summary command wasn't recognized, even though I had run it successfully elsewhere in my code, which I found confusing and concerning.

inTrain <- createDataPartition(
  y = paste(data_train_test$Rooms, 
            data_train_test$crime_nn5, 
            data_train_test$nhood, 
            data_train_test$BLDGSQFT, 
            data_train_test$estimate),
  p = .60, 
  list = FALSE)

SF.training_2 <- data_train_test[inTrain,]

summmary(SF.training_2)

SF.test_2 <- data_train_test[-inTrain,]
popcorn
  • You need to provide more information. If you are using functions that are not in base R, you must include your code showing what packages you are using. The function `createDataPartition` is not in base R. Also give us some of your data using `dput()` so we can run your code. My first guess would be that you are specifying a vector with too many groups (`y=paste(....)`). What does `table(y=paste(....))` give you? – dcarlson Oct 18 '19 at 03:44

1 Answer

It seems that you are using the caret and tidyverse packages. To help you, we need some example data. Let's create a fictitious dataset:

library(caret)      # provides createDataPartition()
library(tidyverse)

# Fictitious data with the same column names as in the question
data_train_test <- data.frame(
  Rooms = c("a","b","c","a","b","c","a","b","c","a"),
  crime_nn5 = c(2,3,4,2,3,2,3,2,3,4),
  nhood = c("Alvem","Rhye","Huttons","Rhye","Olan","Alvem","Olan","Huttons","Alvem","Rhye"),
  BLDGSQFT = c(400,600,660,480,590,480,510,500,700,570),
  estimate = c(34000, 55000, 60000, 37000, 50000, 45000, 48000, 51000, 80000, 52000))

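Now you want to make a data partition. As you can read in the documentation (https://cran.r-project.org/web/packages/caret/caret.pdf), "y" must be a vector of outcomes, normally a single column. In your code, "y" is five columns pasted together, so almost every row becomes its own class. createDataPartition samples within each class, and a class with a single record goes entirely into the training set, which is most likely why your test set came out empty. By the way, the summary call that gives you an error message has a typo: it is written "summmary".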

inTrain <- createDataPartition(data_train_test$Rooms, times = 1, p = 0.6, list = FALSE)

SF.training_2 <- data_train_test[inTrain,]

summary(SF.training_2)

SF.test_2 <- data_train_test[-inTrain,]
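
To see why the original call left nothing for the test set, you can count the rows per class, as suggested in the comment on the question. This is only a sketch in base R on the fictitious data above (the values are made up, not your real data); it compares the pasted key from the question with the single Rooms column:

```r
# Fictitious data (hypothetical values, same column names as in the question)
data_train_test <- data.frame(
  Rooms = c("a","b","c","a","b","c","a","b","c","a"),
  crime_nn5 = c(2,3,4,2,3,2,3,2,3,4),
  nhood = c("Alvem","Rhye","Huttons","Rhye","Olan","Alvem","Olan","Huttons","Alvem","Rhye"),
  BLDGSQFT = c(400,600,660,480,590,480,510,500,700,570),
  estimate = c(34000, 55000, 60000, 37000, 50000, 45000, 48000, 51000, 80000, 52000))

# The stratification key built in the question: five columns pasted together
y_pasted <- with(data_train_test,
                 paste(Rooms, crime_nn5, nhood, BLDGSQFT, estimate))

# Rows per class for the pasted key and for a single column
pasted_counts <- table(y_pasted)
rooms_counts  <- table(data_train_test$Rooms)

length(pasted_counts)  # number of distinct classes in the pasted key
max(pasted_counts)     # size of the largest class in the pasted key
rooms_counts           # class sizes when stratifying on Rooms alone
```

With this toy data every pasted key is unique, so each class holds one row and there is nothing left over to split off. Stratifying on Rooms alone gives classes with several rows each, so a 60/40 split is possible.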

This code should work for you. Please don't forget to provide a minimal reproducible example with data; that way we can help you better.

Regards,

Alexis
