
I'm currently using R to do feature selection with Random Forest regression. I want to split my data 70:30, which is easy enough to do. However, I want to be able to do this 10 times, with each of the 10 runs drawing a different set of examples from the one before.

trainIndex <- createDataPartition(lipids$RT..seconds., p = 0.7, list = FALSE)
lipids.train <- lipids[trainIndex, ]
lipids.test <- lipids[-trainIndex, ]

This is what I'm doing at the moment, and it works great for splitting my data 70:30. But when I do it again, I get the same 70% of the data in my training set and the same 30% in my test set. I know this is how createDataPartition works, but is there a way of making it so that I get a different 70% of the data the next time I run it?

Thanks

user2062207
  • I haven't used `createDataPartition` but couldn't you just use `sample` to get random index values and subset for those indices? – TheComeOnMan Nov 14 '13 at 16:40
  • In the future, please include the packages you're using since `createDataPartition` is not in base R. Did you find the `times` argument? – Justin Nov 14 '13 at 16:42
  • @Codoremifa I've not come across sample, however it does seem to be the answer to my problem. Thank you! – user2062207 Nov 14 '13 at 16:45

2 Answers


In the future, please include the packages you're using, since `createDataPartition` is not in base R. I'm assuming you're using the caret package. If that is correct, did you find the `times` argument?

library(caret)
trainIndex <- createDataPartition(lipids$RT..seconds., p = 0.7, list = FALSE, times = 10)
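
For what it's worth, with `list = FALSE` and `times = 10` the result is a matrix with one column of training-row indices per resample, so subsetting with the whole matrix at once mixes all ten splits together. A minimal sketch of pulling the splits out one column at a time, assuming caret is installed and using the built-in `mtcars` data as a stand-in for `lipids` (the `splits` name is just for illustration):

```r
library(caret)  # createDataPartition() comes from caret

set.seed(42)     # optional; only makes this sketch reproducible
y <- mtcars$mpg  # stand-in for lipids$RT..seconds.

# Matrix with one column of training-row indices per resample
trainIndex <- createDataPartition(y, p = 0.7, list = FALSE, times = 10)

# Build one train/test pair per column
splits <- lapply(seq_len(ncol(trainIndex)), function(i) {
  idx <- trainIndex[, i]
  list(train = mtcars[idx, ], test = mtcars[-idx, ])
})
```

Each element of `splits` can then go to whatever model-fitting call you use: fit on `splits[[i]]$train` and evaluate on `splits[[i]]$test`.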

As mentioned in the comments, you can just as easily use `sample`:

sample(seq_along(lipids$RT..seconds.), as.integer(0.7 * nrow(lipids)))

Each call to `sample` draws from R's random number generator, whose state advances on every call, so you will get a different subset each time unless you reset the seed with `set.seed`.
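
To make that concrete, here is a rough sketch of drawing ten different 70:30 splits with `sample`, using the built-in `mtcars` data as a stand-in for the real data (`n_train`, `train_sets` and `splits` are illustrative names). The `set.seed` call is optional and only there so the ten splits can be reproduced later:

```r
set.seed(1)  # optional: fixes the starting RNG state for reproducibility

n <- nrow(mtcars)               # mtcars stands in for the lipids data
n_train <- as.integer(0.7 * n)  # size of each training set, rounded down

# replicate() calls sample() ten times; each call advances the RNG state,
# so each draw is a fresh random subset of row numbers
train_sets <- replicate(10, sample(n, n_train), simplify = FALSE)

# One train/test pair per draw
splits <- lapply(train_sets, function(idx) {
  list(train = mtcars[idx, ], test = mtcars[-idx, ])
})
```

Looping over `splits` then gives ten models, each trained on a different 70% of the rows.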

Justin
  • Will do sorry! And I am using caret yes. I have come across the times argument, however when I do:- 'lipids.train <- lipids[trainIndex, ]' I get 10 folds of 70% of my data all in one, and I don't know how to use the times argument to allow me to make a random forest model 10 times using 10 different subsets of the data, but the sample method seems to work perfectly. Thank you for your help! – user2062207 Nov 14 '13 at 16:53
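
library(dplyr)
# 70% of the rows, rounded down
n <- as.integer(nrow(data) * 0.7)
# sample() draws n distinct row numbers; rerunning gives a different draw
data_70 <- data[sample(nrow(data), n), ]
# anti_join() keeps the rows of data with no match in data_70; it matches on
# column values, so this assumes every row of data is unique
data_30 <- anti_join(data, data_70)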
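
One caveat with the `anti_join` approach: it matches rows by value rather than by position, so if the data contains duplicate rows, any copy of a training row is dropped from the test set as well. A small sketch of the difference, assuming dplyr is installed and using a made-up four-row data frame (`df`, `train_idx`, `test_by_join` and `test_by_index` are illustrative names):

```r
library(dplyr)

df <- data.frame(x = c(1, 1, 2, 3))  # rows 1 and 2 are duplicates
train_idx <- c(1, 3)                 # pretend these rows were sampled
train <- df[train_idx, , drop = FALSE]

# Value-based: every row whose x appears in train is removed
test_by_join <- anti_join(df, train, by = "x")

# Position-based: exactly the rows not sampled are kept
test_by_index <- df[-train_idx, , drop = FALSE]

nrow(test_by_join)   # 1 (only x = 3 survives)
nrow(test_by_index)  # 2 (the second x = 1 and x = 3)
```

Keeping the sampled row numbers and subsetting with `-train_idx` avoids the issue when duplicates are possible.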