R split data into 2 parts randomly

Question

I am trying to split my data frame into 2 parts randomly. For example, I'd like to get a random 70% of the rows into one data frame and the remaining 30% into another data frame. The original data frame has over 800,000 rows. I've tried a for loop that picks a random row number, appends that row to the first (70%) data frame with rbind(), and deletes it from the original data frame, so that what is left over becomes the other (30%) data frame. But this is extremely slow. Is there a relatively fast way to do this?

score 16 · Accepted Answer · answered Jul 01 '15 at 05:37

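Try drawing a logical index with sample(). Each row goes to the first part with probability 0.7, so the split is approximately 70/30: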

n <- 100
data <- data.frame(x=runif(n), y=rnorm(n))
ind <- sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))
data1 <- data[ind, ]
data2 <- data[!ind, ]
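
As a sanity check on the snippet above, the sketch below fixes the random seed (an addition for reproducibility, not part of the original answer; the value 42 is arbitrary) and confirms that the two parts do not overlap and together cover every row.

```r
set.seed(42)  # arbitrary seed, added only so the run is repeatable

n <- 100
data <- data.frame(x = runif(n), y = rnorm(n))

# Each row is assigned TRUE with probability 0.7, independently of the others
ind <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
data1 <- data[ind, ]
data2 <- data[!ind, ]

# The two parts never share a row and always add up to the full data frame
stopifnot(nrow(data1) + nrow(data2) == nrow(data))
stopifnot(length(intersect(rownames(data1), rownames(data2))) == 0)

# Share of rows in data1: close to 0.7, but usually not exactly 0.7
mean(ind)
```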

ExperimenteR


Beat me to it. Works quickly when extended to 800K cases too. – thelatemail Jul 01 '15 at 05:38
Works really fast. Even when I repeat it in a loop multiple times. Thank you. – gregorp Jul 01 '15 at 12:55
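
The speed reported in the comments can be checked with a rough timing sketch like the one below. It assumes a synthetic two-column data frame of 800,000 rows (the names big, part1 and part2 are illustrative); the real data may be wider, and timings depend on the machine.

```r
n <- 800000
big <- data.frame(x = runif(n), y = rnorm(n))  # synthetic stand-in for the real data

# Time the whole split: drawing the logical index plus both subsets
elapsed <- system.time({
  ind <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.7, 0.3))
  part1 <- big[ind, ]
  part2 <- big[!ind, ]
})

elapsed["elapsed"]  # wall-clock seconds; varies by machine
nrow(part1) / n     # close to 0.7
```

Because the index is built in one vectorized call, no row-by-row rbind() is needed.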

Workhorse · Answer 2 · 2018-06-12T12:56:36.577

7

I am building on the answer by ExperimenteR, which works well. One caveat, however, is that sample() with the prob argument draws each value independently, so the resulting counts are random rather than fixed. Take this for example:

> sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3))

With n = 100, you might expect the numbers of TRUE and FALSE values to be exactly 70 and 30, respectively. Often this is not the case:

> table(sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.7, 0.3)))

 FALSE  TRUE 
    34    66

That is fine if you don't need a precise split. But if you want exactly 70% and 30%, do this instead:

v <- c(rep(TRUE, 70), rep(FALSE, 30))  # exactly 70 TRUE and 30 FALSE; assumes nrow(data) == 100
ind <- sample(v)  # shuffle them into a random order
data1 <- data[ind, ]
data2 <- data[!ind, ]
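
For a data frame whose row count is not 100, one possible variant is to sample row numbers instead of shuffling a fixed logical vector. This is a sketch of a different technique (index sampling without replacement), not the method above; the row count 1003 and the use of round() for the first part's size are assumptions chosen for illustration, and the data frame is assumed to have at least one row.

```r
n <- 1003
data <- data.frame(x = runif(n), y = rnorm(n))  # synthetic example data

n1 <- round(0.7 * nrow(data))  # target size of the first part: 702 here
idx <- sample(nrow(data), n1)  # n1 distinct row numbers, drawn without replacement

data1 <- data[idx, ]
data2 <- data[-idx, ]  # negative indices drop the sampled rows

nrow(data1)  # 702
nrow(data2)  # 301
```

The sizes are fixed in advance, so only which rows land in each part is random.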

edited Jun 12 '18 at 12:56

answered Jun 12 '18 at 00:15


Try `ind <- sample(c(rep(TRUE,ceiling(nrow(data)*0.7)),rep(FALSE,floor(nrow(data)*0.3))))` – moodymudskipper Jun 12 '18 at 13:19
Partially true yes, ultimately needs numbers that are factors or multiple of 100. Yours is more robust, +1. – Workhorse Jun 12 '18 at 14:02
sorry meant to say my method ultimately needs numbers that are multiples of 10. – Workhorse Jun 12 '18 at 14:21
