How to randomly split data into three equal sizes?

Question

I have a dataset with 9558 rows from three different projects. I want to randomly split this dataset into three equal groups and assign a unique ID to each group, so that Project1_Project2_Project3 becomes Project1, Project2 and Project3.

I have tried many things and googled code from people with similar problems. I have used sample_n() and sample_frac(), but unfortunately I can't solve this issue myself :/

I have made an example of my dataset that looks like this:

ProjectName <- c("Project1_Project2_Project3")
data <- data.frame(replicate(10,sample(0:1,9558,rep=TRUE)))
data <- data.frame(ProjectName, data)

The output should be randomly split into three equal groups of nrow=3186, each assigned one of these values:

ProjectName Count of rows
Project1     3186
Project2     3186
Project3     3186

when you say split this means that you do not want repeats in the groups right? as in data in 15 is only in 1 set — Hojo.Timberwolf, Mar 27 '19 at 11:15
Does `c("Project1", "Project2", "Project3")` instead of `c("Project1_Project2_Project3")` give you what you want? — jay.sf, Mar 27 '19 at 11:18
@Hojo.Timberwolf Yes, i dont want repeats in the groups. What do you mean in 15 is only 1 set? — Rose Nonglak Seesan Jensen, Mar 27 '19 at 11:23
@jay.sf The real dataset that I have contains data from three different projects and there is only one unique ID for this and it is structured the same way as the one I made. But I would like to split it randomly into three equal groups and each group should have their own name: Project1, Project2 and Project3 :) — Rose Nonglak Seesan Jensen, Mar 27 '19 at 11:23
This question needs to be simply modified and asked in a better way to be useful for others too! — Majid, Mar 27 '19 at 11:31

jay.sf · Accepted Answer · 2019-03-29T05:45:11.180

IMO it should be sufficient to assign just random project names.

dat$ProjectName <- sample(factor(rep(1:3, length.out=nrow(dat)), 
                          labels=paste0("Project", 1:3)))

Result

head(dat)
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ProjectName
# 1  1  1  0  1  1  1  1  0  1   0    Project1
# 2  1  1  1  1  1  1  0  0  1   0    Project1
# 3  0  0  1  1  0  0  0  1  1   1    Project1
# 4  1  1  1  0  1  0  1  1  0   1    Project3
# 5  1  0  0  1  1  1  1  0  0   1    Project1
# 6  1  0  0  0  0  1  0  1  1   1    Project3

table(dat$ProjectName)
# Project1 Project2 Project3 
#     3186     3186     3186

Data

set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, rep=TRUE)))
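
If separate data frames are needed after the labels are assigned, one option is split() on the new column. This is a minimal sketch that reuses the dat and the labelling line from above; the name groups is only illustrative and not part of the original answer.

```r
set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, rep = TRUE)))

# Balanced labels (1, 2, 3, 1, 2, 3, ...) shuffled into random order
dat$ProjectName <- sample(factor(rep(1:3, length.out = nrow(dat)),
                                 labels = paste0("Project", 1:3)))

# One data frame per project, collected in a named list
# ("groups" is an illustrative name)
groups <- split(dat, dat$ProjectName)

# Row count of each group
sapply(groups, nrow)
```

Each element can then be accessed by name, for example groups$Project1.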

score 3 · Answer 2 · answered Mar 27 '19 at 11:26

I had this same problem once. This is how I did it. If you just use sample() to draw the group labels, the groups come out uneven; sampling without replacement from a vector in which the groups are already balanced worked for me.

sampleframe <- rep(1:3, ceiling(nrow(data) / 3))

data$grp <- 0
data[, "grp"] <- sample(sampleframe, size = nrow(data), replace = FALSE)

project1 <- data[data$grp %in% 1 ,]
project2 <- data[data$grp %in% 2 ,]
project3 <- data[data$grp %in% 3 ,]
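
One caveat, as far as I can tell: when the row count is not a multiple of 3, rep(1:3, ceiling(n / 3)) is longer than n, so drawing n values from it can leave group sizes that differ by more than one. Trimming the label vector with length.out, as the accepted answer does, keeps the sizes within one of each other. A small sketch with a made-up 10-row data frame (toy and labels are hypothetical names):

```r
set.seed(1)
toy <- data.frame(x = 1:10)  # hypothetical data: 10 rows, not divisible by 3

# Label vector trimmed to exactly nrow(toy): 1 2 3 1 2 3 1 2 3 1
labels <- rep(1:3, length.out = nrow(toy))

# Shuffling only reorders the labels, so the group sizes are fixed
toy$grp <- sample(labels)

table(toy$grp)
```

With 10 rows this always gives one group of 4 and two groups of 3, whatever the seed.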

score 3 · Answer 3 · answered Mar 27 '19 at 11:41

I like the solution in this comment on a GitHub gist.

You could generate the indices as suggested:

folds <- split(sample(nrow(data), nrow(data), replace = FALSE), as.factor(1:3))

Then get a list of 3 equal size data frames using:

datalist <- lapply(folds, function(x) data[x, ])
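
To get from the index list back to a single data frame with a ProjectName column, something like the following might work. It is a sketch, not the gist's code: it builds the grouping vector explicitly with rep(..., length.out = ) rather than relying on the factor being recycled, and the names Project1 to Project3 are taken from the question.

```r
set.seed(42)
data <- data.frame(replicate(10, sample(0:1, 9558, rep = TRUE)))

# Shuffle the row indices, then deal them round-robin into 3 groups
folds <- split(sample(nrow(data)), rep(1:3, length.out = nrow(data)))
names(folds) <- paste0("Project", 1:3)

# Write each group's name back onto the rows it contains
data$ProjectName <- NA_character_
for (nm in names(folds)) {
  data$ProjectName[folds[[nm]]] <- nm
}

table(data$ProjectName)
```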

score 2 · Answer 4 · answered Mar 27 '19 at 11:21

Add an id to data:

data$id <- 1:nrow(data)

Take the first sample:

project1 <- dplyr::sample_frac(data, 0.33333)

Remove the used rows from data and save into project2:

project2 <- data[!(data$id %in% project1$id), ]

Sample half of the remainder:

project3 <- dplyr::sample_frac(project2, 0.5)

Finally remove those in the project3 sample from project2:

project2 <- project2[!(project2$id %in% project3$id), ]

Check all ids are unique:

# should all be FALSE
any(project1$id %in% project2$id)
any(project1$id %in% project3$id)
any(project2$id %in% project3$id)

And double-check the data frames have the right number of cases:

nrow(project1)
nrow(project2)
nrow(project3)
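
The same sequential idea can also be sketched in base R with row indices and setdiff(), which avoids depending on how a fraction such as 0.33333 is rounded. This is an alternative under my own assumptions, not the dplyr code above; n, size, rest and idx1 to idx3 are illustrative names.

```r
set.seed(42)
data <- data.frame(replicate(10, sample(0:1, 9558, rep = TRUE)))

n <- nrow(data)
size <- n %/% 3                      # rows per group (3186 for 9558 rows)

idx1 <- sample(n, size)              # first group: draw from all rows
rest <- setdiff(seq_len(n), idx1)    # rows not used yet
idx2 <- sample(rest, size)           # second group: draw from the remainder
idx3 <- setdiff(rest, idx2)          # third group: whatever is left

project1 <- data[idx1, ]
project2 <- data[idx2, ]
project3 <- data[idx3, ]
```

If the row count is not a multiple of 3, the leftover rows end up in project3.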
