I have a dataset with 9558 rows from three different projects. I want to randomly split this dataset into three equal groups and assign a unique ID to each group, so that Project1_Project2_Project3 becomes Project1, Project2 and Project3.

I have tried many things and searched for code from people with similar problems. I have used sample_n() and sample_frac(), but unfortunately I can't solve this issue myself :/

I have made an example of my dataset looking like this:

ProjectName <- c("Project1_Project2_Project3")
data <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))
data <- data.frame(ProjectName, data)

The output should be randomly split into three equal groups of nrow = 3186 each, with the groups then assigned these values:

ProjectName Count of rows
Project1     3186
Project2     3186
Project3     3186
kath
  • When you say split, this means that you do not want repeats in the groups, right? As in, data in 15 is only in 1 set – Hojo.Timberwolf Mar 27 '19 at 11:15
  • Does `c("Project1", "Project2", "Project3")` instead of `c("Project1_Project2_Project3")` give you what you want? – jay.sf Mar 27 '19 at 11:18
  • @Hojo.Timberwolf Yes, I don't want repeats in the groups. What do you mean by "in 15 is only in 1 set"? – Rose Nonglak Seesan Jensen Mar 27 '19 at 11:23
  • @jay.sf The real dataset that I have contains data from three different projects and there is only one unique ID for this and it is structured the same way as the one I made. But I would like to split it randomly into three equal groups and each group should have their own name: Project1, Project2 and Project3 :) – Rose Nonglak Seesan Jensen Mar 27 '19 at 11:23
  • This question needs to be simply modified and asked in a better way to be useful for others too! – Majid Mar 27 '19 at 11:31

4 Answers

IMO it should be sufficient to just assign random project names: build a vector in which each name appears equally often, then shuffle it.

dat$ProjectName <- sample(factor(rep(1:3, length.out=nrow(dat)), 
                          labels=paste0("Project", 1:3)))

Result

head(dat)
#   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 ProjectName
# 1  1  1  0  1  1  1  1  0  1   0    Project1
# 2  1  1  1  1  1  1  0  0  1   0    Project1
# 3  0  0  1  1  0  0  0  1  1   1    Project1
# 4  1  1  1  0  1  0  1  1  0   1    Project3
# 5  1  0  0  1  1  1  1  0  0   1    Project1
# 6  1  0  0  0  0  1  0  1  1   1    Project3

table(dat$ProjectName)
# Project1 Project2 Project3 
#     3186     3186     3186 

Data

set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))
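
The same idea can be wrapped for reuse. The following is only a sketch: the helper name `assign_groups` and its arguments `k` and `prefix` are made up for illustration and are not part of any package. It assumes that "as equal as possible" is acceptable when the row count is not divisible by `k`.

```r
# Hypothetical helper: label the rows of df with k balanced random groups.
assign_groups <- function(df, k = 3, prefix = "Project") {
  # rep(..., length.out = n) repeats 1..k as evenly as possible;
  # sample() on that vector returns a random permutation of it.
  ids <- sample(rep(seq_len(k), length.out = nrow(df)))
  df$ProjectName <- factor(ids, levels = seq_len(k),
                           labels = paste0(prefix, seq_len(k)))
  df
}

set.seed(42)
dat <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))
dat <- assign_groups(dat)

# Each level should appear 9558 / 3 = 3186 times.
table(dat$ProjectName)
```

When the number of rows is not a multiple of `k`, the group sizes in this sketch differ by at most one row.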
jay.sf

I had this same problem once, and this is how I did it. If you just use sample() to draw a group for each row independently, the groups come out uneven; sampling without replacement from a vector in which the groups are already even worked for me.

sampleframe <- rep(1:3, ceiling( nrow( data)/3 ) ) 

data$grp <- 0
data[  , "grp"  ] <- sample( sampleframe , size=nrow( data) ,  replace=FALSE )

project1 <- data[data$grp %in% 1 ,]
project2 <- data[data$grp %in% 2 ,]
project3 <- data[data$grp %in% 3 ,]
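
To illustrate the difference between the two ways of sampling, here is a small sketch. The variable names `independent` and `balanced` are chosen only for this example.

```r
set.seed(1)
n <- 9558

# Independent draws: every row picks 1, 2 or 3 on its own,
# so the group counts are only equal on average.
independent <- sample(1:3, n, replace = TRUE)
table(independent)

# Balanced: shuffle a vector that already holds each label n / 3 times,
# so the counts are fixed and only the order is random.
balanced <- sample(rep(1:3, length.out = n))
table(balanced)
```

The first table will generally show three slightly different counts, while the second always shows 3186 per group for n = 9558.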
MatthewR

I like the solution in this comment to a GitHub gist.

You could generate the indices as suggested:

folds <- split(sample(nrow(data), nrow(data), replace = FALSE), as.factor(1:3))

Then get a list of 3 equal size data frames using:

datalist <- lapply(folds, function(x) data[x, ])
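
If the project names are needed as well, the list can be named and bound back into a single data frame. This is a sketch under assumptions: it recreates the example data from the question, and the names `idx` and `combined` are introduced here for illustration.

```r
set.seed(42)
data <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))

# Shuffle the row indices, then deal them into 3 groups.
idx <- sample(nrow(data))
folds <- split(idx, rep(1:3, length.out = length(idx)))
names(folds) <- paste0("Project", 1:3)

# One data frame per project.
datalist <- lapply(folds, function(i) data[i, ])

# Optionally recombine, carrying the list name along as a column.
combined <- do.call(rbind, Map(function(d, nm) data.frame(ProjectName = nm, d),
                               datalist, names(datalist)))

sapply(datalist, nrow)
table(combined$ProjectName)
```

Building the grouping vector with `rep(1:3, length.out = ...)` rather than relying on recycling keeps the call free of warnings when the row count is not a multiple of 3.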
neilfws

Add an id to data:

data$id <- 1:nrow(data)

Take the first sample:

project1 <- dplyr::sample_frac(data, 1/3)

Remove the used rows from data and save into project2:

project2 <- data[!(data$id %in% project1$id), ]

Sample half of the remainder:

project3 <- dplyr::sample_frac(project2, 0.5)

Finally remove those in the project3 sample from project2:

project2 <- project2[!(project2$id %in% project3$id), ]

Check all ids are unique:

# should all be FALSE
any(project1$id %in% project2$id)
any(project1$id %in% project3$id)
any(project2$id %in% project3$id)

And double-check the data frames have the right number of cases:

nrow(project1)
nrow(project2)
nrow(project3)
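
The sample-then-remove steps can also be collapsed into a single shuffle of the ids in base R. This is a sketch only: it assumes the number of rows is divisible by 3 (9558 is), and the names `shuffled` and `third` are illustrative.

```r
set.seed(42)
data <- data.frame(replicate(10, sample(0:1, 9558, replace = TRUE)))
data$id <- seq_len(nrow(data))

# One random permutation of the ids; consecutive thirds of it
# cannot overlap, so no rows need to be removed afterwards.
shuffled <- sample(data$id)
third <- nrow(data) / 3  # assumes nrow(data) is a multiple of 3

project1 <- data[data$id %in% shuffled[1:third], ]
project2 <- data[data$id %in% shuffled[(third + 1):(2 * third)], ]
project3 <- data[data$id %in% shuffled[(2 * third + 1):(3 * third)], ]

c(nrow(project1), nrow(project2), nrow(project3))
```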
Phil