How can I use R to partition a dataset into N equally sized partitions? I've tried something like:

    for (i in 1:100){data[i] <- full_data[i:(100000*i),]}

This obviously doesn't work, but hopefully it gives an idea of what I'm trying to accomplish. The full dataset has 1,000,000 rows and is already in random order. I'd like 100 equal and independent datasets of 10,000 rows each.

Geoffrey

3 Answers

That should do it, assuming `data` is a list:

    data <- list()
    for (i in 1:100) {
      data[[i]] <- full_data[((i - 1) * 10000 + 1):(i * 10000), ]
    }
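
A loop-free sketch of the same partitioning, using base R's `split()` with one group label per row. The small `full_data` frame and the names `n_parts`, `part_size`, and `group` are illustrative assumptions, not part of the original question, and the sketch assumes the row count is an exact multiple of the number of partitions.

```r
# Small stand-in for the 1,000,000-row data frame (hypothetical example data)
full_data <- data.frame(id = 1:1000, x = rnorm(1000))

n_parts   <- 100
part_size <- nrow(full_data) / n_parts  # assumes nrow is an exact multiple of n_parts

# One group label per row: part_size ones, then part_size twos, and so on
group <- rep(seq_len(n_parts), each = part_size)

# split() returns a list of data frames, one per group label
data <- split(full_data, group)

length(data)      # number of partitions
nrow(data[[12]])  # rows in the 12th partition
```

As with the loop, each partition is reached by index, for example `data[[12]]`.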
Christian Borck
  • Dumb question, but how are the data sets named out of that loop? I tried data12, data[12]...can't seem to find it. – Geoffrey Apr 10 '14 at 19:27
  • `data` is a list, so you access the elements by index: `data[[1]]` through `data[[100]]` – Christian Borck Apr 10 '14 at 19:29
  • Is it possible to get the datasets as dataframes, with names like data1, data2, etc.? – Geoffrey Apr 10 '14 at 19:31
  • Sure, but in most cases this is not useful. What are your next steps with these data.frames? With the current solution you can simply reference every DF by index... – Christian Borck Apr 10 '14 at 19:33
  • Ultimately I will use each partition as input for a parameter estimation function. I will then store the output of each estimation in a table for comparison, and use the average of all 100 estimates as the final model input. – Geoffrey Apr 10 '14 at 19:37
  • Something like: `data <- list(); for (i in 0:99){data[[i+1]] <- full_data[(i*10000+1):((i+1)*10000),]}; t.params <- list(); for (i in 0:99){t.params[[i+1]] <- pnbd.EstimateParameters(data[[i+1]][which(data[[i+1]]$x<100),])}` – Geoffrey Apr 10 '14 at 19:38
  • I think I don't get it. It's far more convenient to have 100 dataframes stored in a list than to have 100 dataframes named "data1" to "data100". How would you reference those df names? – Christian Borck Apr 10 '14 at 19:45
  • Oh, and instead of `data[[i]][which(data[[i]]$x<100),]` you can just do `data[[i]][data[[i]]$x<100, ]`. No need for `which()`, as long as `x` contains no `NA` values. – Christian Borck Apr 10 '14 at 19:51
  • I see your point about using the list structure - I'll stick with that. I'll likely combine the partition loop with the estimation loop, since ultimately I'm not that interested in the datasets themselves but rather the parameter estimates each one generates. – Geoffrey Apr 10 '14 at 20:02
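
The combined partition-and-estimate workflow discussed in these comments could be sketched roughly as follows. `estimate_params` is a hypothetical placeholder for the real estimation function (such as the `pnbd.EstimateParameters` call mentioned above), and the small simulated `full_data` is an assumption used only to keep the example self-contained.

```r
set.seed(1)
# Hypothetical stand-in data: 1,000 rows instead of 1,000,000
full_data <- data.frame(x = rexp(1000, rate = 0.05))

# Hypothetical estimator; swap in the real estimation function here
estimate_params <- function(df) c(mean = mean(df$x), sd = sd(df$x))

n_parts   <- 100
part_size <- nrow(full_data) / n_parts  # assumes an exact multiple

# Partition and estimate in one pass; each list element is a named numeric vector
estimates <- lapply(seq_len(n_parts), function(i) {
  rows <- ((i - 1) * part_size + 1):(i * part_size)
  part <- full_data[rows, , drop = FALSE]
  estimate_params(part[part$x < 100, , drop = FALSE])
})

# One row per partition, one column per parameter
estimate_table <- do.call(rbind, estimates)

# Average of the per-partition estimates
final_params <- colMeans(estimate_table)
```

Because the partitions are never stored, only `estimate_table` and `final_params` remain in memory once the loop finishes.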
You can create quantile groups of the row index (e.g. when you want exactly n groups without having to work out the boundaries yourself):

    data <- data.frame(1:1000000)

    # Assign each value of x to one of n quantile-based groups
    xtile <- function(x, n) {
        cuts <- quantile(x, probs = seq(0, 1, length.out = n + 1))
        cut(x, breaks = cuts, include.lowest = TRUE)
    }

    group <- xtile(1:nrow(data), 100)
    all(table(group) == 10000)  # check that every group has 10,000 rows

    data.spl <- split(data, group)
    data.spl[[2]]
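
When the row count is not an exact multiple of the number of groups, a similar grouping can be built with integer arithmetic on the row index instead of quantiles. This is a minimal sketch on hypothetical data (1,003 rows, 10 groups), not part of the original answer; group sizes then differ by at most one row.

```r
# Hypothetical example: 1,003 rows do not divide evenly into 10 groups
data <- data.frame(id = 1:1003)
n <- 10

# Map row i to group ceiling(i * n / nrow), so sizes differ by at most one row
group <- ceiling(seq_len(nrow(data)) * n / nrow(data))

data.spl <- split(data, group)
sapply(data.spl, nrow)  # rows per group
```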
Luca Braglia
I believe the `cut2()` function from the Hmisc package will also partition into equally sized groups; its `g` argument sets the number of quantile groups.

lawyeR