
I have a large dataset, x, that contains replicated rows, i.e. rows duplicated across all of its variables:

set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_unique <- x[!duplicated(x), ]

I need to sample all instances of each unique row in x a given number of times, so I create a new variable that is simply a concatenation of the variables for each row:

# Each row of x can be seen as a single concatenated value - will be useful later
x_code <- do.call(paste0, x)
u_code <- x_code[!duplicated(x)]

We need a repeated sample from x, replicating each unique row s times. The number of replications per unique row is provided in the vector s (which may contain zeros, in which case that unique row is not sampled at all):

s <- rpois(n = nrow(x_unique), lambda = 0.9)

The question is: how to sample individuals from x to meet the quota set by s for each unique row? Here's a long and inelegant way that gets the right result:

sel <- NULL  # accumulator for sampled row indices
for (i in seq_along(s)) {
  xs <- which(x_code %in% u_code[i])  # all rows of x matching unique row i
  sel <- c(sel, xs[sample(length(xs), size = s[i], replace = TRUE)])
}

x_sampled <- x[sel, ]

This is slow to run and cumbersome to write.

Is there a way to generate the same result (x_sampled in the above) faster and more concisely? Surely there must be a way!

RobinLovelace
  • Just a small point: in each iteration of your `for` loop, you're overwriting your `sel` object. This is probably not what you want to be doing. – Thomas Aug 13 '14 at 07:18
  • It's not completely overwriting sel: the previous information is being concatenated by c(sel, ...). – RobinLovelace Aug 13 '14 at 08:16
  • Ah, yes, I missed that. It is better (from a speed perspective) to initialize a vector of your desired length and then populate it using `[` extraction rather than repeatedly `c()`ing vectors together. – Thomas Aug 13 '14 at 08:31
  • OK thanks for the tip - useful as speed is critical in this case. But cannot see how your suggestion would work in this case. Please provide example code of what you mean. – RobinLovelace Aug 13 '14 at 08:44
  • Perhaps something like: `sel <- list()`. Then inside your loop store each sample in a list element with `sel[[i]] <- sample(...)`. Then `sel <- unlist(sel)` after the loop (see the sketch after this thread). – Thomas Aug 13 '14 at 08:48
  • A couple Q: 1) Could you provide your real data dimensions? i.e., how many rows, columns and their types, and how many unique rows are there approximately? 2) Are you looking only for `base` / `dplyr` solution or other packages are acceptable as well? – Arun Aug 13 '14 at 14:08
  • 1) It's half a million rows (560966) and I need to iterate over the same process for 7021 zones, hence the need for speed. And I'm in competition with someone using a Java solution (the 'Flexible Modelling Framework', http://eprints.ncrm.ac.uk/3177/). Clearly I want R to be faster! 2) I don't mind what packages are used - happy with a dplyr solution if that's fastest! – RobinLovelace Aug 13 '14 at 14:15
  • I think you missed some points there. Again: how many columns? How many unique rows (even if approx.)? What are their types? Also, I don't quite follow what you mean by "iterate over the same process for 7021 zones". I wasn't going to offer a `dplyr` solution. My guess is that a `data.table` solution will be *much* faster. But I'd like to work on the real dimensions. – Arun Aug 13 '14 at 14:22
  • There are 32 variables and 1349 unique rows in the example I'm working on at the moment. `data.table` solution would be preferred if faster - could not work out how to do random sample by group in `data.table`. I can send a reproducible example: all the data used is in the public domain. By 'zones' I mean I'll need to run the process for each administrative area I'm selecting individuals for: http://stats.stackexchange.com/questions/109706/how-to-make-ipf-code-faster-and-more-concise-in-r/ – RobinLovelace Aug 13 '14 at 15:11
  • Reproducible example - try running ipfp.R. `int`, `ind_agg` and `int_agg` are the results I'm after. https://dl.dropboxusercontent.com/u/15008199/tmp/ipf-vs-co.zip – RobinLovelace Aug 13 '14 at 15:19
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/59295/discussion-between-arun-and-robinlovelace). – Arun Aug 13 '14 at 18:47
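
Following Thomas's suggestion, here is a minimal sketch of the list-based accumulation (assuming the x_code, u_code and s objects defined in the question); growing a list and flattening it once avoids repeatedly copying sel with c():

# Store each group's sample in its own list element, then flatten once
sel_list <- vector("list", length(s))
for (i in seq_along(s)) {
  xs <- which(x_code %in% u_code[i])
  sel_list[[i]] <- xs[sample(length(xs), size = s[i], replace = TRUE)]
}
sel <- unlist(sel_list)
x_sampled <- x[sel, ]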

2 Answers


The key to doing this efficiently is to figure out how to work with the indices, and how to vectorise as much as possible. For your problem, things get much easier if you find the indices for each repeated row:

set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))

index <- 1:nrow(x)
grouped_index <- split(index, x, drop = TRUE)
names(grouped_index) <- NULL

Then you can use Map() to combine the indices to sample from and the number of samples to take for each group. I write a wrapper around sample() to protect against the annoying behaviour when x is of length 1.

sample2 <- function(x, n, ...) {
  if (length(x) == 1) return(rep(x, n))
  sample(x, n, ...)
}
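
To see the behaviour the wrapper guards against: sample() treats a single positive number n as shorthand for 1:n, so a group whose only row index is, say, 5 would otherwise be sampled from 1:5 rather than always yielding 5:

sample(5, 3, replace = TRUE)   # draws from 1:5, e.g. 2 4 1
sample2(5, 3, replace = TRUE)  # always returns 5 5 5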

samples <- rpois(n = length(grouped_index), lambda = 0.9)
sel <- unlist(Map(sample2, grouped_index, samples, replace = TRUE))
sel
#>  [1]  66  99  99   2   6  31  90  25  42  57  14  14   8   8  12  77  60
#> [18]  17  17  92  76  76  76  70  95  36  36  36 100  91  41  41  28  69
#> [35]  69  54  54  54  54  81  64  96  35  39  29  11  74  93  82  82  24
#> [52]  46  48  48  48  51  51  73  20  37  71  71  58  16  68  94  94  94
#> [69]  80  80  80  13  13  87  87  67  67  86  49  49  88  88  52  75  47
#> [86]  89   7  79  63  78  72  72  19

If you want to keep the rows in their original order, use sort():

sort(sel)
#>  [1]   2   6   7   8   8  11  12  13  13  14  14  16  17  17  19  20  24
#> [18]  25  28  29  31  35  36  36  36  37  39  41  41  42  46  47  48  48
#> [35]  48  49  49  51  51  52  54  54  54  54  57  58  60  63  64  66  67
#> [52]  67  68  69  69  70  71  71  72  72  73  74  75  76  76  76  77  78
#> [69]  79  80  80  80  81  82  82  86  87  87  88  88  89  90  91  92  93
#> [86]  94  94  94  95  96  99  99 100

I think the bottleneck in this code will be split(): base R doesn't have an efficient way of hashing data frames, so it relies on pasting the columns together.
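
In other words, split() with a data frame groups by the interaction of its columns, which behaves much like building a pasted key per row. A sketch of the equivalent grouping (group order may differ, but the group memberships are the same):

# Roughly what split() must do internally: one pasted key per row
key <- do.call(paste, x)
grouped_index2 <- split(index, key)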

hadley
  • But isn't there a `dplyr`ish way to do this (it does hash data.frames)? – Arun Aug 13 '14 at 14:32
  • Thanks for the method Hadley - I hadn't used Map before and it seems to work nicely, about 3 times faster: `microbenchmark(MapMethod(), original_method()) Unit: milliseconds expr min lq median uq max neval MapMethod() 106.0149 109.2071 111.8960 116.2424 195.7272 100 original_method() 293.9523 301.0262 305.6481 311.8210 355.2843 100` Was wondering if there was a dplyr way though... Also, the results are not identical: the `s` object is ordered, so `sel` must be generated on data that's in the same order as `x`. – RobinLovelace Aug 13 '14 at 14:52
  • @Arun not really - it's a variation on `sample_n()`, with varying `n`. dplyr doesn't have great support for selecting rows by index yet. Next version might support `slice(x, sample(n(), rpois(1, 0.8)))` – hadley Aug 13 '14 at 18:30
  • @RobinLovelace I'm not sure I follow: my code doesn't have an `s` object. – hadley Aug 13 '14 at 18:30

You can use rep() to create an index vector, then subset your data using it.

Try this:

idx <- rep(seq_along(s), times = s)

The first few values of idx. Note how the second row gets repeated twice, while row 4 is absent:

idx
 [1]  1  2  2  3  6  7  8 10 11 13 14 14 ......

Then do the subsetting. Notice how the new duplicates have row names that indicate the replication.

x_unique[idx, ]

     X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1     1  1  0  0  0  1  0  0  1   0
2     1  0  1  0  0  1  0  0  0   0
2.1   1  0  1  0  0  1  0  0  0   0
3     1  1  0  0  1  0  0  0  1   0
6     0  0  0  0  1  1  0  0  0   0
7     0  1  1  0  1  1  0  1  1   1
8     1  1  0  1  0  0  1  1  0   0
10    0  0  1  0  1  1  1  1  0   0
....
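
If the fractional row names (2.1 and so on) are unwanted, they can be reset after subsetting (x_rep is just an illustrative name):

x_rep <- x_unique[idx, ]
rownames(x_rep) <- NULL  # back to plain sequential row names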
Andrie
  • Thanks for the answer, but this is not quite what I want... I would like to randomly sample from x, not x_unique. The reason is that there are other variables in x not shown. So, think of x_unique[32, ] in the example above. This is the same as x[32, ], x[34, ] and x[84, ]. I want sample(c(32, 34, 84), s[32]) for this example, with a 1/3 chance of each possibility being selected. – RobinLovelace Aug 13 '14 at 08:10
  • P.S. I believe your answer here will help, Andrie: http://stackoverflow.com/questions/7950834/drawing-a-stratified-sample-in-r – RobinLovelace Aug 13 '14 at 09:15
  • @RobinLovelace Does that exactly answer your question? Is it a duplicate? – Andrie Aug 13 '14 at 09:17
  • Depends on your starting point I think - I'm starting with many different columns which I'm then converting into a single variable x_code, which is analogous to Category in the other question. Also I'm looking for a way to do it very fast, so that's different too. – RobinLovelace Aug 13 '14 at 09:39