I have a vector of values r as follows:

 r<-c(1,3,4,6,7)

and a data frame df with 21 records and two columns:

 id<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,4,15,16,17,18,19,20)
 freq<-c(1,3,2,4,5,6,6,7,8,3,3,1,6,9,9,1,1,4,3,7,7)
 df<-data.frame(id,freq)

Using r, I need to extract a sample of records from df (as a new data frame) such that the freq values of the sampled records equal the values in r. If several records share the same freq value, one of them should be picked at random. For instance, one possible outcome is:

   id     frequency
   12         1
   10         3
   4          4
   7          6
   8          7

I would be thankful if anyone could help me with this.

AliCivil

3 Answers

You could try data.table

library(data.table)
setDT(df)[freq %in% r,sample(id,1L) , freq]

Or using base R

aggregate(id~freq, df, subset=freq %in% r, FUN= sample, 1L)
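
One thing to watch with `sample(id, 1L)`: when a group holds a single id greater than 1, `sample()` treats that number as the upper bound of `1:id` rather than as the only candidate. Every matched freq value in this data has several ids, so it does not bite here, but a minimal base R sketch that avoids the issue by sampling row positions could look like this (`pick_one` is a hypothetical helper name, not part of any package):

```r
r    <- c(1, 3, 4, 6, 7)
id   <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,4,15,16,17,18,19,20)
freq <- c(1,3,2,4,5,6,6,7,8,3,3,1,6,9,9,1,1,4,3,7,7)
df   <- data.frame(id, freq)

# Hypothetical helper: pick one element of idx at random.
# Indexing by sample.int() stays correct even when idx has length 1.
pick_one <- function(idx) idx[sample.int(length(idx), 1L)]

# Row numbers of df whose freq occurs in r, grouped by freq value
keep   <- df$freq %in% r
groups <- split(which(keep), df$freq[keep])

# One randomly chosen row number per freq value
rows <- vapply(groups, pick_one, integer(1))

res <- df[unname(rows), ]
row.names(res) <- NULL
res
```

The result has one row per value of r, ordered by freq; the ids differ from run to run unless `set.seed()` is called first.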

Update

If the vector 'r' contains duplicate values and you want to sample from 'df' as many rows for each value as the number of times it occurs in 'r':

  r <-c(1,3,3,4,6,7)
  res <- do.call(rbind,lapply(split(r, r), function(x) {
           x1 <- df[df$freq %in% x,]
           x1[sample(1:nrow(x1),length(x), replace=FALSE),]}))
  row.names(res) <- NULL
akrun
  • Any reason to chain two `[`'s over `setDT(df)[freq %in% r,sample(id,1L) , freq]`? – Frank May 01 '15 at 14:38
  • @AliTamaddoni You can do `unique(r)` and then replace `r` in the code with that, although it will also work without the `unique(r)` – akrun May 01 '15 at 15:00
  • @akrun I meant in a way that it still keeps both 3s and finds a random match for each. Thanks! – AliCivil May 01 '15 at 21:55

You can use filter and sample_n from "dplyr":

library(dplyr)
set.seed(1)
df %>% 
  filter(freq %in% r) %>% 
  group_by(freq) %>% 
  sample_n(1)
# Source: local data frame [5 x 2]
# Groups: freq
# 
#   id freq
# 1 12    1
# 2 10    3
# 3 17    4
# 4 13    6
# 5  8    7
A5C1D2H2I1M1N2O1R2T1
  • Thanks. Apparently the 'dplyr' package is not available for R version 3.0.1. Do you know any alternatives? – AliCivil May 01 '15 at 14:30

Have you tried using the match() function or %in%? This might not be a fast/clean solution, but uses only base R functions:

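rUnique <- unique(r)
# Keep only the rows whose freq appears in r
df2 <- df[df$freq %in% rUnique, ]
x <- data.frame(id = NA, freq = rUnique)

# For each value, pick one matching id at random
for (i in seq_along(rUnique)) {
    ids <- df2$id[df2$freq == rUnique[i]]
    # Index by position so a single matching id is returned as-is
    x$id[i] <- ids[sample.int(length(ids), 1)]
}
print(x)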
hsl
  • This is not an answer (yet). – Frank May 01 '15 at 14:41
  • @hsl one question I have is what if r <-c(1,3,3,4,6,7). Can I still keep both 3s in my final data frame and find a random match for each? – AliCivil May 01 '15 at 22:01
  • If you don't use `unique(r)` you can keep both 3s (or as many 3s as you want). Just replace all the `rUnique` with your original vector `r`. However, you might end up getting similar pairs of id-freq, if that's what you want. If not, the code will be more complicated and I'm sure a simpler solution exists. – hsl May 01 '15 at 22:32
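
The case discussed in these comments (keeping both 3s in r while never returning the same record twice) can be sketched in base R without much extra code. This is only a sketch under the assumption that df has at least as many rows for each value as r asks for; it stops with an error otherwise:

```r
id   <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,1,4,15,16,17,18,19,20)
freq <- c(1,3,2,4,5,6,6,7,8,3,3,1,6,9,9,1,1,4,3,7,7)
df   <- data.frame(id, freq)
r    <- c(1, 3, 3, 4, 6, 7)

rows <- unlist(lapply(unique(r), function(v) {
  idx <- which(df$freq == v)   # candidate rows for this value
  n   <- sum(r == v)           # how many times the value occurs in r
  if (length(idx) < n) stop("not enough rows with freq = ", v)
  # Draw n distinct positions; safe even when idx has length 1
  idx[sample.int(length(idx), n)]
}))

res <- df[rows, ]
row.names(res) <- NULL
res
```

Because the positions are drawn without replacement within each value, the two rows returned for freq 3 are always different records.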