2

Rookie here -- I have a large data set of about 75,000 observations and 2000 unique IDs. Therefore, each ID has about 37 observations. Now, how can I take a random sample of unique IDs, say 4, such that I have a new data frame that contains 4 random unique IDs and their corresponding observations for a total of about 150 observations?

Community
kstats9pt3

4 Answers

7

Like this:

df <- data.frame(id = gl(2000, 37), obs = runif(74000)) # Example data set
ids <- sample(levels(df$id), 4)
df.sub <- df[df$id %in% ids, ]
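
As a quick sanity check on the lines above, here is a minimal sketch that reruns them and confirms the size of the result. The `set.seed()` call and the `droplevels()` step are my additions, not part of the original answer: the seed only makes the draw repeatable, and `droplevels()` removes the unused factor levels that the subset would otherwise carry along.

```r
set.seed(42)  # assumption: fixed seed, only so the draw is repeatable

# Same example data as above: 2000 IDs with 37 observations each
df <- data.frame(id = gl(2000, 37), obs = runif(74000))

ids <- sample(levels(df$id), 4)     # 4 distinct IDs (sample() does not replace by default)
df.sub <- df[df$id %in% ids, ]      # keep every row that belongs to those IDs
df.sub$id <- droplevels(df.sub$id)  # drop the 1996 factor levels no longer in use

nrow(df.sub)        # 4 IDs * 37 rows each = 148
nlevels(df.sub$id)  # 4
```

With real data the row count will vary with how many observations each sampled ID happens to have.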
David Arenburg
lukeA
6
library(dplyr)

## 4 is the subsample size
d_small <- ChickWeight %>% filter(Chick %in% sample(unique(Chick), 4)) 
Masato Nakazawa
  • +1 - just a side note: I wouldn't include the package `nlme`, which it seems you're only loading to use a sample data set. You could just use one of the datasets in base R so others don't need to load (and possibly install) an extra package. – talat Aug 21 '14 at 18:32
  • @beginneR, thanks. I have found a dataset in base R that has multiple observations per "subject" and modified the code. – Masato Nakazawa Aug 21 '14 at 18:46
3

If you don't have a targeted set of IDs, then you could pull some with:

   theseIDs <- sample( unique(sample(dset$IDs, 100) ), 4)

You could probably draw fewer than 100 values in the inner call, but with 100 it seems unlikely to fail for lack of 4 unique values.

If you intend to construct a sample of 150 from dset$IDs that represents the distribution of 4 specific IDs, stored in theseIDs, then this is probably the simplest method:

 samp150 <- sample( dset$IDs[ dset$IDs %in% theseIDs] , 150 ) 

If you were considering repeating this process (or extending it to other item-sets), another method would be to construct a table, using the function of the same name, to get probabilities, and then sample with replacement from theseIDs using the probabilities from your table.
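
A minimal sketch of that table-based variant, under stated assumptions: the `dset` built here is a hypothetical stand-in with uneven group sizes, and the names `n_per_id`, `tab` and `probs` are mine. Because it samples with replacement, it should also sidestep the error `sample()` raises when asked for more values than the pool contains, which can happen without replacement if the 4 IDs have fewer than 150 rows between them.

```r
set.seed(1)  # assumption: fixed seed, only so the draw is repeatable

# Hypothetical stand-in for dset: 2000 IDs with 30 to 44 rows each
n_per_id <- sample(30:44, 2000, replace = TRUE)
dset <- data.frame(IDs = rep(seq_len(2000), times = n_per_id))

theseIDs <- sample(unique(dset$IDs), 4)

# Relative frequency of each chosen ID among the matching rows
tab   <- table(dset$IDs[dset$IDs %in% theseIDs])
probs <- prop.table(tab)

# Draw 150 IDs with replacement, weighted by those frequencies
samp150 <- sample(names(probs), 150, replace = TRUE, prob = as.vector(probs))
```

Note that `names(probs)` are character strings, so `samp150` comes back as character; wrap it in `as.integer()` if the IDs need to stay numeric.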

IRTFM
0

Here's a general approach. Without seeing a portion of your data frame, it's impossible to give exact instructions. If your data set is named m with a column named ID, then you can do something like

idx <- sample(unique(m$ID), 4)
m.reduced <- m[m$ID %in% idx, ]
MAB
  • I would at least add a number to indicate the sample size to be selected, e.g. `idx <- sample(unique(m$ID), 4)`. – talat Aug 21 '14 at 18:29