Rookie here -- I have a large data set of about 75,000 observations and 2000 unique IDs. Therefore, each ID has about 37 observations. Now, how can I take a random sample of unique IDs, say 4, such that I have a new data frame that contains 4 random unique IDs and their corresponding observations for a total of about 150 observations?
4 Answers
Like this:
df <- data.frame(id = gl(2000, 37), obs = runif(74000)) # Example data set
ids <- sample(levels(df$id), 4)
df.sub <- df[df$id %in% ids, ]
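To confirm the subset has the expected shape, a quick check along these lines may help. It reuses the example data from above; the `set.seed()` call is an addition of mine so the draw is repeatable, and `droplevels()` is optional tidying for the factor column.

```r
set.seed(1)  # assumption: fixed seed, only so the example is repeatable

# Same example data as above: 2000 IDs with 37 observations each
df <- data.frame(id = gl(2000, 37), obs = runif(74000))

# Draw 4 distinct IDs and keep every row belonging to them
ids <- sample(levels(df$id), 4)
df.sub <- df[df$id %in% ids, ]

n_ids  <- length(unique(df.sub$id))  # distinct IDs actually present
n_rows <- nrow(df.sub)               # 4 IDs x 37 observations each

# Subsetting keeps all 2000 factor levels; droplevels() removes the unused ones
df.sub <- droplevels(df.sub)
n_levels <- nlevels(df.sub$id)
```

Dropping the unused levels matters mostly if you later call `table()` or plot by `id`, since the 1996 empty levels would otherwise show up.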

library(dplyr)
## 4 is the subsample size
d_small <- ChickWeight %>% filter(Chick %in% sample(unique(Chick), 4))

+1 - just a side note, I wouldn't include the package `nlme`, which it seems you're only loading to use a sample data set. You could just use one of the datasets in base R so others don't need to load (and possibly install) an extra package. – talat Aug 21 '14 at 18:32
@beginneR, thanks. I have found a dataset in base R that has multiple observations per "subject" and modified the code. – Masato Nakazawa Aug 21 '14 at 18:46
If you don't have a targeted set of IDs, then you could pull some with:
theseIDs <- sample( unique(sample(dset$IDs, 100) ), 4)
You could probably sample fewer than 100 to get a subsample, but 100 seems unlikely to fail because of insufficient unique values.
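As a rough illustration of why 100 draws is usually plenty, here is a small sketch. The `dset` data frame is a hypothetical stand-in shaped like the data in the question (2000 IDs, 37 rows each), and the seed is only there for repeatability.

```r
set.seed(42)  # assumption: fixed seed for a repeatable illustration

# Hypothetical stand-in for the asker's data: 2000 IDs, 37 rows each
dset <- data.frame(IDs = rep(1:2000, each = 37))

# Draw 100 rows, then reduce them to the distinct IDs they contain
pool <- unique(sample(dset$IDs, 100))
n_pool <- length(pool)  # number of distinct IDs among the 100 draws

# Pick the final 4 IDs from that pool
theseIDs <- sample(pool, 4)
```

With 2000 IDs, repeats among 100 draws are uncommon, so `pool` typically holds far more than the 4 IDs needed.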
If you intend to draw a sample of 150 from dset$IDs that reflects the distribution of 4 specific IDs, stored in theseIDs, then this is probably the simplest method:
samp150 <- sample( dset$IDs[ dset$IDs %in% theseIDs] , 150 )
Note that sample() draws without replacement by default, so this call errors if fewer than 150 rows match; pass replace = TRUE in that case.
If you were considering repeating this process (or extending it to other item sets), another method would be to build a table with the table() function, derive probabilities from it, and then sample with replacement from theseIDs using those probabilities.
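A minimal sketch of that table-based approach, assuming a hypothetical `dset` with 2000 IDs and 37 rows each. The names `tab` and `probs` are mine, not from the answer.

```r
set.seed(7)  # assumption: fixed seed for a repeatable illustration

# Hypothetical stand-in for the asker's data: 2000 IDs, 37 rows each
dset <- data.frame(IDs = rep(1:2000, each = 37))
theseIDs <- sample(unique(dset$IDs), 4)

# Count how often each chosen ID occurs, then convert counts to proportions
tab   <- table(dset$IDs[dset$IDs %in% theseIDs])
probs <- as.vector(prop.table(tab))

# Sample 150 IDs with replacement, weighted by those proportions.
# names(tab) is used so the values line up with the order of probs;
# note that it returns the IDs as character strings.
samp150 <- sample(names(tab), 150, replace = TRUE, prob = probs)
```

Once `tab` is built, it can be reused for any number of draws without touching the full data set again, which is the appeal if the process is repeated.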

Here's a general approach. Without seeing a portion of your data frame it's impossible to give exact instructions. If your data set is named m with a column named ID, then you can do something like
> idx <- sample(unique(m$ID), 4)
> m.reduced <- m[m$ID %in% idx, ]

I would at least add a number to indicate the sample size to be selected, e.g. `idx <- sample(unique(m$ID), 4)`. – talat Aug 21 '14 at 18:29