How to take a random sample of variable with multiple observations

Question

Rookie here -- I have a large data set of about 75,000 observations and 2000 unique IDs. Therefore, each ID has about 37 observations. Now, how can I take a random sample of unique IDs, say 4, such that I have a new data frame that contains 4 random unique IDs and their corresponding observations for a total of about 150 observations?

score 7 · Accepted Answer · edited Aug 21 '14 at 18:13

Like this:

df <- data.frame(id = gl(2000, 37), obs = runif(74000)) # Example data set
ids <- sample(levels(df$id), 4)
df.sub <- df[df$id %in% ids, ]
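To check that the subset matches what the question asks for (4 IDs and roughly 150 rows), something like the following could be used. This is only a sketch on the same example data; `set.seed()` and `droplevels()` are additions here, not part of the original answer. `droplevels()` matters because `id` is a factor, so the subset would otherwise still carry all 2000 levels.

```r
set.seed(42)  # assumption: added only so the draw is repeatable
df <- data.frame(id = gl(2000, 37), obs = runif(74000))  # same example data
ids <- sample(levels(df$id), 4)

# Subset as above, then drop the 1996 factor levels that are no longer used
df.sub <- droplevels(df[df$id %in% ids, ])

nlevels(df.sub$id)  # number of IDs kept: 4
nrow(df.sub)        # 4 IDs x 37 observations each = 148 rows
table(df.sub$id)    # observations per sampled ID
```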

edited Aug 21 '14 at 18:13

David Arenburg


answered Aug 21 '14 at 18:11

lukeA


Masato Nakazawa · Answer 2 · 2014-08-21T18:45:10.883

6

library(dplyr)

## 4 is the subsample size
d_small <- ChickWeight %>% filter(Chick %in% sample(unique(Chick), 4))

edited Aug 21 '14 at 18:45

answered Aug 21 '14 at 18:20

Masato Nakazawa


+1 - just a side note, I wouldn't include the package `nlme` which, it seems, you're only loading to use a sample data set. You could just use one of the datasets in base R so others don't need to load (and possibly install) an extra package. – talat Aug 21 '14 at 18:32
@beginneR, thanks. I have found a dataset in the base R that has multiple observations per "subject" and modified the code. – Masato Nakazawa Aug 21 '14 at 18:46

IRTFM · Answer 3 · 2014-08-21T18:53:18.787

If you don't have a targeted set of IDs, then you could pull some with:

   theseIDs <- sample( unique(sample(dset$IDs, 100) ), 4)

You could probably sample fewer than 100 values in the inner step, but with 100 draws this is unlikely to fail for lack of enough unique IDs.

If you intend to draw a sample of 150 from dset$IDs that reflects the distribution of 4 specific IDs, stored in theseIDs, then this is probably the simplest method:

 # replace = TRUE avoids an error when the 4 IDs have fewer than 150 rows in total
 samp150 <- sample( dset$IDs[ dset$IDs %in% theseIDs] , 150, replace = TRUE )

If you plan to repeat this process (or extend it to other item sets), another method is to build a frequency table with table(), convert the counts to probabilities, and then sample with replacement from theseIDs using those probabilities.
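The table-based idea could look roughly like this. It is a minimal sketch, assuming a data frame `dset` with an `IDs` column; the data here are made up, and the names `id_counts` and `probs` are illustrative rather than taken from the answer.

```r
set.seed(1)  # only so this illustration is repeatable

# Hypothetical data: 2000 IDs with unequal numbers of observations
dset <- data.frame(IDs = sample(1:2000, 75000, replace = TRUE))

# Pick 4 IDs at random from the unique values
theseIDs <- sample(unique(dset$IDs), 4)

# Count how often each chosen ID occurs, then turn counts into probabilities
id_counts <- table(dset$IDs[dset$IDs %in% theseIDs])
probs <- prop.table(id_counts)

# Draw 150 IDs with replacement, weighted by their observed frequencies.
# names(probs) is character, so samp150 holds the IDs as strings.
samp150 <- sample(names(probs), 150, replace = TRUE, prob = as.numeric(probs))
```

Once `probs` is built, the last line can be rerun as often as needed without touching `dset` again.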

MAB · Answer 4 · 2014-08-21T22:37:38.117

0

Here's a general approach. Without seeing a portion of your data frame, it's impossible to give exact instructions. If your data set is named m and has a column named ID, then you can do something like:

idx <- sample(unique(m$ID), 4)
m.reduced <- m[m$ID %in% idx, ]
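If this needs to be done more than once, the two lines could be wrapped in a small function. The helper below is hypothetical (the name `sample_by_id` and its arguments are not from the answer) and is only a sketch; it adds a check that the requested number of IDs does not exceed the number available.

```r
# Hypothetical helper: keep all rows belonging to n randomly chosen IDs
sample_by_id <- function(data, id_col, n) {
  ids <- unique(data[[id_col]])
  if (n > length(ids)) {
    stop("n exceeds the number of unique IDs")
  }
  # Sampling positions avoids sample()'s special case for a single numeric value
  keep <- ids[sample(seq_along(ids), n)]
  data[data[[id_col]] %in% keep, , drop = FALSE]
}

# Example with made-up data: 10 IDs, 5 observations each
m <- data.frame(ID = rep(1:10, each = 5), value = rnorm(50))
m.reduced <- sample_by_id(m, "ID", 4)
```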

edited Aug 21 '14 at 22:37

answered Aug 21 '14 at 18:12

MAB


I would at least add a number to indicate the sample size to be selected, e.g. `idx <- sample(unique(m$ID), 4)`. – talat Aug 21 '14 at 18:29
