Selecting random portions of a dataframe

Question

My dataset is a series of surveys. Each survey is divided up into several time periods and each time period has several observations. Each line in the dataset is a single observation. It looks something like this:

Survey     Period     Observation
  1.1        1            A
  1.1        1            A
  1.1        1            B
  1.1        2            A
  1.1        2            B
  1.2        1            A
  1.2        2            B
  1.2        3            C
  1.2        4            D

This is a simplified version of my dataset, but it demonstrates the point (several periods for each survey, several observations for each period). What I want to do is make a dataframe consisting of all the observations from a single, randomly selected, period in each survey, so that in the resulting dataframe each survey only has a single period, but all of the associated observations. I'm completely stumped on this one and don't even know where to start.

Thanks for your help

You can start here: [http://stackoverflow.com/questions/25937466/splitting-dataframe-into-confirmatory-and-exploratory-samples?rq=1] — R Yoda, Nov 11 '15 at 17:48

score 2 · Accepted Answer · answered Nov 11 '15 at 18:14

If I've understood correctly, for each survey you need to randomly select one period only and then get all corresponding observations. There might be alternative ways, but I'm using a dplyr approach.

dt = read.table(text="Survey     Period     Observation
                1.1        1            A
                1.1        1            A
                1.1        1            B
                1.1        2            A
                1.1        2            B
                1.2        1            A
                1.2        2            B
                1.2        3            C
                1.2        4            D", header=TRUE)

library(dplyr)

set.seed(49)  ## just to be able to replicate the process exactly

dt %>%
  select(Survey, Period) %>%               ## select relevant columns
  distinct() %>%                           ## keep unique combinations
  group_by(Survey) %>%                     ## for each survey
  sample_n(1) %>%                          ## sample only one period
  ungroup() %>%                            ## forget about the grouping
  inner_join(dt, by=c("Survey","Period"))  ## get corresponding observations

#    Survey Period Observation
#     (dbl)  (int)      (fctr)
# 1    1.1      1           A
# 2    1.1      1           A
# 3    1.1      1           B
# 4    1.2      2           B

score 1 · Answer 2 · answered Nov 11 '15 at 18:39

You can achieve what you need in a straightforward way using plain vanilla base R, doing something like this (here d is your data frame):

out = d[0,] # make empty dataframe with the same structure as d
for( survey in unique( d$Survey ) ) { # for each value of Survey
  # distinct values of Period observed for this value of Survey:
  periods = unique( d[ d$Survey == survey, ]$Period )
  # randomly choose 1 of them; sampling an index avoids sample(n, 1)
  # drawing from 1:n when only a single period n is present
  period = periods[ sample( length( periods ), 1 ) ]
  # attach all rows with that survey and that period to the df above
  out = rbind( out, d[ d$Survey == survey & d$Period == period, ] )
}
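
If you'd rather not grow a data frame with rbind() inside a loop, the same idea can be sketched with split() and lapply(). This is only a sketch on the example data from the question: d, pick_one_period and out are illustrative names, and it assumes you want each distinct period of a survey to be equally likely.

```r
d <- read.table(text = "Survey Period Observation
1.1 1 A
1.1 1 A
1.1 1 B
1.1 2 A
1.1 2 B
1.2 1 A
1.2 2 B
1.2 3 C
1.2 4 D", header = TRUE)

set.seed(49)  # only so the draw can be repeated

# hypothetical helper: keep the rows of one randomly chosen period
pick_one_period <- function(s) {
  periods <- unique(s$Period)                        # distinct periods in this survey
  chosen  <- periods[sample.int(length(periods), 1)] # draw an index, then look it up
  s[s$Period == chosen, ]
}

# split by survey, pick a period in each piece, stack the pieces again
out <- do.call(rbind, lapply(split(d, d$Survey), pick_one_period))
rownames(out) <- NULL  # drop the row names created by rbind
out
```

Drawing an index with sample.int() rather than calling sample() on the values themselves keeps this safe for surveys that contain only one period.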
