I have a dataframe of the form shown below. The cases have been pre-clustered into subgroups of varying populations, including singletons. I am trying to write some code that will sample (without replacement) any specified number of rows from the dataframe, but spread as evenly as possible across clusters.
> testdata
   Cluster Name
1        1    A
2        1    B
3        1    C
4        2    D
5        3    E
6        3    F
7        3    G
8        3    H
9        4    I
10       5    J
11       5    K
12       5    L
13       5    M
14       5    N
15       6    O
16       7    P
17       7    Q
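For reproducibility, the data above can be rebuilt with a snippet along these lines. The cluster sizes and names are read straight off the printout; storing `Name` as character rather than factor is an assumption.

```r
# Rebuild the example data: cluster sizes 3, 1, 4, 1, 5, 1, 2 (17 rows in total)
testdata <- data.frame(
  Cluster = rep(1:7, times = c(3, 1, 4, 1, 5, 1, 2)),
  Name = LETTERS[1:17],
  stringsAsFactors = FALSE  # assumption: Name kept as plain character
)
```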
For example, if I ask for a sample of 3 rows, I would like to pull one random row from each of 3 randomly chosen clusters (i.e. not the first rows of clusters 1-3 every time, though that is one valid outcome).
Acceptable examples:
> testdata_subset
   Cluster Name
1        1    A
5        3    E
12       5    L
> testdata_subset
   Cluster Name
6        3    F
14       5    N
15       6    O
Incorrect example:
> testdata_subset
   Cluster Name
6        3    F
8        3    H
13       5    M
The same idea applies up to a sample size of 7 in the example data shown (1 per cluster). For higher sample sizes, I would like to draw from each cluster evenly as far as possible, then evenly across the remaining clusters with unsampled rows, and so on, until the specified number of rows has been sampled.
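To make "as evenly as possible" checkable, the rule above can be turned into a small test on a candidate sample. This is only a sketch of the acceptance criterion, not a sampling method: `is_even_spread` is a hypothetical helper name, and the data frame is rebuilt here from the printout so the block runs on its own.

```r
# Example data rebuilt from the printout above (cluster sizes 3, 1, 4, 1, 5, 1, 2)
testdata <- data.frame(
  Cluster = rep(1:7, times = c(3, 1, 4, 1, 5, 1, 2)),
  Name = LETTERS[1:17],
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the sampled row indices `rows` are spread
# as evenly as possible across the clusters of `data`
is_even_spread <- function(rows, data) {
  if (anyDuplicated(rows) > 0) return(FALSE)   # sampling is without replacement
  clusters <- factor(data$Cluster)
  sizes  <- as.integer(table(clusters))        # rows available per cluster
  counts <- as.integer(table(clusters[rows]))  # rows sampled per cluster (zeros kept)
  # With m the largest per-cluster count, every cluster must hold at least
  # m - 1 sampled rows, or all of its rows if it is smaller than that
  all(counts >= pmin(sizes, max(counts) - 1))
}

is_even_spread(c(1, 5, 12), testdata)  # rows of the first acceptable example
is_even_spread(c(6, 8, 13), testdata)  # rows of the incorrect example
```

The first call should return TRUE and the second FALSE, since the second sample takes two rows from cluster 3 while five clusters are still untouched.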
I know how to sample N rows indiscriminately:
testdata[sample(nrow(testdata), N),]
But this pays no regard to the clusters. I also used plyr to randomly sample N rows per cluster:
ddply(testdata,"Cluster", function(z) z[sample(nrow(z), N),])
But this fails as soon as N exceeds the number of rows in any cluster (with this data, for any N > 1, because of the singleton clusters). I then capped the request at each cluster's size to begin to handle that:
numsamp_per_cluster <- 2
# Take at most numsamp_per_cluster rows from each cluster
ddply(testdata, "Cluster", function(z) z[sample(nrow(z), min(numsamp_per_cluster, nrow(z))), ])
This effectively caps the requested sample size at the size of each cluster. But in doing so, it loses control of the overall sample size. I am hoping (but starting to doubt) there is an elegant method using dplyr or a similar package that can do this kind of semi-randomised sampling. Either way, I am struggling to tie these elements together and solve the problem.