I have a dataframe of the form shown below. The cases have been pre-clustered into subgroups of varying populations, including singletons. I am trying to write some code that will sample (without replacement) any specified number of rows from the dataframe, but spread as evenly as possible across clusters.

> testdata
   Cluster Name
1        1    A
2        1    B
3        1    C
4        2    D
5        3    E
6        3    F
7        3    G
8        3    H
9        4    I
10       5    J
11       5    K
12       5    L
13       5    M
14       5    N
15       6    O
16       7    P
17       7    Q

For example, if I ask for a sample of 3 rows, I would like to pull one random row from each of 3 randomly chosen clusters (i.e. not the first rows of clusters 1-3 every time, though this is one valid outcome).

Acceptable examples:

> testdata_subset
   Cluster Name
1        1    A
5        3    E
12       5    L 

> testdata_subset
   Cluster Name
6        3    F
14       5    N
15       6    O

Incorrect example:

> testdata_subset
   Cluster Name
6        3    F
8        3    H
13       5    M

The same idea applies up to a sample size of 7 in the example data shown (1 per cluster). For higher sample sizes, I would like to draw from each cluster evenly as far as possible, then evenly across the remaining clusters with unsampled rows, and so on, until the specified number of rows has been sampled.
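To make the "as evenly as possible" requirement concrete, here is a minimal sketch of a check for it. The helper name `is_even_sample` and its arguments are hypothetical, not part of the original post. It only looks at cluster labels, so it does not verify that rows were drawn without replacement.

```r
# Hypothetical helper (not from the original post): checks whether the cluster
# labels of a sample are spread as evenly as possible over the clusters.
is_even_sample <- function(sub_cluster, all_cluster) {
  lev   <- sort(unique(all_cluster))
  sizes <- as.integer(table(factor(all_cluster, levels = lev)))  # rows available per cluster
  taken <- as.integer(table(factor(sub_cluster, levels = lev)))  # rows sampled per cluster
  m <- max(taken)
  # Every cluster must have reached round m - 1, unless it ran out of rows first
  all(taken >= pmin(sizes, m - 1))
}

all_cluster <- rep(1:7, times = c(3, 1, 4, 1, 5, 1, 2))  # Cluster column of testdata
is_even_sample(c(1, 3, 5), all_cluster)  # first acceptable example: TRUE
is_even_sample(c(3, 3, 5), all_cluster)  # the incorrect example: FALSE
```

The rule encoded here is that no cluster may receive its m-th row while another cluster with rows left has fewer than m - 1.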

I know how to sample N rows indiscriminately:

testdata[sample(nrow(testdata), N),]

But this pays no regard to the clusters. I also used plyr to randomly sample N rows per cluster:

ddply(testdata,"Cluster", function(z) z[sample(nrow(z), N),])

But this fails as soon as you ask for more rows than there are in a cluster (i.e. if N > 1, since some clusters are singletons). I then added an if/else statement to begin to handle that:

numsamp_per_cluster <- 2

ddply(testdata, "Cluster", function(z) z[sample(nrow(z), min(numsamp_per_cluster, nrow(z))), ])

This effectively caps the sample size asked for to the size of each cluster. But in doing so, it loses control of the overall sample size. I am hoping (but starting to doubt) there is an elegant method using dplyr or similar package that can do this kind of semi-randomised sampling. Either way, I am struggling to tie these elements together and solve the problem.

Joe
ADV
  • At a first glance, your question seems similar to a question I asked a while ago: [Stratified sampling with restrictions: fixed total size evenly partitioned among groups](http://stackoverflow.com/questions/35800181/stratified-sampling-with-restrictions-fixed-total-size-evenly-partitioned-among) – Henrik Oct 21 '16 at 16:11
  • It is similar. Very different solutions, interestingly. If I had chosen 'partitioning' as one of my googling terms I might have found it. Worth linking the two pages. – ADV Oct 22 '16 at 11:06

3 Answers

1

The strategy: first, randomly assign an order to the rows inside each cluster; this is stored in the `inside` variable below. Next, randomly order the first choices of all clusters, then the second choices, and so on (the `outside` variable). Finally, order the data frame so that the first choices of every cluster come first, then the second choices, and so on, breaking ties with `outside`. Something like this:

set.seed(1)
inside<-ave(seq_along(testdata$Cluster),testdata$Cluster,FUN=function(x) sample(length(x)))
outside<-ave(inside,inside,FUN=function(x) sample(seq_along(x)))
testdata[order(inside,outside),]   
#   Cluster Name
#10       5    J
#15       6    O
#4        2    D
#5        3    E
#9        4    I
#16       7    P
#1        1    A
#13       5    M
#3        1    C
#17       7    Q
#7        3    G
#6        3    F
#14       5    N
#2        1    B
#12       5    L
#8        3    H
#11       5    K

Now the first n rows of the resulting data frame are the sample you are looking for, e.g. `head(testdata[order(inside, outside), ], n)`.
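The steps above can be wrapped in a function. This is a minimal sketch under some assumptions: the function name `sample_even` and its `cluster_col` argument are made up for illustration, and the example data are rebuilt from the question so that the block runs on its own.

```r
# Rebuild the example data from the question
testdata <- data.frame(
  Cluster = rep(1:7, times = c(3, 1, 4, 1, 5, 1, 2)),
  Name    = LETTERS[1:17],
  stringsAsFactors = FALSE
)

# Hypothetical wrapper name and arguments; the body follows the steps above
sample_even <- function(df, n, cluster_col = "Cluster") {
  stopifnot(n >= 0, n <= nrow(df))
  cl <- df[[cluster_col]]
  # Random rank of each row within its own cluster
  inside <- ave(seq_along(cl), cl, FUN = function(x) sample(length(x)))
  # Random tie-breaker among rows that share the same within-cluster rank
  outside <- ave(inside, inside, FUN = function(x) sample(length(x)))
  df[order(inside, outside)[seq_len(n)], , drop = FALSE]
}

set.seed(42)
sample_even(testdata, 3)   # three rows from three different clusters
sample_even(testdata, 10)  # every cluster once, then three of the larger clusters again
```

Because rows are ranked rather than drawn one at a time, the result is without replacement by construction, and any `n` up to `nrow(df)` is valid.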

nicola
  • This works. Treating it as a list order problem was the key, I think. Much appreciated. – ADV Oct 22 '16 at 11:06
0

Base R option: you can randomly sample from the unique cluster values, and then randomly sample one name from each chosen cluster. Not very elegant, but it can be wrapped in a function. `n` is the number of clusters to draw from, one row each.

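sampler <- function(df, n) {
  # Pick n distinct clusters at random (column 1 holds the cluster labels)
  s <- sample(unique(df[, 1]), n)
  # Pick one name at random from each chosen cluster (column 2 holds the names)
  nm <- sapply(s, function(x) sample(df[which(df[, 1] == x), 2], 1, replace = FALSE))
  data.frame(cluster = s, name = nm)
}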

> sampler(testdata,6)
  cluster name
1       4    I
2       2    D
3       6    O
4       1    A
5       7    Q
6       5    K
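A note on the limit of this approach, as a small illustration rather than part of the function: `sample()` without replacement cannot return more values than it is given, so the first line of `sampler()` raises an error once `n` exceeds the number of distinct clusters (7 in the example data). The error is caught here only so that the snippet runs to completion.

```r
clusters <- 1:7  # the distinct cluster labels in the example data
# Asking for 8 distinct clusters out of 7 raises an error, caught here for illustration
res <- tryCatch(sample(clusters, 8), error = function(e) e)
inherits(res, "error")  # TRUE
```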
paqmo
  • This works great for n <= 7, i.e. as long as there are still rows to sample from all clusters. But larger values of n cause `sample()` to fail - looks like it's trying to pull more rows out of a fully-sampled cluster. – ADV Oct 22 '16 at 10:45
0

Here is a function that will do the sampling for you. First, I build an index of row positions for each unique cluster and shuffle the positions within each cluster. Then I shuffle the order of the clusters (an earlier version ordered them by size, which made the cluster order non-random). Finally, I pad the shorter clusters with NA, interleave everything into one long vector so that the clusters take turns, drop the NAs, and take as many elements as requested.

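sample_df <- function(df, iter) {
  l <- unique(df$Cluster)
  # Row positions belonging to each cluster
  cluster_pos <- lapply(l, function(x) which(df$Cluster == x))
  # Shuffle within each cluster; the length check stops sample(x) from sampling 1:x for a single number
  random_cluster_pos <- lapply(cluster_pos, function(x) if (length(x) > 1) sample(x) else x)
  # Shuffle the cluster order (an earlier version ordered the clusters by size instead)
  index <- sample(random_cluster_pos)
  # Pad every cluster with NA up to the largest cluster size, then interleave round by round
  inde_pos <- c(t(sapply(index, "[", seq_len(max(lengths(index))))))
  inde_pos <- inde_pos[!is.na(inde_pos)]
  df[head(inde_pos, iter), ]
}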
sample_df(testdata, 3)
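The interleaving step is the least obvious part, so here is a small worked illustration on made-up positions (the list `index` below is hypothetical, not taken from the example data). Indexing each element past its length yields NA, which pads all clusters to the same length; transposing and flattening then reads the first pick of every cluster, then the second, and so on.

```r
# Three hypothetical clusters with already-shuffled row positions
index <- list(c(5, 7), 4, c(1, 3, 2))
k <- max(lengths(index))                  # size of the largest cluster
padded <- sapply(index, "[", seq_len(k))  # k x 3 matrix, one column per cluster, NA-padded
interleaved <- c(t(padded))               # round 1 of every cluster, then round 2, ...
interleaved[!is.na(interleaved)]
# 5 4 1 7 3 2
```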
Chirayu Chamoli
  • This doesn't work for me. What's the 'mycluster' object? I can't see where that gets created. – ADV Oct 22 '16 at 10:49
  • I now see a further problem (sorry, I appreciate your efforts). While the function manages to randomly sample from within clusters, it selects from clusters in a non-random order. `sample_df(testdata,3)` will always choose a member of cluster 5, then cluster 3, then cluster 1. In this instance, I would require 3 rows from 3 separate _and randomly chosen_ clusters. Hope that's clear. – ADV Oct 22 '16 at 12:06
  • yeah i also felt that would be a problem. I ordered the list according to the length so that i get an evenly spaced out cluster. any way i have edited the function to give you random sample every time. – Chirayu Chamoli Oct 22 '16 at 12:15