0

I have a very large dataframe (N = 107,251), that I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, Age Category with 6 levels, and Region with 5 levels).

I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age)), but I'm not sure how to utilise this information to actually do the sampling.

Sample dataset:

set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)

Probabilities:

round(prop.table(xtabs(~dat$Gender)), 3)  # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3)     # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3)  # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)

The end goal for this dummy example would be two data frames with ~500 observations in each (completely independent, no participant appearing in both), and approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real world applications, I'm not sure if every observation needs to be used or if it is better to get the splits more even.

I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Twitch_City
  • 1,236
  • 1
  • 10
  • 22
  • 1
    See `createDataPartition` in the `caret` package and also [this SO question](http://stackoverflow.com/questions/16493920/how-can-i-ensure-that-a-partition-has-representative-observations-from-each-leve). – eipi10 Sep 19 '14 at 16:03

2 Answers2

2

The following code basically creates a key based on the group membership then loops through each group, sampling half to one set and half (roughly) to the other. If you compare the resulting probabilities they are within 0.001 of each other. The downside to this is that its biased to make a larger sample size for the second group due to how rounding of odd-numbered group member number is handled. In this case the first sample is 488 observations and the second is 512. You can probably throw in some logic to account for that and even it out better.

EDIT: Added that logic and it split it up evenly.

set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)

dat$group <- with(dat, paste(Gender, Region, Age))
groups <- unique(dat$group)
setA <- dat[NULL,]
setB <- dat[NULL,]
for (i in 1:length(groups)){
  temp <- dat[dat$group==groups[i],]
  if (nrow(setA) > nrow(setB)){
    tempA <- temp[1:floor(nrow(temp)/2),]
    tempB <- temp[(1+floor(nrow(temp)/2)):nrow(temp),]
  } else {
    tempA <- temp[1:ceiling(nrow(temp)/2),]
    tempB <- temp[(1+ceiling(nrow(temp)/2)):nrow(temp),] 
  }
  setA <- rbind(setA, tempA)
  setB <- rbind(setB, tempB)
}
Leo
  • 1,773
  • 12
  • 19
2

You can check out my stratified function, which you should be able to use like this:

set.seed(1) ## just so you can reproduce this

## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)

## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]

summary(sample1)
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422
summary(sample2)
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466 

Compare the following and see if they are within your expectations.

x1 <- round(prop.table(
  xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
  xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
  xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)

It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.


Update:

stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.

set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
#  Gender  Region  Age          X1          
#  F:235   1:112   1:84   Min.   :-2.82847  
#  M:259   2: 90   2:78   1st Qu.:-0.69711  
#          3: 94   3:82   Median :-0.03200  
#          4: 97   4:80   Mean   :-0.01401  
#          5:101   5:90   3rd Qu.: 0.63844  
#                  6:80   Max.   : 2.90422  
#
# $SET2
#  Gender  Region  Age          X1          
#  F:238   1:114   1:85   Min.   :-2.76808  
#  M:268   2: 92   2:81   1st Qu.:-0.55173  
#          3: 97   3:83   Median : 0.02559  
#          4: 99   4:83   Mean   : 0.05789  
#          5:104   5:91   3rd Qu.: 0.74102  
#                  6:83   Max.   : 3.58466
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485