This post has been edited to describe the situation more accurately. I am utilising a form of jackknife sampling for my work: the jackknifed data will be used to calibrate a model, and the omitted data will be used for validation.

Rather than perform the analysis immediately, I want to save the jackknifed samples as data frames, along with the data that was removed for each sample...

It's hard to explain, so I will use an example to illustrate:

The aim in the example is to create the datasets 4 times. Each time there should be 2 datasets: one of 9 rows (the calibration set) and one of 3 rows (the validation set).

df <-
  data.frame(value1 = 1:(3*4),
             value2 = seq(from = 1000, by = 50, length.out = 3*4),
             tosplit = rep(1:4, each = 3))

df #df represents the dataframe in its entirety

dfs <- split(df, df$tosplit) #df is now split into 4 equal parts of 3

#####

> #Replicate 1
> r1_3parts <- do.call("rbind", dfs[1:3])
> r1_1parts <- do.call("rbind", dfs[4])
> 
> r1_3parts
    value1 value2 tosplit
1.1      1   1000       1
1.2      2   1050       1
1.3      3   1100       1
2.4      4   1150       2
2.5      5   1200       2
2.6      6   1250       2
3.7      7   1300       3
3.8      8   1350       3
3.9      9   1400       3
> r1_1parts
     value1 value2 tosplit
4.10     10   1450       4
4.11     11   1500       4
4.12     12   1550       4
> 
> #Replicate 2
> r2_3parts <- do.call("rbind", dfs[2:4])
> r2_1parts <- do.call("rbind", dfs[1])
> 
> r2_3parts
     value1 value2 tosplit
2.4       4   1150       2
2.5       5   1200       2
2.6       6   1250       2
3.7       7   1300       3
3.8       8   1350       3
3.9       9   1400       3
4.10     10   1450       4
4.11     11   1500       4
4.12     12   1550       4
> r2_1parts
    value1 value2 tosplit
1.1      1   1000       1
1.2      2   1050       1
1.3      3   1100       1
> 
> #Replicate 3
> r3_3parts <- do.call("rbind", dfs[c(3:4, 1)])
> r3_1parts <- do.call("rbind", dfs[2])
> 
> r3_3parts
     value1 value2 tosplit
3.7       7   1300       3
3.8       8   1350       3
3.9       9   1400       3
4.10     10   1450       4
4.11     11   1500       4
4.12     12   1550       4
1.1       1   1000       1
1.2       2   1050       1
1.3       3   1100       1
> r3_1parts
    value1 value2 tosplit
2.4      4   1150       2
2.5      5   1200       2
2.6      6   1250       2
> 
> 
> #Replicate 4
> r4_3parts <- do.call("rbind", dfs[c(4, 1:2)])
> r4_1parts <- do.call("rbind", dfs[3])
> 
> r4_3parts
     value1 value2 tosplit
4.10     10   1450       4
4.11     11   1500       4
4.12     12   1550       4
1.1       1   1000       1
1.2       2   1050       1
1.3       3   1100       1
2.4       4   1150       2
2.5       5   1200       2
2.6       6   1250       2
> r4_1parts
    value1 value2 tosplit
3.7      7   1300       3
3.8      8   1350       3
3.9      9   1400       3
> 

This doesn't appear to be an option in the packages I can find - they default to just computing the statistics for you. What I want is to see the sample datasets themselves, and also to specify their relative sizes. Is this possible in an existing package, or, if not, is there a suitable way to do this in a more automated fashion?

1 Answer

Without a random component, this doesn't really strike me as a bootstrap. It seems you are pursuing a variation on permutation.

The data frame can be split with a fairly simple function.

df <-
  data.frame(value1 = 1:(3*4),
             value2 = seq(from = 1000, by = 50, length.out = 3*4),
             tosplit = rep(1:4, each = 3))

split_into_two <- function(data, split_var, split_val){
  # TRUE for rows whose split_var value is in split_val, FALSE otherwise
  split <- data[[split_var]] %in% split_val

  # returns a two-element list of data frames, named "FALSE" and "TRUE"
  split(data, split)
}

split_into_two(df, "tosplit", 1:3)
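
Because split() names the two pieces "FALSE" and "TRUE", relabelling them can make downstream code clearer. A minimal sketch along those lines (the names calibration and validation are just illustrative):

parts <- split_into_two(df, "tosplit", 1:3)

# "TRUE" holds the rows whose tosplit value is in 1:3; "FALSE" holds the rest
names(parts)[names(parts) == "TRUE"]  <- "calibration"
names(parts)[names(parts) == "FALSE"] <- "validation"

lapply(parts, nrow)  # validation: 3 rows, calibration: 9 rows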

To get the four permutations you describe, we can use lapply:

lapply(list(1:3, 2:4, c(3:4, 1), c(4, 1:2)),
       function(x) split_into_two(df, "tosplit", x))

This saves a great deal of copy-paste.
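
If building the index vectors by hand gets tedious, the same leave-one-group-out pattern can be generated directly from the grouping column. A minimal sketch under that assumption (the helper name jackknife_splits is just illustrative, not from any package):

jackknife_splits <- function(data, split_var) {
  groups <- unique(data[[split_var]])

  # for each group, calibrate on all the other groups and validate on the one left out
  lapply(groups, function(g) {
    split_into_two(data, split_var, setdiff(groups, g))
  })
}

splits <- jackknife_splits(df, "tosplit")
length(splits)  # 4 replicates, each a two-part list of retained and held-out rows

Leaving out more than one group per replicate would change the relative sizes of the two parts accordingly.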

  • Thanks @Benjamin. I was advised that this was bootstrapping, and was referred to as such in all discussions with my PhD supervisor. But I see now that a more appropriate definition for this is actually jackknife resampling. I'll update the description to fit this. – Quinn Jan 19 '17 at 15:08