I need code, or an idea, for the following case: I have a dataset of 1000 rows and want to draw subsamples of 800 rows multiple times (I don't know how many times I should repeat). How can I make sure that every row is picked in at least one run? I need the code in R.

To make the question clearer, let's define the row names as:

rownames(dataset) = A,B,C,D,E,F,G,H,J,I

If I subsample 3 times:

A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J

Row I does not appear in any of the subsamples. I would like to subsample 80 or 90 percent of the data many times, but I need every row to be chosen in at least one of the subsamples. In the example above, I should be picked in at least one of them.

Nmgh
  • The question is not clear, what are *"rows with the size of 800"*? Can you give a small example of input and expected output, for instance, with 10 and 8 instead of 1000 and 800? – Rui Barradas Jan 24 '23 at 08:08
  • Really, coding the approach will be simple once you can define it specifically. One option would be to sample without replacement 800 times, then sample a further 200 times with or without replacement, but this WILL impact the randomness of your sample. – Paul Stafford Allen Jan 24 '23 at 08:08
  • @RuiBarradas I modified the main post. I hope it is clear now. Thanks – Nmgh Jan 24 '23 at 08:27
  • @PaulStaffordAllen Thanks for your comment. I did not quite follow your approach. I have modified the main post to make my question clearer. – Nmgh Jan 24 '23 at 08:28
  • @NickCHK In each run I would do further processing and then combine the results. So I need to pick, let's say, 90 percent of the data each time and do my computation. – Nmgh Jan 24 '23 at 08:30

1 Answer

One way to do this is to give each row a single guaranteed ("forced") appearance: decide ahead of time, at random, which subsample each row is guaranteed to appear in. Then fill the rest of each subsample by random sampling without replacement from the remaining rows.

num_rows = 1000
num_subsamples = 1000
subsample_size = 900

full_index = 1:num_rows

dat = data.frame(i = full_index)

# Randomly assign guaranteed subsamples
# Make sure that we don't accidentally assign more than the subsample size
# If we're subsampling 90% of the data, it'll take at most a few tries
biggest_guaranteed_subsample = num_rows
while (biggest_guaranteed_subsample > subsample_size) {
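  # Assign the subsample that the row is guaranteed to appear in
  # (draw one subsample number per row, so this also works when
  # num_subsamples differs from num_rows)
  dat$guarantee = sample(1:num_subsamples, num_rows, replace = TRUE)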
  # Find the subsample with the most guaranteed slots taken
  biggest_guaranteed_subsample = max(table(dat$guarantee))
}


# Assign subsamples
for (ss in 1:num_subsamples) {
  # Pick out any rows guaranteed a slot in that subsample
  my_sub = dat[dat$guarantee == ss, 'i']
  # And randomly select the rest
  my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)], 
                            subsample_size - length(my_sub), 
                            replace = FALSE))
  # Do your subsample calculation here
}
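
To check that this really covers every row, the same idea can be wrapped in a function and tried on a small case (10 rows, subsamples of 8). This is only a sketch: the function name `covering_subsamples` and the seed are assumptions for illustration, not part of the code above. It indexes into the remaining rows rather than passing them to `sample()` directly, because `sample(x, ...)` treats a single number `x` as `1:x`.

```r
# Hypothetical helper: returns a list of index vectors, one per subsample,
# such that every row index appears in at least one of them.
covering_subsamples <- function(num_rows, num_subsamples, subsample_size) {
  stopifnot(subsample_size <= num_rows,
            num_subsamples * subsample_size >= num_rows)
  full_index <- seq_len(num_rows)

  # Redraw the guaranteed assignment until no subsample is over-full
  repeat {
    guarantee <- sample.int(num_subsamples, num_rows, replace = TRUE)
    if (max(tabulate(guarantee, nbins = num_subsamples)) <= subsample_size) break
  }

  lapply(seq_len(num_subsamples), function(ss) {
    # Rows guaranteed a slot in this subsample
    forced <- full_index[guarantee == ss]
    # Fill the remaining slots from the other rows, without replacement
    rest <- setdiff(full_index, forced)
    extra <- rest[sample.int(length(rest), subsample_size - length(forced))]
    c(forced, extra)
  })
}

set.seed(1)  # arbitrary seed, only for reproducibility
subs <- covering_subsamples(num_rows = 10, num_subsamples = 3, subsample_size = 8)

# TRUE if every row index appears in at least one subsample
all(seq_len(10) %in% unlist(subs))
```

Each element of `subs` can then be used to index the data, for example `dataset[subs[[1]], ]`.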
NickCHK
  • Thanks @NickCHK. I ran your code, and I assume that in the first while loop you are trying to make sure that each item is selected in one of the runs. However, some of the elements do not have any slot in the dat$guarantee column; e.g., after running the code, 1 does not have any slot while 2 has 4 slots. – Nmgh Jan 24 '23 at 08:52
  • In dat, data row i is guaranteed to be selected in the subsample number listed in guarantee. If a value does not appear in the guarantee column, it means that that subsample contains no guaranteed picks, not that that row is never selected. Does that help? – NickCHK Jan 24 '23 at 09:03
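
The distinction in the last comment can be seen in a small sketch. The sizes and the seed here are assumptions chosen only to make the effect obvious: with more subsamples than rows, some subsample numbers cannot appear in the guarantee column, yet every row still has exactly one guaranteed subsample.

```r
set.seed(42)           # arbitrary seed, only for reproducibility
num_rows <- 10
num_subsamples <- 20   # more subsamples than rows

# One guaranteed subsample number per row (position i belongs to row i)
guarantee <- sample.int(num_subsamples, num_rows, replace = TRUE)

# Every row has a guaranteed subsample
length(guarantee) == num_rows

# Subsample numbers that hold no guaranteed rows; these subsamples are
# still drawn, they are simply filled entirely at random
unused <- setdiff(seq_len(num_subsamples), guarantee)
unused
```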