-1

I have the following exemplary data frame:

df <- as.data.frame(cbind(Month <- c("Jun", "Jul", "Aug", "Oct", "Jun", "Aug", "Feb"),ID <- c(1,1,1,2,2,2,2), No_ind <- c(8,5,2,15,10,9,2)))
names(df) <- c('month', 'ID', 'No_ind')

I have already sorted the No_ind column in an descending order. What I now want to achieve is the following. For every group (ID), I want to get all rows within a group until the cumulative sum of the number of individuals (No_ind) has reached 15 individuals (including the row where we hit 15 ind). I don't want the sum to go beyond group borders, but start anew for each group.

I then want to be able to extract the rows (month * plot combinations) that are needed for getting to 15 individuals. If the algorithm is correct, it chooses all rows (month * plot) combinations of group 1, only the first month*plot combination of group 2.

Many thanks in advance. I really appreciate any help!

ABiologist
  • 43
  • 6
  • 1
    Relevant? https://stackoverflow.com/questions/45549992/group-vector-on-conditional-sum – Sotos Jul 23 '20 at 07:51

1 Answers1

1

You can take the cumulative sum for each ID and select rows until the value becomes 15.

This can be done in base R :

subset(df, as.logical(ave(No_ind, ID, FUN = function(x) 
           seq_along(x) <= which.max(cumsum(x) >= 15))))

#  month ID No_ind
#1   Jun  1      8
#2   Jul  1      5
#3   Aug  1      2
#4   Oct  2     15

Or in dplyr

library(dplyr)
df %>% group_by(ID) %>% slice(seq_len(which.max(cumsum(No_ind) >= 15)))

and data.table :

library(data.table)
setDT(df)[, .SD[seq_len(which.max(cumsum(No_ind) >= 15))], ID]

data

Make sure numbers are treated as numbers and not as strings.

df <- type.convert(df, as.is = TRUE)
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Is it really that simple? Maybe I misinterpreted. I think this line is confusing *I don't want the sum to go beyond group borders, but start anew for each group.* – Sotos Jul 23 '20 at 07:50
  • Based on the expected output which OP describes I thought this is what they want. If not, they need to give a better example I guess. – Ronak Shah Jul 23 '20 at 07:54
  • Thank you for your help! Ronak is right here. Sorry for being ambiguous. We are pretty close to the optimal solution. However, the last plot*month combination per group that should be included is supposed to be the plot where we cross the threshold of 15 ind. for the first time. Thank you for any further help! – ABiologist Jul 23 '20 at 08:36
  • @ABiologist Do you want to select rows till 15 is crossed or till 15? For 2nd `ID` you select only one row which was 15. Why is that so? – Ronak Shah Jul 23 '20 at 08:40
  • @RonakShah Until we get to 15 or beyond (as it is quite unlikely that we score a clear hit). Example: The increments are: 3 5 10 8. We reach 15 ind summing across the first 3 numbers. As the increments do not allow getting exactly 15 ind, we will take the first 3 (including the row were we hit the 15 ind). However, has the threshold already been reached with a previous combination, I don't want further plots to be selected. – ABiologist Jul 23 '20 at 08:49
  • @RonakShah So if we can manage to get exactly 15 ind, we will take the plots that are needed to get these individuals. If the increments do not allow that, we will take all plots that are needed to cross this threshold for the first time and no further plots that might add to the number of individuals. – ABiologist Jul 23 '20 at 08:54
  • Would there be cases when a particular `ID` do not sum till 15, let's say it would have only 10 individuals? What would be the output in that case? – Ronak Shah Jul 23 '20 at 08:57
  • @RonakShah: The true data has already been filtered by me to exclude IDs that have fewer than 15 individuals. – ABiologist Jul 23 '20 at 09:01
  • @RonakShah Stupendous job! Thank you very much for your help. It works just perfect! :) – ABiologist Jul 23 '20 at 09:13