-1

i have a data frame like this

set.seed(500)
df=data.frame(group=c(rep("A",20),rep("B",20),rep("C",20),rep("D",20)),value=round(runif(80,min=1,max=100)))

for each group i want to take the top value rows until their sum exceed/meet the target value

target=data.frame(group=c("A","B","C","D"),value=c(1000,400,500,300))

and output the new groups as 4 data frames.

I sorted them from biggest to smallest

df=df[with(df, order(group,-value)), ]

the desired output is

group value
a    98
a    93
...
a  (sum from 98 to here, the group a subtotal should exceed 1000)
b  93
...
c   99

What's the best way of doing this?

Thanks.

santoku
  • 3,297
  • 7
  • 48
  • 76
  • So can you give the desired output for your sample data? I'm really confused on what the result should be. – MrFlick Sep 14 '14 at 03:56
  • Suppose if the group `a` subtotal for a row say 11 becomes 1000. and the next row is 0, So, again 1000, for the next row 15. So, are you saying that you want rows up to 13? – akrun Sep 15 '14 at 15:25
  • @santoku I updated the code. I hope this works for you – akrun Sep 15 '14 at 15:47

3 Answers3

2

You could also do: (Using the ordered df)

 indx <- rep(target$value, table(df$group))
 val1 <- with(df, ave(value, group, FUN=cumsum))
 df[val1 <=indx,]
 #       group value
 #3      A    98
 #8      A    93
 #12     A    89
 #1      A    84
 #9      A    83
 #5      A    81
 #13     A    77
 #2      A    73
 #15     A    73
 #10     A    71
 #18     A    62
 #19     A    61
 #7      A    52
 #39     B    93
 #28     B    90
 #36     B    84
 #37     B    83
 #52     C    99
 #59     C    96
 #45     C    86
 #43     C    84
 #58     C    81
 #65     D    93
 #75     D    87
 #63     D    85

Or using data.table on the ordered df

 library(data.table)
 setkey(setDT(df), group)
 setkey(setDT(target), group)
 DT1 <- df[df[target, value1:= i.value][,
           cumsum(value) <value1, by=group]$V1, 1:2, with=FALSE]

Update

I guess you wanted something like this:

  indx2 <- which(val1 <=indx)
  indx3 <- unname(tapply(indx2,cumsum(c(TRUE,diff(indx2)!=1)), tail,1)+1)
  df1 <- df[sort(c(indx2,indx3)),]
   tapply(df1$value, df1$group, FUN=sum)
   # A    B    C    D 
  #1048  432  518  342 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • thanks. i wanted to use the first method, however, it doesn't return the rows of which the sum exceeds the value, but rather, below it. – santoku Sep 15 '14 at 15:12
1

This splits and limits the items in the dataframe. The next one-liner will pick the last row:

> lapply( split(df, df[[1]] ) , function(d) d[ cumsum( d[[2]]) < 200 , ] )
$A
  group value
1     A    84
2     A    73

$B
   group value
21     B     9
22     B    81
23     B     5
24     B    54
25     B    28

$C
   group value
41     C    20
42     C    43
43     C    84
44     C    49

$D
   group value
61     D     4
62     D    77
63     D    85

Then use tail

> lapply( split(df, df[[1]] ) , function(d) tail( d[ cumsum( d[[2]]) < 200 , ] ,1))
$A
  group value
2     A    73

$B
   group value
25     B    28

$C
   group value
44     C    49

$D
   group value
63     D    85

And if you want to pick the "largest values" then order the dataframe before doing the summation:

> lapply( split(df[order(df[[2]], decreasing=TRUE), ] , df[[1]] ) , function(d) tail( d[ cumsum( d[[2]]) < 200 , ] ,1))
$A
  group value
3     A    98

$B
   group value
62     D    77

$C
   group value
71     D    34

$D
   group value
74     D     2
IRTFM
  • 258,963
  • 21
  • 364
  • 487
-1

If i understand correctly, you want to the the largest values from each group until the sum of all thsoe values exceeds a certain threshold. If so, I think this code will do that

newdfs<-Map(function(d, m) { 
    d <-d[order(-d$value), ]
    d[cumsum(d$value) < m, ]
}, split(df, df$group), target$value[match(levels(df$group), target$group)])
newdfs

This reurns the data.frames in a list which is almost certainyl better than creating a bunch of new data.frames. If you wanted to merge the results into a single data.frame, you could do

do.call(rbind, newdfs)

to get

     group value
A.3      A    98
A.8      A    93
A.12     A    89
A.1      A    84
A.9      A    83
A.5      A    81
A.13     A    77
A.2      A    73
A.15     A    73
A.10     A    71
A.18     A    62
A.19     A    61
A.7      A    52
B.39     B    93
B.28     B    90
B.36     B    84
B.37     B    83
C.52     C    99
C.59     C    96
C.45     C    86
C.43     C    84
C.58     C    81
D.65     D    93
D.75     D    87
D.63     D    85
MrFlick
  • 195,160
  • 17
  • 277
  • 295