0

I would like to summarise this data set by grouping 1st by period, and 2nd by Payer id so that results are shown as subtotal for any given user by month as follows:

data.frame:

Payer   Period
1   10  1-1015
2   15  2-1015
3   14  3-1015
1   1   1-1015
3   5   1-1015
1   7   4-1015
3   8   4-1015
1   4   5-1015

And results should look like this:

Payer   Period
1   11  1-1015
3   5   1-1015
2   15  2-1015
3   14  3-1015
1   7   4-1015
3   8   4-1015
1   4   5-1015

Which is the best way to do this? Thanks!

smci
  • 32,567
  • 20
  • 113
  • 146
Chelo F
  • 43
  • 5
  • You mean you want to order first by period and then by payer? – theamateurdataanalyst Jul 12 '15 at 19:02
  • Was the `$` the name of a column? (I removed it in my edit because I thought it was there by mistake) – talat Jul 12 '15 at 19:02
  • You have one column name missing. If there is an ID column, `aggregate(Payer~ID+Period, df1, FUN=sum)` – akrun Jul 12 '15 at 19:02
  • Let's call the missing middle column name `Amount`... – smci Jul 12 '15 at 19:06
  • @theamateurdataanalyst thanks, I meant summing the totals by payer for every month – Chelo F Jul 12 '15 at 19:06
  • @akrun exacty. the $ column is Amount – Chelo F Jul 12 '15 at 19:08
  • 1
    depends on what you mean by *best*, really. If you want a fast way to do this with a flexible syntax, check out the [data.table package tutorials](https://github.com/Rdatatable/data.table/wiki/Getting-started) (@akrun's answer). See also [benchmarks on grouping](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). – Arun Jul 12 '15 at 19:14

2 Answers2

4

You can just do aggregate assuming that there are three columns.

 aggregate(Amount~., df1, FUN=sum)
 #    Payer Period Amount
 #1     1 1-1015     11
 #2     3 1-1015      5
 #3     2 2-1015     15
 #4     3 3-1015     14
 #5     1 4-1015      7
 #6     3 4-1015      8
 #7     1 5-1015      4

Or

 library(data.table)#v1.9.5+
 setDT(df1)[, list(Amount=sum(Amount)), .(Period, Payer)]
 #    Period Payer Amount
 #1: 1-1015     1     11
 #2: 2-1015     2     15
 #3: 3-1015     3     14
 #4: 1-1015     3      5
 #5: 4-1015     1      7
 #6: 4-1015     3      8
 #7: 5-1015     1      4

Using a different order

 aggregate(Amount~., df2, FUN=sum)
 #  Payer Period Amount
 #1     1 1-1015     11
 #2     3 1-1015      5
 #3     2 2-1015     15
 #4     3 3-1015     14
 #5     1 4-1015      7
 #6     3 4-1015      8
 #7     1 5-1015      4

data

 df1 <- structure(list(Payer = c(1L, 2L, 3L, 1L, 3L, 1L, 3L, 1L), 
 Amount = c(10L, 
 15L, 14L, 1L, 5L, 7L, 8L, 4L), Period = c("1-1015", "2-1015", 
 "3-1015", "1-1015", "1-1015", "4-1015", "4-1015", "5-1015")),
 .Names = c("Payer", 
  "Amount", "Period"), class = "data.frame", row.names = c(NA, -8L))

 set.seed(24)
 df2 <- df1[sample(nrow(df1)),]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This relies on the dataset having all entries in order of increasing Period, Payer. If any of the entries aren't, the result will be out of order. – smci Jul 12 '15 at 19:07
  • 1
    @smci. I do not think that is true in either instance. Neither `aggregate` nor the "by" argument to `data.table.[` require their arguments to be ordered by group. – IRTFM Jul 12 '15 at 19:08
  • I think @smci was thinking of sas :} – rawr Jul 12 '15 at 19:21
  • 1
    God no, I wouldn't touch SAS or SPSS. – smci Jul 12 '15 at 19:23
1
require(dplyr)
df %>% group_by(Period,Payer) %>%
    summarize(Amount = sum(Amount)) %>%
    ungroup() # this should ungroup by the last grouped var, i.e. Payer

# if that doesn't work, then add an explicit %>% arrange(Period, Payer)
smci
  • 32,567
  • 20
  • 113
  • 146
  • Is there a reason you choose to use this over `aggregate`? – theamateurdataanalyst Jul 12 '15 at 19:06
  • @theamateurdataanalyst: I thought the aggregate did not give explicit ordering by Period, Payer (as the OP asked for). But it does. So this is just a dplyr equivalent to the aggregate. dplyr(/data.table) almost surely performs better for high cardinality grouping, i.e. large files. – smci Jul 12 '15 at 19:22