How to summarise a table by 2 columns in R

Question

I would like to summarise this data set by grouping 1st by period, and 2nd by Payer id so that results are shown as subtotal for any given user by month as follows:

data.frame:

Payer   Period
1   10  1-1015
2   15  2-1015
3   14  3-1015
1   1   1-1015
3   5   1-1015
1   7   4-1015
3   8   4-1015
1   4   5-1015

And results should look like this:

Payer   Period
1   11  1-1015
3   5   1-1015
2   15  2-1015
3   14  3-1015
1   7   4-1015
3   8   4-1015
1   4   5-1015

Which is the best way to do this? Thanks!

You mean you want to order first by period and then by payer? — theamateurdataanalyst, Jul 12 '15 at 19:02
Was the `$` the name of a column? (I removed it in my edit because I thought it was there by mistake) — talat, Jul 12 '15 at 19:02
You have one column name missing. If there is an ID column, `aggregate(Payer~ID+Period, df1, FUN=sum)` — akrun, Jul 12 '15 at 19:02
@theamateurdataanalyst thanks, I meant summing the totals by payer for every month — Chelo F, Jul 12 '15 at 19:06
depends on what you mean by *best*, really. If you want a fast way to do this with a flexible syntax, check out the [data.table package tutorials](https://github.com/Rdatatable/data.table/wiki/Getting-started) (@akrun's answer). See also [benchmarks on grouping](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping). — Arun, Jul 12 '15 at 19:14

akrun · Answer 1 · 2015-07-12T19:11:51.057

You can just do aggregate assuming that there are three columns.

 aggregate(Amount~., df1, FUN=sum)
 #    Payer Period Amount
 #1     1 1-1015     11
 #2     3 1-1015      5
 #3     2 2-1015     15
 #4     3 3-1015     14
 #5     1 4-1015      7
 #6     3 4-1015      8
 #7     1 5-1015      4

Or

 library(data.table)#v1.9.5+
 setDT(df1)[, list(Amount=sum(Amount)), .(Period, Payer)]
 #    Period Payer Amount
 #1: 1-1015     1     11
 #2: 2-1015     2     15
 #3: 3-1015     3     14
 #4: 1-1015     3      5
 #5: 4-1015     1      7
 #6: 4-1015     3      8
 #7: 5-1015     1      4

Using a different order

 aggregate(Amount~., df2, FUN=sum)
 #  Payer Period Amount
 #1     1 1-1015     11
 #2     3 1-1015      5
 #3     2 2-1015     15
 #4     3 3-1015     14
 #5     1 4-1015      7
 #6     3 4-1015      8
 #7     1 5-1015      4

data

 df1 <- structure(list(Payer = c(1L, 2L, 3L, 1L, 3L, 1L, 3L, 1L), 
 Amount = c(10L, 
 15L, 14L, 1L, 5L, 7L, 8L, 4L), Period = c("1-1015", "2-1015", 
 "3-1015", "1-1015", "1-1015", "4-1015", "4-1015", "5-1015")),
 .Names = c("Payer", 
  "Amount", "Period"), class = "data.frame", row.names = c(NA, -8L))

 set.seed(24)
 df2 <- df1[sample(nrow(df1)),]

This relies on the dataset having all entries in order of increasing Period, Payer. If any of the entries aren't, the result will be out of order. — smci, Jul 12 '15 at 19:07
@smci. I do not think that is true in either instance. Neither `aggregate` nor the "by" argument to `data.table.[` require their arguments to be ordered by group. — IRTFM, Jul 12 '15 at 19:08

score 1 · Answer 2 · answered Jul 12 '15 at 19:05

1

require(dplyr)
df %>% group_by(Period,Payer) %>%
    summarize(Amount = sum(Amount)) %>%
    ungroup() # this should ungroup by the last grouped var, i.e. Payer

# if that doesn't work, then add an explicit %>% arrange(Period, Payer)

answered Jul 12 '15 at 19:05

smci

32,567
20
113
146

Is there a reason you choose to use this over `aggregate`? – theamateurdataanalyst Jul 12 '15 at 19:06
@theamateurdataanalyst: I thought the aggregate did not give explicit ordering by Period, Payer (as the OP asked for). But it does. So this is just a dplyr equivalent to the aggregate. dplyr(/data.table) almost surely performs better for high cardinality grouping, i.e. large files. – smci Jul 12 '15 at 19:22

How to summarise a table by 2 columns in R

2 Answers2

data