1

I'm trying to aggregate data by two categories.

data

year    person  expense sex money
2011    kevin   truck   M   1
2011    mike    truck   M   62
2011    sally   truck   F   60
2012    kevin   truck   M   37
2012    mike    truck   M   53
2012    sally   truck   F   95
2013    kevin   truck   M   21
2013    mike    truck   M   13
2013    sally   truck   F   38
2014    kevin   truck   M   48
2014    mike    truck   M   4
2014    sally   truck   F   77
2011    kevin   house   M   7
2011    mike    house   M   94
2011    sally   house   F   79
2012    kevin   house   M   86
2012    mike    house   M   42
2012    sally   house   F   46
2013    kevin   house   M   90
2013    mike    house   M   76
2013    sally   house   F   75
2014    kevin   house   M   70
2014    mike    house   M   91
2014    sally   house   F   62

I'd like to sum the money column if year and person columns match.

desired output

year    person  sex money
2011    kevin   M   8
2011    mike    M   156
2011    sally   F   139
2012    kevin   M   123
2012    mike    M   95
2012    sally   F   141
2013    kevin   M   111
2013    mike    M   89
2013    sally   F   113
2014    kevin   M   118
2014    mike    M   95
2014    sally   F   139

How do I do this?

I tried data_aggregated = aggregate(data$money, by = list(name = data$name), FUN = sum), but that only aggregates my data by name. I don't know how to aggregate it by both name and year. It also creates a dataframe with only two variable columns: name and x.

I also tried concatenating the name+year into an ID variable, but that seems too tedious.

Veerendra Gadekar
  • 4,452
  • 19
  • 24
Jim
  • 715
  • 2
  • 13
  • 26

2 Answers2

4

You can choose from the formula route or list method below. If you would like the gender column also, it can be added just as the others:

aggregate(money ~ person + year + sex, data, sum)

Or with your approach:

aggregate(data$money, by = list(person=data$person, year=data$year, sex=data$sex), FUN=sum)

The package approaches are:

library(dplyr)
data %>% group_by(person, year, sex) %>% summarise(mon_sum=sum(money))
# Source: local data table [12 x 4]
# Groups: person, year
# 
# person  year    sex mon_sum
# (fctr) (int) (fctr)   (int)
# 1   kevin  2011      M       8
# 2    mike  2011      M     156
# 3   sally  2011      F     139
# 4   kevin  2012      M     123
# 5    mike  2012      M      95
# 6   sally  2012      F     141
# 7   kevin  2013      M     111
# 8    mike  2013      M      89
# 9   sally  2013      F     113
# 10  kevin  2014      M     118
# 11   mike  2014      M      95
# 12  sally  2014      F     139

With data.table. This method will prove most efficient in terms of performance and programming time, well worth learning:

library(data.table)
setDT(data)[,sum(money), by=.(person,year, sex)]
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • This is great, thanks! I prefer the dplyr method, because it retains all other ID variables. – Jim Feb 17 '16 at 15:11
  • One question: When I try this with a different data set, I get the following error: "Error: length(rows) == 1 is not TRUE" – Jim Feb 17 '16 at 15:50
  • It's hard to tell without seeing parts of the data. I would double check that it is a `data.frame`. Check with `str(df)` to be sure. Do other troubleshooting techniques to see the difference between that dataset and your originals. – Pierre L Feb 17 '16 at 15:52
2

This is super easy with dplyr:

library(dplyr)
df %>% group_by(year, person, sex) %>% summarise(money = sum(money))

returns

Source: local data frame [12 x 4]
Groups: year [?]

    year person    sex money
   (int) (fctr) (fctr) (int)
1   2011  kevin      M     8
2   2011   mike      M   156
3   2011  sally      F   139
4   2012  kevin      M   123
5   2012   mike      M    95
6   2012  sally      F   141
7   2013  kevin      M   111
8   2013   mike      M    89
9   2013  sally      F   113
10  2014  kevin      M   118
11  2014   mike      M    95
12  2014  sally      F   139
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • One question: When I try this with a different data set, I get the following error: "Error: length(rows) == 1 is not TRUE" – Jim Feb 17 '16 at 15:49
  • Without seeing the structure of the data, it's hard to say. A guess: if you're trying to replace `sum` with `length`, using `dplyr::n` instead avoids some issues. – alistaire Feb 17 '16 at 15:58