I have a data frame where each line represents an individual. That data frame has two variables: age and year. I want to make a table of average ages per year. How can I do it?

The best I could come up with was xtabs(age ~ year, dataframe), but this gives me the sum of ages per year.

Waldir Leoncio

4 Answers

Use aggregate:

xtabs(hp ~ cyl + gear, aggregate(hp ~ cyl + gear, mtcars, mean))
   gear
cyl        3        4        5
  4  97.0000  76.0000 102.0000
  6 107.5000 116.5000 175.0000
  8 194.1667   0.0000 299.5000
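Applied to the question's setup, the same idea might look like the sketch below. The data frame `people` and its values are made up for illustration; only the `year` and `age` column names come from the question.

```r
# Hypothetical toy data: one row per individual
people <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2011),
  age  = c(20, 30, 40, 50, 60)
)

# Mean age per year as a data frame (one row per year)
means <- aggregate(age ~ year, data = people, FUN = mean)

# The same means laid out as a table: each year now has a single row,
# so the "sum" that xtabs computes is just that one mean
mean_table <- xtabs(age ~ year, data = means)
mean_table  # 2010 -> 25, 2011 -> 50
```

If a data frame is all you need, `means` is already the answer and the xtabs step can be skipped.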
James

Have a look at the plyr package, specifically ddply:

library(plyr)
ddply(dataframe, .(year), summarise, mean(age))

If you have to use xtabs... see the above answer.
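For more than one grouping variable, a minimal sketch could look like this. It assumes the plyr package is installed; the `people` data frame, the `sex` column and the `mean_age` output name are invented for illustration.

```r
library(plyr)  # assumes plyr is installed

# Hypothetical toy data: one row per individual
people <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2011),
  sex  = c("F", "M", "F", "F", "M"),
  age  = c(20, 30, 40, 50, 60)
)

# One row per (year, sex) combination, with a named summary column
res <- ddply(people, .(year, sex), summarise, mean_age = mean(age))
res
```

Extra grouping variables go inside `.()`, while extra summaries are added as further named arguments after `summarise`.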

Justin
  • Thank you for showing me another very elegant solution, but since my actual problem involves three variables, I'm still going with xtabs. :) – Waldir Leoncio Jan 25 '12 at 18:18
  • You can continue to list variables in ddply as well: `ddply(dataframe, .(year), summarise, mean(age), max(height), sd(weight), etc...)` – Justin Jan 25 '12 at 18:25

I'm glad that xtabs solution works for you. I must have a package loaded that interferes with it for me (and gives an error). Another solution would be:

tapply(dfrm$age, dfrm$year, FUN=mean)

To add dimensions to the resulting table (an array), pass more factors in the second argument, INDEX, as a list: list(fac1, fac2, fac3).

Applied to the example using mtcars:

tapply(mtcars$hp, list(mtcars$cyl, mtcars$gear), mean)
         3     4     5
4  97.0000  76.0 102.0
6 107.5000 116.5 175.0
8 194.1667    NA 299.5

Or even more compactly:

with(mtcars, tapply(hp, list(cyl, gear), mean))
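One difference worth noticing between this output and the xtabs/aggregate table from the other answer is the cell for 8 cylinders and 4 gears: tapply reports NA, while xtabs reports 0. The sketch below, using only mtcars and base R, checks that the cell really is empty, so NA is arguably the more honest value there.

```r
# Mean horsepower by cylinders and gears, computed two ways
via_tapply <- with(mtcars, tapply(hp, list(cyl, gear), mean))
via_xtabs  <- xtabs(hp ~ cyl + gear, aggregate(hp ~ cyl + gear, mtcars, mean))

# Number of cars with 8 cylinders and 4 gears in mtcars
n_8cyl_4gear <- sum(mtcars$cyl == 8 & mtcars$gear == 4)
n_8cyl_4gear          # 0: the cell has no observations

via_tapply["8", "4"]  # NA: tapply marks the empty cell as missing
via_xtabs["8", "4"]   # 0: xtabs sums over nothing and returns zero
```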
IRTFM

Another solution is to use two xtabs calls: the first computes the sum of ages per group, the second counts the individuals in each group, and dividing the sum by the count gives the mean.

E.g., xtabs(age ~ year, dataframe) / xtabs(~ year, dataframe)

However, the aggregate approach also works with the median or any other summary function, so it is definitely more flexible.
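A small sketch of both points, on made-up data (the `people` data frame and its values are hypothetical; only the column names come from the question):

```r
# Hypothetical toy data: one row per individual
people <- data.frame(
  year = c(2010, 2010, 2011, 2011, 2011),
  age  = c(20, 30, 40, 50, 90)
)

# Mean as sum / headcount, using two xtabs calls
mean_by_ratio <- xtabs(age ~ year, people) / xtabs(~ year, people)
mean_by_ratio   # 2010: 50 / 2 = 25, 2011: 180 / 3 = 60

# The ratio trick only yields means; aggregate accepts any summary function
median_by_year <- aggregate(age ~ year, data = people, FUN = median)
median_by_year  # 2010: 25, 2011: 50
```

With several grouping variables, a combination with no observations gives 0 / 0, which R evaluates to NaN, so empty cells still stand out in the ratio version.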

Bach