14

I need to summarize a data frame by some variables, ignoring the others. This is sometimes referred to as collapsing. E.g. if I have a dataframe like this:

Widget Type Energy  
egg 1 20  
egg 2 30  
jap 3 50  
jap 1 60

Then collapsing by Widget, with Energy the dependent variable, Energy~Widget, would yield

Widget Energy  
egg  25  
jap  55  

In Excel the closest functionality might be "Pivot tables" and I've worked out how to do it in python ( http://alexholcombe.wordpress.com/2009/01/26/summarizing-data-by-combinations-of-variables-with-python/), and here's an example with R using doBy library to do something very related ( http://www.mail-archive.com/r-help@r-project.org/msg02643.html), but is there an easy way to do the above? And even better is there anything built into the ggplot2 library to create plots that collapse across some variables?

landroni
  • 2,902
  • 1
  • 32
  • 39
Alex Holcombe
  • 2,453
  • 4
  • 24
  • 34

3 Answers3

15

Use aggregate to summarize across a factor:

> df<-read.table(textConnection('
+ egg 1 20
+ egg 2 30
+ jap 3 50
+ jap 1 60'))
> aggregate(df$V3,list(df$V1),mean)
  Group.1  x
1     egg 25
2     jap 55

For more flexibility look at the tapply function and the plyr package.

In ggplot2 use stat_summary to summarize

qplot(V1,V3,data=df,stat="summary",fun.y=mean,geom='bar',width=0.4)
Jyotirmoy Bhattacharya
  • 9,317
  • 3
  • 29
  • 38
  • 2
    `by` also comes in handy from time to time. – Jonathan Chang Apr 01 '10 at 05:42
  • 3
    for an example using `ddply` from the `plyr` package take a look at this related question: http://stackoverflow.com/questions/2473659/r-what-are-the-best-functions-to-deal-with-concatenating-and-averaging-values-in – mropa Apr 01 '10 at 06:08
  • i would check out plyr for a general purpose SAC combine framework (what pivot tables are), it's an excellent resource – Dan Apr 01 '10 at 06:35
  • Thanks! it's taken me awhile to get my head around ggplot2 stat. Here's what I wanted exactly, I think: ggplot(data=df,aes(x=Widget,y=Energy))+ stat_summary(fun.y=mean,ymin=10,ymax=60) – Alex Holcombe Apr 01 '10 at 07:30
  • 1
    Do you really want the long line? Otherwise `stat_summary(fun.y=mean,geom='point')` produces just the points. – Jyotirmoy Bhattacharya Apr 01 '10 at 08:10
5

For those familiar with SQL, another way to manipulate dataframes can be the sqldf command in the sqldf package.

library(sqldf)
sqldf("SELECT Widget, avg(Energy) FROM yourDataFrame GROUP BY Widget")
BlueCoder
  • 293
  • 5
  • 13
1

@Jyotirmoy mentioned that this can be done with the plyr library. Here is what that would look like:

DF <- read.table(text=
"Widget Type Energy  
egg 1 20  
egg 2 30  
jap 3 50  
jap 1 60", header=TRUE)

library("plyr")
ddply(DF, .(Widget), summarise, Energy=mean(Energy))

which gives

> ddply(DF, .(Widget), summarise, Energy=mean(Energy))
  Widget Energy
1    egg     25
2    jap     55
Brian Diggs
  • 57,757
  • 13
  • 166
  • 188