3

I am trying to do some aggregations in a data.table and I am facing one challenge that cannot find the solution for. The challenge is really simple, I would like to summarise some of the values in the data.table along more than one dimension.

I have no problems getting the following code to work:

Export4R[,sum(units),by=Type]

This gives something along the following lines:

Type    Value
foobar  45
barfoo  25

But now I like to break it a bit down a bit further and hope to get a table like this:

Type    Month    Value
foobar  Mar      12
foobar  Apr      7
....

I tried to do this with a line of code like, but unfortunately this doesn't seem to work:

Export4R[,sum(units),by=Type,Month]

This is most likely a very simple issues, but I have problems finding the answer.

Thanks for the assistance!

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
Jochem
  • 3,295
  • 4
  • 30
  • 55
  • @Matthew Dowle thanks for the edit and you are right. Usually do it but forgot it for some reason. – Jochem Nov 26 '12 at 10:06

1 Answers1

5
Export4R[,sum(units),by="Type,Month"]

or

Export4R[,sum(units),by=list(Type,Month)]



The latter syntax allows expressions of column names and naming; e.g.,

Export4R[,sum(units),by=list(Grp1=substring(Type,1,2), Grp2=Month)]

Btw, you can format long queries over many lines :

Export4R[,list(
    s = sum(units)
    ,m = mean(units)
),by=list(
    Grp1=substring(Type,1,2)
    ,Grp2=Month
)]

The reason for putting the commas at the beginning like that is so that you can easily add and comment out columns, without messing with the closing brackets of the last item; e.g,

Export4R[,list(
    s = sum(units)
    # ,m = mean(units)
),by=list(
    Grp1=substring(Type,1,2)
    # ,Grp2=Month
)]

That idea comes from SQL.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
  • @MatthewDowle what's performance like when there are `by` expressions that are not covered by an index (key)? I'm thinking of using this to group by list column expressions in a table with 5+M rows and thousands of different group values. – Sim Dec 21 '12 at 01:09
  • @Sim Best just to try it and time for your dataset. Try and keep to integer types where possible. There is a benchmark in the tag somewhere timing unkeyed vs keyed by and the point it makes a difference. – Matt Dowle Dec 21 '12 at 01:43