14

I am working with a huge data table in R containing monthly measurements of temperature for multiple locations, taken by different sources.

The dataset looks like this:

library(data.table)

# Generate random data:
loc <- 1:10
dates <- seq(as.Date("2000-01-01"), as.Date("2004-12-31"), by="month")
mods <- c("A","B", "C", "D", "E")
temp <- runif(length(loc)*length(dates)*length(mods), min=0, max=30)
df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp)

So basically, for location 1, I have measurements from january 2000 to december 2004 taken by model A. Then, I have measurements made by model B. And so on for models C, D and E. And then, so on for location 2 to location 10.

What I need to do is, instead of having five different temperature measurements (from the models), to take the mean temperature for all the models.

As a result, I would have, for each location and each date, not five but ONLY ONE temperature measurement (that would be a multi-model mean).

I tried this:

df2 <- df[, Mean:=mean(Temperature), by=list(Model, Location, Date)]

which didn't work as I expected. I would at least expect the resulting data table to be 1/5th the number of rows of the original table, since I am summarizing five measurements into a single one.

What am I doing wrong?

ekad
  • 14,436
  • 26
  • 44
  • 46
thiagoveloso
  • 2,537
  • 3
  • 28
  • 57
  • 2
    Try using `df2 <- df[, .(mean = mean(Temperature)), by=list(Model, Location, Date)]` – Kunal Puri Apr 10 '16 at 05:34
  • 6
    You should read the package intro. Section #2 is on aggregation and illustrates what Kunal suggested: https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html – Frank Apr 10 '16 at 05:48

2 Answers2

19

I don't think you generated your test data correctly. The function expand.grid() takes a cartesian product of all arguments. I'm not sure why you included the Temperature=temp argument in the expand.grid() call; that duplicates each temperature value for every single key combination, resulting in a data.table with 9 million rows (this is (10*60*5)^2). I think you intended one temperature value per key, which should result in 10*60*5 rows:

df <- data.table(expand.grid(Location=loc,Date=dates,Model=mods),Temperature=temp);
df;
##       Location       Date Model Temperature
##    1:        1 2000-01-01     A    2.469751
##    2:        2 2000-01-01     A   16.103135
##    3:        3 2000-01-01     A    7.147051
##    4:        4 2000-01-01     A   10.301937
##    5:        5 2000-01-01     A   16.760238
##   ---
## 2996:        6 2004-12-01     E   26.293968
## 2997:        7 2004-12-01     E    8.446528
## 2998:        8 2004-12-01     E   29.003001
## 2999:        9 2004-12-01     E   12.076765
## 3000:       10 2004-12-01     E   28.410980

If this is correct, you can generate the means across models with this:

df[,.(Mean=mean(Temperature)),.(Location,Date)];
##      Location       Date      Mean
##   1:        1 2000-01-01  9.498497
##   2:        2 2000-01-01 11.744622
##   3:        3 2000-01-01 15.691228
##   4:        4 2000-01-01 11.457154
##   5:        5 2000-01-01  8.897931
##  ---
## 596:        6 2004-12-01 17.587000
## 597:        7 2004-12-01 19.555963
## 598:        8 2004-12-01 15.710465
## 599:        9 2004-12-01 15.322790
## 600:       10 2004-12-01 20.240392

Note that the := operator does not actually aggregate. It only adds, modifies, or deletes columns in the original data.table. It is possible to add a new column (or overwrite an old column) with duplications of an aggregated calculation (e.g. see http://www.r-bloggers.com/two-of-my-favorite-data-table-features/), but that's not what you want.

In general, when you aggregate a table of data, you are necessarily producing a new table that is reduced to one row per aggregation key. The := operator does not do this.

Instead, we need to run a normal index operation on the data.table, grouping by the required aggregation key (which will automatically be included in the output data.table), and add to that the j argument which will be evaluated once for each group. The result will be a reduced version of the original table, with the results of all j argument evaluations merged with their respective aggregation keys. Since our j argument results in a scalar value for each group, our result will be one row per Location/Date aggregation key.

bgoldst
  • 34,190
  • 6
  • 38
  • 64
  • 1
    Why does your `by=` not contain "Model"? – Frank Apr 10 '16 at 05:50
  • 3
    The OP indicated he wants one temperature measurement for each location and date, representing the average across all models for that location/date combination. We should not be grouping by model. – bgoldst Apr 10 '16 at 05:50
  • 4
    @bgoldst thank you so much for your answer. In fact I committed a mistake when simulating my data. Your suggested command produced the ouput I expected and your explanation helped me to understand a lot better how data tables work. I very much appreciated it. – thiagoveloso Apr 10 '16 at 06:08
  • 2
    Your note about the `:=` operator was particularly helpful. – RDavey Sep 27 '19 at 10:21
  • can we aggregate the table in place? Or does it have to create a new table? Doesn't it mean that it will be slow because data.table's speed comes from its inplace computations? – TYL Jun 10 '21 at 11:07
4

If we are using data.table, the CJ can be used

 CJ(Location=loc, date= dates,Model= mods)[, 
         Temperature:= temp][, .(Mean = mean(Temperature)), by = .(Location, date)]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I was wondering... How to apply your code to the data table `df` from the original question? – thiagoveloso Apr 13 '16 at 08:56
  • @thiagoveloso The `CJ` part gives a data.table output. i.e. `CJ(Location=loc, date= dates,Model= mods)` – akrun Apr 13 '16 at 08:57
  • What about the part `[, Temperature:= temp]`? Sorry for asking, but I just want to get it right... – thiagoveloso Apr 13 '16 at 09:03
  • 1
    @thiagoveloso That is creating the 'Temperature' column in the `data.table`. If you look at the other post, `expand.grid` also leaves the 'Temperature' column outside. The `CJ` part is doing similar to `expand.grid` – akrun Apr 13 '16 at 09:06
  • 1
    Ok, now I got it. Thanks a lot. – thiagoveloso Apr 13 '16 at 09:08
  • Just one last question: apparently, this function removes locations with NA in the temperature field. Is it possible to keep them? – thiagoveloso Apr 13 '16 at 09:19
  • @thiagoveloso If you look at the usage `CJ(..., sorted = TRUE, unique = FALSE)`, the default option `sorts`, which may remove the NA value. Perhaps using `sorted=FALSE` would be the option (not tested) – akrun Apr 13 '16 at 09:21
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109019/discussion-between-thiagoveloso-and-akrun). – thiagoveloso Apr 13 '16 at 09:28