data.table: aggregate, join, and assign by reference

Question

let's call dta the table I want to assign to, and dts the source of the data I want to join and aggregate to dta.

dta = data.table(i=1:4, x=rnorm(4))
dts = data.table(i=rep(1:3, each=3), z=runif(9))

I would think I should be able to join on 'i' and aggregate on it in one statement:

dta[dts, z_sum := sum(i.z), by=i, on='i']

but alas, that does not work

Error in `[.data.table`(dta, dts, `:=`(z_sum, sum(i.z)), by = i, on = "i") : 
  object 'i.z' not found

Enter a frame number, or 0 to exit   

1: dta[dts, `:=`(z_sum, sum(i.z)), by = i, on = "i"]
2: `[.data.table`(dta, dts, `:=`(z_sum, sum(i.z)), by = i, on = "i")

replacing by=i with by=.EACHI gives the wrong result (the last value of z in dts for each value of the i variable)

And leaving out by by entirely just sums all the values dts$z and assigns that same value to all the rows in dta.

Now, I CAN do:

dta[dts[, .(z=sum(z)), keyby=i], z := i.z, on='i']

but it seems like there should be some way to interact with the data.table internals to get this right in one efficient statement rather than the separate aggregation and then joining being done here.

Am I missing something, or is that the best way to do things? I'm doing very large joins (merging tables with hundreds of millions of rows thousands of times), so being able to leverage all the efficiency I can is important.

yea, that works, but is that likely to be more efficient than the two step process I outlined in my last code snippet? — James, Nov 07 '17 at 06:06
I guess both of them are more or less the same except that you are modifying the original object — akrun, Nov 07 '17 at 06:07
I guess when dta is subsetting dts more substantially, a number of computations can be avoided by first subsetting dts with dta, aggregating, then joining back, as you did. One instance where it seems to make sense. Seems that there is an assumption that the i data.table is smaller than the x, which is surprising to me. Thanks for the quick responses! — James, Nov 07 '17 at 06:11

akrun · Accepted Answer · 2017-11-07T06:25:29.570

4

We can use .EACHI

dta[, z_sum := dts[.SD, on = .(i), sum(z), by = .EACHI]$V1]

data

set.seed(24)
dta = data.table(i=1:4, x=rnorm(4))
dts = data.table(i=rep(1:3, each=3), z=runif(9))

edited Nov 07 '17 at 06:25

answered Nov 07 '17 at 05:45

akrun

874,273
37
540
662

nope, that does not produce the right result for me! As mentioned in my post, it give me the last/3rd value of z in dts for each i, not the sum of the zs – James Nov 07 '17 at 05:54
@James, I changed the code slightly. Please check if that works – akrun Nov 07 '17 at 06:25
thanks for keeping at this! i usually try to avoid .SD when performance is important. Is that not going to be an issue here? Also, i'm confused why .EACHI works there. I would have used .(i) (which also produces the correct result!) – James Nov 07 '17 at 06:33
@James Yes, that also produce the same result. I tested only with `.EACHI`. Here, the `.SD` is referening the 'dta'. – akrun Nov 07 '17 at 06:36
turns out there's an issue on this: https://github.com/Rdatatable/data.table/issues/733 – James Nov 07 '17 at 07:35
2

@James I think `dta[, sum_z3 := dts[dta, .(z=sum(z)), by=i, on=.(i)]$z]` works, but again it is better to have a new feature – akrun Nov 07 '17 at 07:43
1

I consider myself to be pretty advanced in data.table, but this looks like magic to me. Can you explain the steps? – Ryan Ward Valverde Sep 16 '21 at 18:50
@RyanWard thank you for the comments – akrun Sep 16 '21 at 18:58
1

@RyanWard The `.SD` within the `dta` is Subset of Data.table from `dta` which we are using in `i` of 'dts' for joining `on` by 'i1'. As we need the `sum` of 'z' for each 'i' in the join, we use the `by` – akrun Sep 16 '21 at 19:09
1

Thank you. I guess you're right, the `.SD` here is really throwing me off. – Ryan Ward Valverde Sep 16 '21 at 19:15

data.table: aggregate, join, and assign by reference

1 Answers1

data