let's call dta the table I want to assign to, and dts the source of the data I want to join and aggregate to dta.
dta = data.table(i=1:4, x=rnorm(4))
dts = data.table(i=rep(1:3, each=3), z=runif(9))
I would think I should be able to join on 'i' and aggregate on it in one statement:
dta[dts, z_sum := sum(i.z), by=i, on='i']
but alas, that does not work
Error in `[.data.table`(dta, dts, `:=`(z_sum, sum(i.z)), by = i, on = "i") :
object 'i.z' not found
Enter a frame number, or 0 to exit
1: dta[dts, `:=`(z_sum, sum(i.z)), by = i, on = "i"]
2: `[.data.table`(dta, dts, `:=`(z_sum, sum(i.z)), by = i, on = "i")
replacing by=i
with by=.EACHI
gives the wrong result (the last value of z
in dts
for each value of the i
variable)
And leaving out by by
entirely just sums all the values dts$z
and assigns that same value to all the rows in dta
.
Now, I CAN do:
dta[dts[, .(z=sum(z)), keyby=i], z := i.z, on='i']
but it seems like there should be some way to interact with the data.table internals to get this right in one efficient statement rather than the separate aggregation and then joining being done here.
Am I missing something, or is that the best way to do things? I'm doing very large joins (merging tables with hundreds of millions of rows thousands of times), so being able to leverage all the efficiency I can is important.