I start with a data.table of two columns and want to apply filtering criteria consecutively. Some criteria are computationally expensive, so I want to apply those last, once the dataset has already been reduced.
I would expect data.table to handle this much better than data.frame, but my data.table solution is no more efficient than the data.frame one. Is there any way to make this faster?
Example:
library(data.table)
# fake data
df <- data.frame(id = 1:1000000, content = round(runif(1000000) * 3))
dt <- as.data.table(df, key = "id")
for (i in 1:1000000) {
  # data.table version (memory-intensive and slow)
  preselection_dt <- dt[id > i]
  some1 <- preselection_dt[3 == content, which = TRUE]
  # data.frame version: saves only a logical index and is fast
  preselection_df <- df$id > i
  some2 <- which(3 == df$content[preselection_df])
}
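For reference, since the ids in the fake data above are exactly 1..nrow(dt), the cheap filter could in principle be done by position rather than by scanning the id column (a sketch under that assumption only, not a general keyed solution):

```r
library(data.table)

# fake data as above; ids are exactly 1..N, so the table is sorted by id
dt <- data.table(id = 1:1000000, content = round(runif(1000000) * 3))
i <- 500000

# positional subset: no vector scan of the id column
preselection_dt <- dt[(i + 1):.N]
# expensive condition evaluated only on the reduced table
some1 <- preselection_dt[content == 3, which = TRUE]
```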
(Assume that 3 == content stands for something computationally very intensive, so I would gain a lot by evaluating it only for the observations that already pass the first condition.)
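To make the intended behaviour explicit, here is a small sketch of what I mean; expensive_test is a hypothetical placeholder for the costly predicate, and this variant returns indices into the original df rather than into the subset:

```r
# fake data as above
df <- data.frame(id = 1:1000000, content = round(runif(1000000) * 3))
i <- 500000

# hypothetical placeholder for the computationally intensive check
expensive_test <- function(x) x == 3

keep <- which(df$id > i)                          # cheap filter first
hits <- keep[expensive_test(df$content[keep])]    # costly check only on survivors
```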
I have not been able to find a solution that uses data.table keys efficiently and still produces the right results.