
I start off with a data.table with two columns, and want to consecutively apply criteria to filter out observations. Some criteria are computationally burdensome so I want to apply these last (when the dataset is already reduced).

I think this is something data.table should handle much better than data.frame, but my data.table solution is no more efficient than the data.frame one. Is there any way to make this faster?

Example:

library(data.table)
# fake data: 1e6 rows, content takes the values 0-3
df <- data.frame(id = 1:1000000, content = round(runif(1000000) * 3))
dt <- as.data.table(df)
setkey(dt, id)  # key the table on id

for (i in 1:1000000) {
  # data.table version: materialises a copy of the subset (memory intensive and slow)
  preselection_dt <- dt[id > i]
  some1 <- preselection_dt[content == 3, which = TRUE]

  # data.frame version: keeps only a logical index, and is fast
  preselection_df <- df$id > i
  some2 <- which(df$content[preselection_df] == 3)
}

(Assume that content == 3 stands for something computationally very intensive, so I would gain a lot by computing it only for the observations where the first condition already holds anyway.)
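A minimal sketch of that deferral pattern in plain data.frame terms, where expensive_check() is a hypothetical stand-in for the costly test (it is not part of the original setup):

# hypothetical placeholder for the expensive condition
expensive_check <- function(x) x == 3

i <- 500000
idx  <- which(df$id > i)                        # cheap condition first: surviving row numbers
hits <- idx[expensive_check(df$content[idx])]   # costly test evaluated only on the survivors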

I am unable to find a solution that uses data.table keys efficiently and produces the right results.
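One possibility (a sketch under the setup above, not a tested answer) is to keep everything in a single data.table call and let j return row indices via .I, so the second condition is only evaluated on the rows that pass the first and no intermediate copy of the table is materialised. As far as I know the key mainly speeds up equality and join lookups, so id > i is still a vector scan, but the copy is avoided:

i <- 500000
# the i-expression filters on id first; .I[...] then applies the second test
# only to the surviving rows and returns their row numbers in dt
some3 <- dt[id > i, .I[content == 3]]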

sheß
  • Maybe you could give more details about what you are doing? Because, in my experience, sometimes it is better to simply use vectors. Your first data.table operation is slow because you are copying the table to a new variable. – minem Jan 05 '18 at 07:58
  • @minem, well, basically I am filtering a dataset with a lot of string/text variables; the broader question is this one: https://stackoverflow.com/questions/48058104/efficient-string-similarity-grouping . Because I started using data.tables in that context, I had this question and wondered how to approach it efficiently. I understand that in some contexts this might not be ideal, but I guess a canonical answer to the question as is might be helpful for others too. – sheß Jan 05 '18 at 11:00

0 Answers