I have a extremely large data.table with 1.6x10^8 rows and I want to perform a row-wise operation between the columns exposure
and exposure.before.index
, as seen in my example below.
I created the column TI (i.e. Treatment Intensification) that is an indicator of whether a not an ID is currently on a drug/drugs, exposure
, that is different from any drugs they were on at each ID's respective first row, exposure.before.index
. You can review my code and observe that the final output is as explained.
library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b","c","c"),
drugA=c(1,1,1,0,0,0,0),
drugB=c(0,1,1,1,0,0,0),
drugC=c(0,0,1,0,1,0,0))
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste,
c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD,
seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]
DT[exposure=="NA",exposure:="NONE"]
DT[,exposure.before.index:=c("drugA","drugA","drugA","drugB","drugB","NONE","NONE")]
DT[,CNT:=1:.N]
DT[!(exposure.before.index!="NONE" & exposure=="NONE"),TI:=(any(!unlist(strsplit(exposure, "[+]"))%in%unlist(strsplit(exposure.before.index, "[+]")))),by="CNT"]
DT[is.na(TI),TI:=FALSE]
DT
ID drugA drugB drugC exposure exposure.before.index CNT TI
1: a 1 0 0 drugA drugA 1 FALSE
2: a 1 1 0 drugA+drugB drugA 2 TRUE
3: a 1 1 1 drugA+drugB+drugC drugA 3 TRUE
4: b 0 1 0 drugB drugB 4 FALSE
5: b 0 0 1 drugC drugB 5 TRUE
6: c 0 0 0 NONE NONE 6 FALSE
7: c 0 0 0 NONE NONE 7 FALSE
I created CNT in order to apply my function any(!unlist(strsplit(exposure, "[+]"))%in%unlist(strsplit(exposure.before.index, "[+]")))
between exposure
and exposure.before.index
. Due to the 1.6x10^8 rows that I have this method is taking quite some time. I've usually use this data.table[...,by="CNT"] technique when I want to apply a certain operation/function row-wise, but I'm finding this not robust for extremely large data.table's. Is there other methods that some of y'all have that are more robust than my method?
I've found other questions similar to my topic but the answers weren't generalized for applying a row-wise operation on a user-defined function in a robust manner.
Any help and/or advice is appreciated.