2

I have a extremely large data.table with 1.6x10^8 rows and I want to perform a row-wise operation between the columns exposure and exposure.before.index, as seen in my example below.

I created the column TI (i.e. Treatment Intensification) that is an indicator of whether a not an ID is currently on a drug/drugs, exposure, that is different from any drugs they were on at each ID's respective first row, exposure.before.index. You can review my code and observe that the final output is as explained.

library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b","c","c"),
                             drugA=c(1,1,1,0,0,0,0),
                             drugB=c(0,1,1,1,0,0,0),
                             drugC=c(0,0,1,0,1,0,0))
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste, 
                                                 c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD, 
                                                       seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]
DT[exposure=="NA",exposure:="NONE"]
DT[,exposure.before.index:=c("drugA","drugA","drugA","drugB","drugB","NONE","NONE")]
DT[,CNT:=1:.N]
DT[!(exposure.before.index!="NONE" & exposure=="NONE"),TI:=(any(!unlist(strsplit(exposure, "[+]"))%in%unlist(strsplit(exposure.before.index, "[+]")))),by="CNT"]
DT[is.na(TI),TI:=FALSE]
DT

   ID drugA drugB drugC          exposure exposure.before.index CNT    TI
1:  a     1     0     0             drugA                 drugA   1 FALSE
2:  a     1     1     0       drugA+drugB                 drugA   2  TRUE
3:  a     1     1     1 drugA+drugB+drugC                 drugA   3  TRUE
4:  b     0     1     0             drugB                 drugB   4 FALSE
5:  b     0     0     1             drugC                 drugB   5  TRUE
6:  c     0     0     0              NONE                  NONE   6 FALSE
7:  c     0     0     0              NONE                  NONE   7 FALSE

I created CNT in order to apply my function any(!unlist(strsplit(exposure, "[+]"))%in%unlist(strsplit(exposure.before.index, "[+]"))) between exposure and exposure.before.index. Due to the 1.6x10^8 rows that I have this method is taking quite some time. I've usually use this data.table[...,by="CNT"] technique when I want to apply a certain operation/function row-wise, but I'm finding this not robust for extremely large data.table's. Is there other methods that some of y'all have that are more robust than my method?

I've found other questions similar to my topic but the answers weren't generalized for applying a row-wise operation on a user-defined function in a robust manner.

Any help and/or advice is appreciated.

theneil
  • 488
  • 1
  • 4
  • 14
  • 1
    Hi theneil, can you please explain the meaning of the integer columns? Are all the drugs encoded in integer columns? On row 2 `exposure.before.index = "drugA"`, but in your text you state "is different from any drugs they were on before, `exposure.before.index`". Can you clarify? In general, string operations are slow, so if you can simplify your problem into integer or logical comparisons, you'll be a lot better off. – Ian Campbell Apr 06 '20 at 18:49
  • @IanCampbell you caught a mistake a made. So `exposure.before.index` should consist of the drug a patient was on in the first row. I had arbitrarily chose drug types at random while trying to come up with this column and was focusing on the desired output rather than following the logic from my original working data.table. I'm making the edits right now; hope it makes more sense – theneil Apr 06 '20 at 21:39
  • Can you confirm whether the ```drugA```, ```drugB```, and ```drugC``` are actually available in your dataset or if they only exist to create the dataset here? If it is the latter, I recommend removing them to only include what your actual dataset looks like. – Cole Apr 07 '20 at 02:16
  • @Cole my actual dataset has over 15 different drug class by generic name. Including the actual names of the drugs is overkill; I created this example so that I can reproduce it on my actual dataset, regardless of what the drug names are. I just saw that you posted an answer. I will thoroughly go over it later in the day. Thank you!! – theneil Apr 07 '20 at 17:57
  • Sorry, not the naming convention but the fields themselves. If anything i would shorten the names to A, B, C. – Cole Apr 07 '20 at 22:09

1 Answers1

1

This is difficult. strsplit will not be very memory efficient for this 100 million dataset - each row requires two lists to be made from strsplit. My suggestion is to use a function and skip the by = 1:.N step.

exposed = function(before, after) {
  out = vector(length = length(before))
  for (i in seq_along(before)) {
    bef = before[i]
    aft = after[i]
    if (bef == "NONE" || aft == "NONE") 
      out[i] = FALSE
    else
      out[i] = any(!unlist(strsplit(aft, "[+]", fixed = TRUE), use.names = FALSE)%chin%unlist(strsplit(bef, "[+]", fixed = TRUE), use.names = FALSE))
  }
  return(out)
}

DT[, TI3 := exposed(exposure.before.index, exposure)]

> DT[, .(exposure.before.index, exposure, TI, TI3)]
   exposure.before.index          exposure    TI   TI3
1:                 drugA             drugA FALSE FALSE
2:                 drugA       drugA+drugB  TRUE  TRUE
3:                 drugA drugA+drugB+drugC  TRUE  TRUE
4:                 drugB             drugB FALSE FALSE
5:                 drugB             drugC  TRUE  TRUE
6:                  NONE              NONE FALSE FALSE
7:                  NONE              NONE FALSE FALSE

Note there are a few optimizations here:

  1. Using %chin% instead of %in% which is a utility function that is faster on character vectors than %in%
  2. Using strsplit(..., fixed = TRUE) to optimize - this isn't a regular expression we are using. Likely the biggest performance boost.
  3. unlist(..., use.names = FALSE)

The next step would be to turn the function into an Rcpp which is not done here. Strings are more complicated than numbers in Rcpp (at least for me).

Here's the performance of this function. For the 7 row example, this is 4 times faster. But as we increase the rows, the speed difference becomes less significant:

## 7 rows
Unit: microseconds
   expr      min       lq     mean   median       uq       max
 use_fx  375.801  395.251  662.582  409.751  431.351 21345.701
     OP 1889.901 2021.601 2211.858 2096.101 2285.201  4042.801

## 700,000 rows
Unit: seconds
   expr       min        lq      mean    median        uq       max
 use_fx  4.409595  4.409595  4.409595  4.409595  4.409595  4.409595
     OP 12.592520 12.592520 12.592520 12.592520 12.592520 12.592520

## 7,000,000 rows
Unit: seconds
   expr       min        lq      mean    median        uq       max
 use_fx  43.90979  43.90979  43.90979  43.90979  43.90979  43.90979
     OP 130.16418 130.16418 130.16418 130.16418 130.16418 130.16418

## code used:
DT_big = DT[rep(seq_len(.N), 1e5)]
microbenchmark(
  use_fx = DT_big[, TI3 := exposed(exposure.before.index, exposure)],
  OP = {
    DT_big[,CNT:=1:.N]
    DT_big[!(exposure.before.index!="NONE" & exposure=="NONE"),TI:=(any(!unlist(strsplit(exposure, "[+]")) %in% unlist(strsplit(exposure.before.index, "[+]")))),by="CNT"]
    DT_big[is.na(TI),TI:=FALSE]
  }
  , times = 1L
)

If you are interested in Rcpp, this may be helpful:

https://wckdouglas.github.io/2015/05/string-manipulation

Cole
  • 11,130
  • 1
  • 9
  • 24
  • Could ```exposure.before.index``` have multiple drugs? I assume yes based on your code with the ```strsplit(...)``` – Cole Apr 07 '20 at 00:13