6

I have a small problem that I can't seem to solve. Given two columns:

dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))

I would like to "subtract" columnB from columnA, which will result in:

data.table(Result = c("B,A,A", "A,B"))

How would one achieve this fact without first transforming it into a list and then try to subtract the list? In addition, since the dataset is quite big, it cannot be done using a for loop in R.

Every item in the comma seperated string should be treated as one item and should be subtracted only once if it occurs once. Hence not all A's are gone in the first row.

Snowflake
  • 2,869
  • 3
  • 22
  • 44
  • 2
    Order doesn't matter, right? I assume `"A,B,C" - "C,A" = "B"`? I ask because your `A,C,A` example has the letters in the same order they appear in `A,B,C,A,A,A`. – Gregor Thomas Dec 10 '19 at 21:26
  • 2
    Also, what happens if there is a "negative" result? What is `"A,B" - "B,C"`? – Gregor Thomas Dec 10 '19 at 21:27
  • 1
    Dear Gregor, these are good comments. Negative results should result in nothing actually, since C cannot be deducted from the first list. In addition, order is irrelevant. – Snowflake Dec 10 '19 at 23:34

3 Answers3

6

Another option leveraging the function vecsets::vsetdiff which doesn't remove duplicates:

library(dplyr)
library(tidyr)
library(purrr)
library(vecsets)

dt %>% 
  mutate(x = strsplit(ColumnA,","),
         y = strsplit(ColumnB,",")) %>% 
  mutate(z = map2(x,y,vecsets::vsetdiff))

      ColumnA ColumnB                x       y       z
1 A,B,C,A,A,A   A,C,A A, B, C, A, A, A A, C, A B, A, A
2       A,B,C       C          A, B, C       C    A, B

Note that you end up with list columns here (which I created on purpose for this to work), but the data might be easier to work with that way anyway.

joran
  • 169,992
  • 32
  • 429
  • 468
  • 1
    I was hoping to get away with `stri_encode(Map(vsetdiff, stri_encode(A, to_raw = T), stri_encode(B, to_raw = T)))`, but apparently there's not an `order` method for `raw` vectors, causing `vsetdiff` to error out. – Gregor Thomas Dec 10 '19 at 22:06
  • 1
    Translating into `data.table`: `dt[, list(mapply(vecsets::vsetdiff, strsplit(ColumnA,","), strsplit(ColumnB,",")))]` – s_baldur Dec 11 '19 at 14:05
4
sapply(1:nrow(dt), function(i){
    a = dt$ColumnA[i]
    b = unlist(strsplit(dt$ColumnB[i], ","))
    for (x in b){
        a = sub(paste0(x, ",?"), "", a)
    }
    sub(",$", "", a)
})
#[1] "B,A,A" "A,B"
d.b
  • 32,245
  • 6
  • 36
  • 77
1

Not sure if using string split fails this criteria:

How would one achieve this fact without first transforming it into a list and then try to subtract the list?

Will delete this post if OP decides that this violates OP's criteria.

Here is an option using data.table's anti-join. It takes about 7 seconds for 2 million rows:

library(data.table)
library(stringi) #for fast string processing

dt <- data.table(ColumnA = c("A,B,C,A,A,A", "A,B,C"), ColumnB = c("A,C,A", "C"))
DT_big <- dt[rep(seq(dt[, .N]), 1e6)]

f <- function(DT, coln) {
    res <- DT[, {
        s <- stri_split_fixed(get(coln), ',')
        .(rn=rep(seq_along(s), lengths(s)), S=unlist(s))
    }]
    res[, n := rowid(S)]
}

system.time({
    DTA <- f(DT_big, "ColumnA")
    DTB <- f(DT_big, "ColumnB")
    ans <- DTA[!DTB, on=.(rn, S, n)][, .(Result=paste(S, collapse=",")), .(rn)][, rn := NULL][]
})
ans

timing:

   user  system elapsed 
   7.56    0.33    7.20 

output:

            Result
      1:     B,A,A
      2:       A,B
      3: A,B,A,A,A
      4:       A,B
      5: A,B,A,A,A
     ---          
1999996:       A,B
1999997: A,B,A,A,A
1999998:       A,B
1999999: A,B,A,A,A
2000000:       A,B
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • @IceCreamToucan, sorry i just saw your deleted post after posting. other than using one more joining key we are both using an anti-join and timing difference. may i know why you deleted yours? – chinsoon12 Dec 11 '19 at 01:02
  • Despite OP's wishes, all answer so far split the string... it's hard to imagine an answer that doesn't written in R (as opposed to to, say, Rcpp). I say, keep your answer, and I'm sure someone will come along to benchmark them all. – Gregor Thomas Dec 11 '19 at 01:12
  • @Gregor, thanks, i have timed d.b and IceCreamToucan code as well. one is around 64s and the other is still running after a few mins. i do not have tidyverse installed to time joran's solution – chinsoon12 Dec 11 '19 at 01:25