R: Compare character strings across multiple columns to character string in a single column by row

Question

I am trying to create a variable that is a logical value when comparing one character string to more than two other character strings in a data.table and I need to ignore NA's.

Sample data for D2:

structure(list(ID = c("a001", "a002", "a003"), var1 = c("char1", 
"char1", "char2"), var2 = c("char1", NA, "char2"), var3 = c("char1", 
"char1", "char1")), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x0000015eb1261ef0>)

Attempted a proposed solution below:

D2[, Match := apply(sapply(.SD, `==`, D2[, "var1"]), 1, any), .SDcols = 
c("var2", "var3")]

Result for a003 is TRUE whereas it should be FALSE because var1 and var3 don't match:

structure(list(ID = c("a001", "a002", "a003"), var1 = c("char1", 
"char1", "char2"), var2 = c("char1", NA, "char2"), var3 = c("char1", 
"char1", "char1"), Match = c(TRUE, TRUE, TRUE)), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 
0x0000015eb1261ef0>)

Desired Result:

structure(list(ID = c("a001", "a002", "a003"), var1 = c("char1", 
"char1", "char2"), var2 = c("char1", NA, "char2"), var3 = c("char1", 
"char1", "char1"), Match = c(TRUE, TRUE, FALSE)), row.names = c(NA, 
-3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 
0x0000015eb1261ef0>)

Your desired result is identical to your sample data. Did you forget to include the proper expected output? — Maurits Evers, Sep 17 '19 at 22:05
@Maurits Evers Never fails. I just edited the desired result. — user3594490, Sep 17 '19 at 22:26
Thanks for updating. Do you only want to compare `var1` with `var2`? What about `var3`? I'm still trying to understand the rules/logic behind your expected output. For row 3, is `Match = FALSE` because `var1 != var2`? Why is `Match = TRUE` for row 2? — Maurits Evers, Sep 17 '19 at 22:55
@MauritsEvers Yes exactly, I have more than two columns that I need to match with var1. I can not find one example of using mapply with more than two variables. ID a003 is not a match because var1 does not match one of either var2 or var3. — user3594490, Sep 17 '19 at 23:08

Maurits Evers · Accepted Answer · 2019-09-18T00:16:58.270

4

How about the following

setDT(D1)
D1[, Match := apply(sapply(.SD, `==`, D1[, "var1"]), 1, any), .SDcols = c("var2", "var3")]
D1
#ID  var1  var2  var3 Match
#1: a001 char1 char1 char1  TRUE
#2: a002 char1  <NA> char1  TRUE
#3: a003 char2 char1 char1 FALSE

Explanation: We compare entries in the sub-data.table defined through .SDcols with entries in D1[, "var1"]; if there is any match, return TRUE, else FALSE.

Update

In response to your comment, you can do

setDT(D1)
D1[, Match := apply(sapply(.SD, `==`, D1[, "var1"]), 1, all, na.rm = T), .SDcols = c("var2", "var3")]

edited Sep 18 '19 at 00:16

answered Sep 17 '19 at 23:25

Maurits Evers

49,617
4
47
68

If I change the value of var2 for ID a003 from char1 to char2 the result is TRUE whereas it should be FALSE because var3 is not a match with var1 even though var2 is. – user3594490 Sep 17 '19 at 23:41
@user3594490 That was not clear to me; I thought you wanted to have `TRUE` if at least one entry matched. I've made an edit, please take a look. – Maurits Evers Sep 18 '19 at 00:17
1

Yes! This works and my initial explanation needed to be more clear. – user3594490 Sep 18 '19 at 00:45

chinsoon12 · Answer 2 · 2019-09-18T00:52:43.987

Another option is:

D2[, m := Reduce(`&`, data.table(
        sweep(as.matrix(.SD[, -"var1"]), 1L, .SD[["var1"]], function(x, y) is.na(x) | x==y))), 
    .SDcols=var1:var3]

Or using melt:

D2[, m := melt(D2, id.vars=c("ID", "var1"))[, any(var1==value, na.rm=TRUE), ID]$V1]

data:

library(data.table)
D2 <- structure(list(ID = c("a001", "a002", "a003"), var1 = c("char1", 
    "char1", "char2"), var2 = c("char1", NA, "char2"), var3 = c("char1", 
        "char1", "char1")), row.names = c(NA, -3L), class = c("data.table", 
            "data.frame"))
setDT(D2)

score 1 · Answer 3 · answered Sep 18 '19 at 01:12

1

We can do this with only apply by checking if the first value in the row is same as all the other elements in the row.

D2$match <- apply(D2[, -1], 1, function(x) all(x[1] == na.omit(x)))
D2
#     ID  var1  var2  var3 match
#1: a001 char1 char1 char1  TRUE
#2: a002 char1  <NA> char1  TRUE
#3: a003 char2 char2 char1 FALSE

answered Sep 18 '19 at 01:12

Ronak Shah

377,200
20
156
213

could you please explain this logic – jkatam Jul 19 '23 at 05:15

score 0 · Answer 4 · answered Sep 17 '19 at 22:55

0

Should be as simple as defaulting NA values:

else.na <- function(x, y) ifelse(is.na(x), y, x)
dt[,Match:= mapply('%in%', var1, else.na(var2, var1)) ,by = ID]

answered Sep 17 '19 at 22:55

Bulat

6,869
1
29
52

This works for var1 and var2 and produces desired output given my example. How can I apply it to var3 also? For example,only if the value for var1 matches the values for var2 and var3 would the result be TRUE. – user3594490 Sep 17 '19 at 23:19

R: Compare character strings across multiple columns to character string in a single column by row

4 Answers4

Update