Counting string matches according to a target vector across rows in a data frame

Question

I have a large dataset (~520,000 rows and 1,000 columns). A subset of those columns are ICD codes. I would like to perform a sum for each row (among the subset of columns that contain ICD codes) that counts all of the columns for which the ICD code entry matches a list of ICD codes of interest. I would then like to make a new indicator column where the value is 1 if the rowsum is over 0 (i.e., if any column has an ICD code that matches my list) and 0 if there are no columns with entries that match the target list of ICD codes. Here's a simple, reproducible example using a fake dataset:

# create fake dataset
id <- c(500, 550, 560)
icd.1 <- c("C00", "F14", "H15")
icd.2 <- c("F10", "G45", "A40")
icd.3 <- c(NA, "A16", "F13")
dat <- as.data.frame(cbind(id, icd.1, icd.2, icd.3))

# vector of ICD codes to search for
icd_include <- c("C00", "G46", "A16", "F13")

# vector of column names to search
icd_all <- paste0("icd.", seq(1,3))

If I have a single character value that I want to match, then rowSums gives me the perfect solution:

dat$event <- ifelse(rowSums(dat[icd_all] == "C00") > 0,
                    1,
                    0)

dat
   id icd.1 icd.2 icd.3 event
1 500   C00   F10   <NA>    1
2 550   F14   G45   A16     0
3 560   H15   A40   F13     0

I want to perform the rowSums for matches with all entries in the icd_include vector to produce the following output:

   id icd.1 icd.2 icd.3 event
1 500   C00   F10   <NA>    1
2 550   F14   G45   A16     1
3 560   H15   A40   F13     1

However, I can't seem to get this to work when trying to match a vector of many possible strings. In my mind, the logical way to try this would be to use the following code:

dat$event <- ifelse(rowSums(dat[icd_all] %in% icd_include, na.rm = TRUE) > 0,
                    1,
                    0)

But this doesn't work and produces the following error:

Error in rowSums(dat[icd_all] %in% icd_include) : 
  'x' must be an array of at least two dimensions

The current workaround solution I have is to create a dummy column for each column of ICD codes, and to then perform rowSums across these:

dat$event.1 <- ifelse(dat[[icd_all[1]]] %in% icd_include,
                      1,
                      0)

dat$event.2 <- ifelse(dat[[icd_all[2]]] %in% icd_include,
                      1,
                      0)

dat$event.3 <- ifelse(dat[[icd_all[3]]] %in% icd_include,
                      1,
                      0)

dat$event <- ifelse(rowSums(dat[event.1:event.3],
                            na.rm = TRUE) > 0,
                    1,
                    0)

But this seems very clunky to me and I would like a more straightforward method where I don't need to create all of these dummy columns. Could anyone please suggest a way to code this? I've tried numerous ways and searched the web exhaustively, but to no avail. Suggested solutions in base R or data.table (to speed up computation) would be particularly appreciated.

`dat$event <- as.integer(rowSums(sapply(dat[-1], \`%in%\`, icd_include), na.rm = TRUE) > 0)` — Ronak Shah, Feb 28 '21 at 12:36
Thanks, @RonakShah! This worked perfectly. And super fast despite a large dataset. — Austin, Feb 28 '21 at 15:42

Counting string matches according to a target vector across rows in a data frame

0 Answers0