I have a large dataset (~520,000 rows and 1,000 columns). A subset of those columns are ICD codes. I would like to perform a sum for each row (among the subset of columns that contain ICD codes) that counts all of the columns for which the ICD code entry matches a list of ICD codes of interest. I would then like to make a new indicator column where the value is 1 if the rowsum is over 0 (i.e., if any column has an ICD code that matches my list) and 0 if there are no columns with entries that match the target list of ICD codes. Here's a simple, reproducible example using a fake dataset:
# create fake dataset
id <- c(500, 550, 560)
icd.1 <- c("C00", "F14", "H15")
icd.2 <- c("F10", "G45", "A40")
icd.3 <- c(NA, "A16", "F13")
dat <- as.data.frame(cbind(id, icd.1, icd.2, icd.3))
# vector of ICD codes to search for
icd_include <- c("C00", "G46", "A16", "F13")
# vector of column names to search
icd_all <- paste0("icd.", seq(1,3))
If I have a single character value that I want to match, then rowSums gives me the perfect solution:
dat$event <- ifelse(rowSums(dat[icd_all] == "C00") > 0,
1,
0)
dat
id icd.1 icd.2 icd.3 event
1 500 C00 F10 <NA> 1
2 550 F14 G45 A16 0
3 560 H15 A40 F13 0
I want to perform the rowSums for matches with all entries in the icd_include vector to produce the following output:
id icd.1 icd.2 icd.3 event
1 500 C00 F10 <NA> 1
2 550 F14 G45 A16 1
3 560 H15 A40 F13 1
However, I can't seem to get this to work when trying to match a vector of many possible strings. In my mind, the logical way to try this would be to use the following code:
dat$event <- ifelse(rowSums(dat[icd_all] %in% icd_include, na.rm = TRUE) > 0,
1,
0)
But this doesn't work and produces the following error:
Error in rowSums(dat[icd_all] %in% icd_include) :
'x' must be an array of at least two dimensions
The current workaround solution I have is to create a dummy column for each column of ICD codes, and to then perform rowSums across these:
dat$event.1 <- ifelse(dat[[icd_all[1]]] %in% icd_include,
1,
0)
dat$event.2 <- ifelse(dat[[icd_all[2]]] %in% icd_include,
1,
0)
dat$event.3 <- ifelse(dat[[icd_all[3]]] %in% icd_include,
1,
0)
dat$event <- ifelse(rowSums(dat[event.1:event.3],
na.rm = TRUE) > 0,
1,
0)
But this seems very clunky to me and I would like a more straightforward method where I don't need to create all of these dummy columns. Could anyone please suggest a way to code this? I've tried numerous ways and searched the web exhaustively, but to no avail. Suggested solutions in base R or data.table (to speed up computation) would be particularly appreciated.