Filter rows based on variables "beginning with" strings specified by vector

Question

I'm trying to filter a patient database based on specific ICD9 (diagnosis) codes. I would like to use a vector indicating the first 3 characters of the ICD9 codes.

The example database contains three character variables holding ICD9 codes for each patient visit (var1 to var3).

Below is an example of the data

patient<-c("a","b","c")
var1<-c("8661", "865","8651")
var2<-c("8651","8674","2866")
var3<-c("2430","3456","9089")

observations<-data_frame(patient,var1,var2,var3)

   patient  var1  var2  var3
1       a  8661  8651  2430
2       b  865   8674  3456
3       c  8651  2866  9089

#diagnosis of interest: all beginning with "866" and "867"
dx<-c("866","867")

filtered_data<- filter(observations, var1 %like% dx | var2 %like% dx | var3 %like% dx)

I have tried several approaches including the grep and the %like% functions as you can see above but I haven’t been able to get it working for my case. I would appreciate any help you can provide.

Happy Thanksgiving

Albit

"beginning with" translates to `^` in regular expressions, so `dx<-c("^866","^867")`? — lukeA, Nov 23 '16 at 20:00
Thanks lukeA. I tried this also, but the results do not seem correct, as patient a is the only one that passes the filter. I also get these warnings: Warning messages: 1: In grepl(pattern, vector) : argument 'pattern' has length > 1 and only the first element will be used 2: In grepl(pattern, vector) : argument 'pattern' has length > 1 and only the first element will be used 3: In grepl(pattern, vector) : argument 'pattern' has length > 1 and only the first element will be used – albit paoli, Nov 23 '16 at 20:07
True, make it `dx<-"^866|^867"` (reads like starts with 866 or starts with 867)? — lukeA, Nov 23 '16 at 20:12
In base R, `observations[rowSums(sapply(observations, startsWith, dx)) > 0, ]` — Rich Scriven, Nov 23 '16 at 20:25
This approach would be ideal, as each element of the dx vector is not limited to a specific length. For example, dx<-c("^866","^867","^2981"). However, I have not been able to get it to work. – albit paoli, Nov 23 '16 at 20:38
@RichScriven I don't think this works correctly: `startsWith` recycles the prefix and the string to the same length and then compares them element by element, which doesn't seem like what OP wants. Also check `startsWith(c("8662", "8673", "8552"), c("855", "866")); [1] FALSE FALSE TRUE`. – Psidom, Nov 23 '16 at 20:51
I'm the author of the CRAN package 'icd'. I'm not sure your end goal here, but if it is comorbidity calculation, or any derivation of group flags based on sets of codes, or sanitizing ICD-9 or ICD-10 codes, then 'icd' can help. https://cran.r-project.org/package=icd — Jack Wasey, Jan 13 '19 at 14:46
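
The recycling problem raised in the comments can be sidestepped in base R by testing each prefix separately and OR-ing the results. The following is only a minimal sketch under that assumption: `starts_with_any` is a hypothetical helper name (not from the thread), and the example data is rebuilt with `data.frame` so the block runs on its own.

```r
# Example data from the question, rebuilt as a plain data frame
observations <- data.frame(
  patient = c("a", "b", "c"),
  var1 = c("8661", "865", "8651"),
  var2 = c("8651", "8674", "2866"),
  var3 = c("2430", "3456", "9089"),
  stringsAsFactors = FALSE
)
dx <- c("866", "867")

# Hypothetical helper: TRUE where a code starts with at least one prefix.
# Each prefix is checked on its own, so startsWith() never recycles the
# prefix vector against the codes.
starts_with_any <- function(codes, prefixes) {
  Reduce(`|`, lapply(prefixes, function(p) startsWith(codes, p)))
}

# One logical vector per diagnosis column, combined with OR across columns
keep <- Reduce(`|`, lapply(observations[-1], starts_with_any, prefixes = dx))

filtered <- observations[keep, ]  # keeps patients a and b
```

Because `startsWith` matches literally, prefixes of different lengths such as `c("866", "8674")` should work the same way without building a regular expression.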

score 0 · Answer 1 · answered Nov 23 '16 at 19:59

This looks close to what you're looking for, but requires a bit more manipulation:

library(dplyr)
library(stringr)
library(tidyr)

obs2 <- observations %>%
  gather(vars, value, -patient) %>%
  filter(str_sub(value, 1, 3) %in% dx)

# A tibble: 2 × 3
  patient  vars value
    <chr> <chr> <chr>
1       a  var1  8661
2       b  var2  8674

answered Nov 23 '16 at 19:59

maloneypatr

Thank you maloneypatr. This would work for now. However, the elements of the dx vector are not always 3 characters long. For example, there are cases in which the filter vector could be dx<-c("866","8674"). In that case, with the above example I would be filtering too loosely for the diagnosis "8674", as every code starting with "867" would be included. Thanks – albit paoli Nov 23 '16 at 20:29
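
One way the long-format idea could be adapted to prefixes of different lengths is to replace the fixed `str_sub(value, 1, 3)` test with an anchored regular expression. This is a sketch, not the answerer's code: it assumes tidyr 1.0.0 or later, where `pivot_longer` is the newer counterpart of `gather`, and the names `pattern` and `obs_long` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
observations <- tibble(
  patient = c("a", "b", "c"),
  var1 = c("8661", "865", "8651"),
  var2 = c("8651", "8674", "2866"),
  var3 = c("2430", "3456", "9089")
)

# Prefixes of different lengths, as in the comment above
dx <- c("866", "8674")

# "^(866|8674)": a code matches if it starts with any one of the prefixes
pattern <- paste0("^(", paste(dx, collapse = "|"), ")")

obs_long <- observations %>%
  pivot_longer(-patient, names_to = "vars", values_to = "value") %>%
  filter(grepl(pattern, value))
# Keeps patient a (var1 = "8661") and patient b (var2 = "8674")
```

With this pattern a code such as "8671" would not pass, because only the full prefix "8674" is accepted for that diagnosis.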

score 0 · Accepted Answer · edited May 23 '17 at 11:59

You can build a regex pattern from the vector of interest and apply it to each column of your data frame except the patient id, then use rowSums to check whether any variable in a row matches the pattern:

library(dplyr)
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")

pattern
# [1] "^(866|867)"

filter(observations, rowSums(sapply(observations[-1], grepl, pattern = pattern)) != 0)

# A tibble: 2 × 4
#  patient  var1  var2  var3
#    <chr> <chr> <chr> <chr>
#1       a  8661  8651  2430
#2       b   865  8674  3456

Another option is to use Reduce with lapply:

filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))

# A tibble: 2 × 4
#  patient  var1  var2  var3
#    <chr> <chr> <chr> <chr>
#1       a  8661  8651  2430
#2       b   865  8674  3456

This approach also works when you have more than two patterns and the patterns differ in length, for instance if dx is c("866","867", "9089"):

dx<-c("866","867", "9089")
pattern = paste("^(", paste0(dx, collapse = "|"), ")", sep = "")

pattern
# [1] "^(866|867|9089)"

filter(observations, Reduce("|", lapply(observations[-1], grepl, pattern = pattern)))

# A tibble: 3 × 4
#  patient  var1  var2  var3
#    <chr> <chr> <chr> <chr>
#1       a  8661  8651  2430
#2       b   865  8674  3456
#3       c  8651  2866  9089

See this and this Stack Overflow answer for more about multiple OR conditions in regex.
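
For readers on a newer dplyr, the same "any column matches" test can probably be written with `if_any()`. This is a hedged sketch rather than part of the original answer: it assumes dplyr 1.0.4 or later (where `if_any()` is available), and `filtered` is a name chosen here.

```r
library(dplyr)

# Example data from the question
observations <- tibble(
  patient = c("a", "b", "c"),
  var1 = c("8661", "865", "8651"),
  var2 = c("8651", "8674", "2866"),
  var3 = c("2430", "3456", "9089")
)

dx <- c("866", "867")
pattern <- paste0("^(", paste(dx, collapse = "|"), ")")

# Keep a row if any column other than patient starts with one of the prefixes
filtered <- observations %>%
  filter(if_any(-patient, ~ grepl(pattern, .x)))
# Keeps patients a and b
```

The design is the same as the `Reduce("|", ...)` version: one logical vector per diagnosis column, combined with OR, with the column selection expressed inside `filter()` instead of by indexing.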

score 0 · Answer 3 · edited Nov 23 '16 at 20:57

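You can use apply and ldply:

library(plyr)
# Return a row if any of its values has its first 3 characters in dx (NULL otherwise)
filtered_obs <- apply(observations, 1, function(x) if(sum(substr(x,1,3) %in% dx)>0){x})
# Combine the returned rows into a single data frame
filtered_obs <- plyr::ldply(filtered_obs,rbind)

If the prefixes have a variable number of characters, then this should work: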

filtered_obs <- lapply(dx, function(y)
                 {
                  plyr::ldply(apply(observations, 1, function(x) 
                   {
                    if(sum(substr(x,1,nchar(y)) %in% y)>0){x}
                   }), rbind)
                 })

filtered_obs <- unique(plyr::ldply(filtered_obs,rbind))

edited Nov 23 '16 at 20:57

answered Nov 23 '16 at 20:26

code_is_entropy

Thank you for this solution. The only disadvantage is the limit to 3 characters for dx. – albit paoli Nov 23 '16 at 20:53
Edited for different number of characters – code_is_entropy Nov 23 '16 at 20:57
