How to use custom function to create new binary variables within existing dataframe?

Question

I'm trying to create a custom function that generates new binary variables in an existing dataframe. The idea is to pass the function a diagnosis description (string), an ICD9 diagnosis code (stored as a string), and the patient database. The function would then generate a new variable for each diagnosis of interest and assign a 0 or 1 depending on whether the patient (row or observation) has that diagnosis.

Below are the function variables:

x<-c("2851") #ICD9 for Anemia
y<-c("diag_1") #Primary diagnosis 
z<-"Anemia"  #Name of new binary variable for patient dataframe
i<-patient_db #patient dataframe

patient<-c("a","b","c")
diag_1<-c("8661", "2851","8651")
diag_2<-c("8651","8674","2866")
diag_3<-c("2430","3456","9089")

patient_db<-data_frame(patient,diag_1,diag_2,diag_3)

  patient  diag_1 diag_2 diag_3
1       a  8661   8651   2430
2       b  2851   8674   3456
3       c  8651   2866   9089

Below is the function:

diagnosis_func<-function(x,y,z,i){

pattern = paste("^(", paste0(x, collapse = "|"), ")", sep = "")

i$z<-ifelse(rowSums(sapply(i[y], grepl, pattern = pattern)) != 0,"1","0")

}

This is what I would like to get at after running the function:

  patient  diag_1 diag_2 diag_3  Anemia
1       a  8661   8651   2430      0
2       b  2851   8674   3456      1
3       c  8651   2866   9089      0

The lines within the function have been tested outside the function and are working. Where I'm stuck is trying to get the function working. Any help would be greatly appreciated.

Happy New Year

Albit

I think you are missing a return value in your function. Simply adding a line `return(i)` to your function should solve the problem. — raymkchow, Jan 04 '17 at 02:57
Thanks for your prompt reply Raymkchow. I just tried return(i), it populates the dataframe in the console but does not add the new variable. — albit paoli, Jan 04 '17 at 03:04
Objects in R are immutable so you cannot pass `i` by reference. You have to assign the variable like `patient_db <- diagnosis_func(x,y,z,i)`. Also, the fourth line of your code (`i<-patient_db #patient dataframe`) should be put after the declaration of `patient_db` to get the correct `i`. — raymkchow, Jan 04 '17 at 03:13
Yes, the order was for explanatory purposes only, in the actual code only variables x and y are defined. "Anemia" and patient_db (already in global environment) are defined as function arguments. — albit paoli, Jan 04 '17 at 03:27
This is how I'm passing the arguments to the function: `diagnosis_func(x,y,"Anemia",patient_db)` — albit paoli, Jan 04 '17 at 04:43
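
To make the point in the comments above concrete, here is a minimal sketch of why the column never shows up. It assumes a plain base R `data.frame` rather than the `data_frame()` call in the question, and `add_flag` is a hypothetical helper used only for illustration. Two things are going on: the function works on a copy of the data, so the result has to be assigned back, and `i$z` would create a column literally named `z`, whereas `[[new_col]]` uses the string stored in the argument.

```r
# Toy data mirroring part of the question's patient_db (base R data.frame assumed)
patient_db <- data.frame(
  patient = c("a", "b", "c"),
  diag_1  = c("8661", "2851", "8651"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: adds a column named by `new_col` and returns the modified copy
add_flag <- function(data, new_col) {
  data[[new_col]] <- 0L  # data$new_col would create a column called "new_col" instead
  data                   # return the modified copy; the caller's object is untouched
}

add_flag(patient_db, "Anemia")  # prints a 3-column result, but nothing is saved
ncol(patient_db)                # still 2

patient_db <- add_flag(patient_db, "Anemia")  # assign the result back
names(patient_db)                             # now includes "Anemia"
```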

score 1 · Accepted Answer · answered Jan 04 '17 at 11:57

If you are intending to only work with one diagnosis at a time, this will work. I took the liberty of renaming arguments to be a little easier to work with in the code.

diagnosis_func <- function(data, target_col, icd, new_col){
  pattern <- sprintf("^(%s)", 
                     paste0(icd, collapse = "|"))

  data[[new_col]] <- grepl(pattern = pattern, 
                           x = data[[target_col]]) + 0L
  data
}

diagnosis_func(patient_db, "diag_1", "2851", "Anemia")

# Multiple codes for a single diagnosis
diagnosis_func(patient_db, "diag_1", c("8661", "8651"), "Dx")
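
For anyone wondering what the two lines inside the function do, this small sketch shows the pattern that gets built and how `+ 0L` turns the logical result into 0/1 integers. The `codes` vector is made up for illustration. Note that the `^` anchor only requires the code to start with one of the ICD values, so a longer code such as `"86510"` also matches.

```r
icd <- c("8661", "8651")

# Collapse the codes into one alternation, anchored at the start of the string
pattern <- sprintf("^(%s)", paste0(icd, collapse = "|"))
pattern
# "^(8661|8651)"

# Made-up codes: exact match, no match, longer code with matching prefix, match not at start
codes <- c("8661", "2851", "86510", "18661")

grepl(pattern, codes)                # TRUE FALSE TRUE FALSE
flags <- grepl(pattern, codes) + 0L  # adding 0L coerces logical to integer
flags                                # 1 0 1 0
```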

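If you want to spruce it up a little to prevent inadvertent mistakes, you can install the checkmate package and use this version. This will check the type and length of each argument before doing any work and report all of the failed checks together.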

diagnosis_func <- function(data, target_col, icd, new_col){

  coll <- checkmate::makeAssertCollection()

  checkmate::assert_class(x = data,
                          classes = "data.frame",
                          add = coll)

  checkmate::assert_character(x = target_col,
                              len = 1,
                              add = coll)

  checkmate::assert_character(x = icd,
                              add = coll)

  checkmate::assert_character(x = new_col,
                              len = 1,
                              add = coll)

  checkmate::reportAssertions(coll)

  pattern <- sprintf("^(%s)", 
                     paste0(icd, collapse = "|"))

  data[[new_col]] <- grepl(pattern = pattern, 
                           x = data[[target_col]]) + 0L
  data
}

diagnosis_func(patient_db, "diag_1", "2851", "Anemia")

Benjamin, sorry, I was traveling. When I finally tried the two versions, neither worked. When you tried it, did the number of columns in `patient_db` increase by one? — albit paoli, Jan 06 '17 at 08:25
After looking at this again, I'm getting the exact output described in your question. Could you describe in more detail what "neither worked" looks like? Are you getting error messages? Are you saving your result to an object (`patient_db <- diagnosis_func(patient_db, ...)`)? — Benjamin, Jan 06 '17 at 10:30
I was not saving it as an object. It works now! Thank you Benjamin. — albit paoli, Jan 06 '17 at 12:27
Benjamin, I've been trying to use the same code to filter on two or more `target_col` columns, e.g. `diagnosis_func(patient_db, c("diag_1","diag_2"), "2851", "Anemia")`, but I'm getting the following error: `Error in .subset2(x, i, exact = exact) : subscript out of bounds`. Do you know how I could get it to work? Thanks — albit paoli, Jan 14 '17 at 17:46
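
The error in the last comment most likely comes from `data[[target_col]]`: with a vector of length two, `[[` does recursive indexing instead of selecting two columns. One possible way around it, sketched here under the assumption that a patient should be flagged when any of the listed columns matches, is to test each column separately and combine the results with a logical OR. The name `diagnosis_func_multi` is hypothetical and not part of the accepted answer.

```r
# Data from the question, built with base R data.frame for self-containment
patient_db <- data.frame(
  patient = c("a", "b", "c"),
  diag_1  = c("8661", "2851", "8651"),
  diag_2  = c("8651", "8674", "2866"),
  diag_3  = c("2430", "3456", "9089"),
  stringsAsFactors = FALSE
)

# Hypothetical multi-column variant of diagnosis_func
diagnosis_func_multi <- function(data, target_cols, icd, new_col) {
  pattern <- sprintf("^(%s)", paste0(icd, collapse = "|"))

  # data[target_cols] keeps a data frame, so lapply yields one logical vector per column
  hits <- lapply(data[target_cols], function(col) grepl(pattern, col))

  # Element-wise OR across columns: 1 if any target column matches, else 0
  data[[new_col]] <- Reduce(`|`, hits) + 0L
  data
}

res <- diagnosis_func_multi(patient_db, c("diag_1", "diag_2"), "2851", "Anemia")
res$Anemia
# 0 1 0
```

As with the single-column version, the result has to be assigned to an object (here `res`) for the new column to be kept.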
