I have written a custom function that adds a category based on a grepl match, using data.table's assignment by reference. If a category is already assigned, the function pastes the new category onto the existing value.

I want the function to skip the paste when the category already exists within the string. I am trying to do this with grepl, and that is where I am running into problems: when I test the condition outside of the function it behaves as expected, but inside the function the category is appended again. A reprex is below.

library(data.table)


## create data.table data frame
data <- data.table(search_term = c("Internal Search Terms"), category = NA)
data[, category := as.character(category)]

## custom function
add_cat <- function(df, pat, cat){
  
  ## if not NA, paste to existing term
  df[!is.na(category) & 
       grepl(pat, search_term, ignore.case = T) &
       !grepl(paste0('/', pat, '$|','/', pat, '/'), category, ignore.case = T), # looking for pattern already existing here
     category := paste(category, cat, sep = "/")]
  
  ## add category if NA
  df[is.na(category) & grepl(pat, search_term, ignore.case = T), category := cat]
  
}

## add testing
add_cat(data, "Internal Search Terms", "Testing")
head(data)

## add 123
add_cat(data, "Internal Search Terms", "123")
head(data) 

## try to add 123 again, it shouldn't but it does
add_cat(data, "Internal Search Terms", "123")
head(data)

## test condition outside of the function 

## using paste function
pattern <- "123"

## returns 0 rows
data[!is.na(category) & 
     grepl(pattern, search_term, ignore.case = T) &
     !grepl(paste0('/', pattern, '$|','/', pattern, '/'), category, ignore.case = T)]

## using raw values - also returns 0 rows
data[!is.na(category) & 
       grepl(pattern, search_term, ignore.case = T) &
       !grepl('/123$|/123/', category, ignore.case = T)]
dmunslow
  • When I try your sample code, it errors with "item 2 has 0 rows". Have you run this code without error? What versions of R and `data.table` are you using? – r2evans Jul 29 '20 at 15:51
  • Strange, not sure what was going on there, I got the same error on my other machine. I have fixed the code and I am still having the same issue. – dmunslow Jul 29 '20 at 16:04

1 Answer

Edit: I'll keep the original string-concatenation method at the bottom for reference, but it will fail if a new tag is wholly contained within a previously added tag, as in:

add_cat(data, "Internal Search Terms", "Testing Here")
#              search_term     category
# 1: Internal Search Terms Testing Here
add_cat(data, "Internal Search Terms", "Testing")
#              search_term     category
# 1: Internal Search Terms Testing Here

(Because with word-boundaries as I suggested, "Testing" is matched by "Testing Here". While one could shift from word-boundaries to sep-boundaries, you are then restricted from having sep in a valid tag. That might be safe for your one application, but it is not safe "generally".)
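To make the failure concrete, here is a small base-R sketch. The helper `has_tag` is a hypothetical name, not part of the code in this answer; it compares the new tag against the existing tags after splitting on the separator, and it assumes `sep` never occurs inside a tag.

```r
# Word-boundary regex: "Testing" is found inside "Testing Here"
grepl("\\bTesting\\b", "Testing Here", ignore.case = TRUE)
# [1] TRUE

# Hypothetical helper: exact, case-insensitive comparison against
# the tags obtained by splitting the string on `sep`
has_tag <- function(category, cat, sep = "/") {
  tags <- strsplit(category, sep, fixed = TRUE)
  vapply(tags, function(a) tolower(cat) %in% tolower(a), logical(1))
}

has_tag("Testing Here", "Testing")
# [1] FALSE
has_tag("Testing Here/123", "123")
# [1] TRUE
```

The split-based check avoids the false positive, but it still breaks as soon as a tag itself contains `sep`.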

Because of this, I think the list-column approach is the preferred one and my recommendation.


If you're planning on using this as a "set" later, thinking that you'll strsplit(..., "/"), here's an alternative that keeps all tags separate by storing them in a list-column:

add_cat2 <- function(x, pat, cat) {
  isna <- lengths(x$category) < 1 | sapply(x$category, function(a) all(is.na(a)))
  match <- !isna & !sapply(x$category, function(a) tolower(cat) %in% tolower(a))
  x[ isna, category := list(cat) ][ match, category := lapply(category, c, cat) ][]
}

data2 <- data.table(search_term = c("Internal Search Terms"), category = list(NA_character_))
data2
#              search_term category
# 1: Internal Search Terms       NA
add_cat2(data2, "Internal Search Terms", "Testing")
#              search_term category
# 1: Internal Search Terms  Testing
add_cat2(data2, "Internal Search Terms", "Testing")
#              search_term category
# 1: Internal Search Terms  Testing
add_cat2(data2, "Internal Search Terms", "123")
#              search_term    category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "Testing")
#              search_term    category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "123")
#              search_term    category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "123")
#              search_term    category
# 1: Internal Search Terms Testing,123
data2$category
# [[1]]
# [1] "Testing" "123"    
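If a slash-separated string is still needed for display or export, the list of tags can be collapsed on demand. This is a sketch in base R on a plain list standing in for the list-column; the names `cats` and `joined` are illustrative only.

```r
# Stand-in for a list-column: one populated cell, one NA cell
cats <- list(c("Testing", "123"), NA_character_)

# Collapse each cell to a single string, keeping NA cells as NA
joined <- vapply(
  cats,
  function(a) if (all(is.na(a))) NA_character_ else paste(a, collapse = "/"),
  character(1)
)
joined
# [1] "Testing/123" NA

# Set-style membership test on the list form needs no regex
"123" %in% cats[[1]]
# [1] TRUE
```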

(It's just as appropriate to initialize the category column with list() instead of list(NA_character_), as that will still match in the isna conditional by using lengths(.) < 1.)
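The `isna` conditional can be checked on its own with a plain list standing in for the list-column. This is only a sketch: the `cats` list below is made up for illustration, with one empty cell, one NA-only cell, and one populated cell.

```r
# Stand-in for a list-column: empty, NA-only, and populated cells
cats <- list(character(0), NA_character_, c("Testing", "123"))

# Same expression as in add_cat2, applied to the plain list
isna <- lengths(cats) < 1 | sapply(cats, function(a) all(is.na(a)))
isna
# [1]  TRUE  TRUE FALSE
```

Both the empty cell and the NA-only cell are treated as "no category yet", so either initialization takes the first branch.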

Last point: the trailing [] in the functions is there merely so that the object prints correctly (the first time) on the console, per this comment and https://github.com/Rdatatable/data.table/blob/master/NEWS.0.md#bug-fixes-1.


Older word-boundaries method:

I think your pattern matching is based on pat when it should be based on cat. Here's effectively the same function, with a couple of changes: I move the conditionals outside of the data.table code, and I use word-boundaries instead of including the sep in the pattern.

add_cat <- function(x, pat, cat, sep = "/") {
  isna <- is.na(x$category)
  match <- !isna & !grepl(paste0("\\b", cat, "\\b"), x$category, ignore.case = TRUE)
  x[ isna, category := cat ][ match, category := paste(category, cat, sep = sep) ][]
}

data <- data.table(search_term = c("Internal Search Terms"), category = NA_character_)
data
#              search_term category
# 1: Internal Search Terms     <NA>
add_cat(data, "Internal Search Terms", "Testing")
#              search_term category
# 1: Internal Search Terms  Testing
add_cat(data, "Internal Search Terms", "Testing")
#              search_term category
# 1: Internal Search Terms  Testing
add_cat(data, "Internal Search Terms", "Testing")
#              search_term category
# 1: Internal Search Terms  Testing
add_cat(data, "Internal Search Terms", "123")
#              search_term    category
# 1: Internal Search Terms Testing/123
add_cat(data, "Internal Search Terms", "Testing")
#              search_term    category
# 1: Internal Search Terms Testing/123
add_cat(data, "Internal Search Terms", "123")
#              search_term    category
# 1: Internal Search Terms Testing/123
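The pat-versus-cat mix-up can be checked directly in base R, without data.table. The sketch below rebuilds the guard from the question for a category value of "Testing/123"; the variable values are taken from the reprex.

```r
category <- "Testing/123"
pat <- "Internal Search Terms"   # matches search_term, not the tag
cat <- "123"                     # the tag being added

# Guard as written in the question: built from `pat`, so it never finds "123"
grepl(paste0("/", pat, "$|/", pat, "/"), category, ignore.case = TRUE)
# [1] FALSE

# Same guard built from `cat`: the existing tag is found
grepl(paste0("/", cat, "$|/", cat, "/"), category, ignore.case = TRUE)
# [1] TRUE

# The sep-based guard still misses a tag in first position
grepl("/Testing$|/Testing/", category, ignore.case = TRUE)
# [1] FALSE
```

Because the first result is FALSE, the negated guard in the question is always TRUE and the tag is appended again. The last call suggests that simply swapping `pat` for `cat` in the original pattern would still re-add the first tag, since nothing precedes it with a slash.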
r2evans
  • That works perfectly, thank you! That list addition is neat as well. Just out of curiosity, what is the upside to creating the booleans as variables rather than having them inside the DT call? i.e. isna <- is.na(x$category). Is it a readability thing? – dmunslow Jul 29 '20 at 16:15
  • Readability and debugging. For the latter, I find it's preferable when debugging to see what conditionals are being used in a DT conditional assignment; if the conditionals are inside the `[]`, then I end up running them manually and capturing the result to troubleshoot where it is misbehaving. When I set them up as variables like these, it is both easier to debug and, in my opinion, a lot easier to read. Multi-line `data.table` code can get complicated; I just like to keep it simple. – r2evans Jul 29 '20 at 16:17
  • I don't think there's any *performance* difference between them. If you're playing code-golf, feel free to concatenate (and even use `T` instead of `TRUE`), but (again, imho) code-golf often leads to less-maintainable or at best less-readable code. – r2evans Jul 29 '20 at 16:19
  • (Further, I was already considering the list-column approach when I rewrote the first function, so I was planning on breaking out `isna` already. And one last point, "visually complicated" conditionals in data.table get a little more complicated when adding parentheses and dollars and other non-letters. The more nested parens/braces/brackets are present, the more I have to hunt around for the matching ends to make sure I *see* things correctly. It's a habit learned over many years of coding, it's not always self-evident or strictly necessary everywhere.) – r2evans Jul 29 '20 at 16:22
  • That all makes a lot of sense, thank you for your thorough explanations! – dmunslow Jul 29 '20 at 16:33
  • dmunslow, I thought about it some more and found a corner-case where the word-boundary method can fail. – r2evans Jul 29 '20 at 16:57