Edit: I'll keep the original string-concatenation method at the bottom for reference, but it fails whenever a new tag is a whole-word substring of a tag that is already present, as in:
add_cat(data, "Internal Search Terms", "Testing Here")
# search_term category
# 1: Internal Search Terms Testing Here
add_cat(data, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing Here
(With word boundaries, as I suggested, the pattern for "Testing" also matches inside "Testing Here", so the second call adds nothing. One could switch from word boundaries to sep boundaries, but then sep can never appear inside a valid tag. That might be safe for your one application, but it is not safe in general.)
Because of this, I think the list-column approach is the preferred one and my recommendation.
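To make the failure mode concrete, here is a minimal base-R sketch of the two checks discussed above. The names existing, new_tag, word_hit and sep_hit are only for illustration and are not part of either function.

```r
existing <- "Testing Here"   # tag string already stored in the category column
new_tag  <- "Testing"        # tag we are trying to add

# Word-boundary check, as in add_cat(): the new tag is found inside the
# existing one, so it is wrongly treated as already present.
word_hit <- grepl(paste0("\\b", new_tag, "\\b"), existing, ignore.case = TRUE)
word_hit
# [1] TRUE

# Separator-based check: split on sep and compare whole tags.
# This only works if sep never occurs inside a tag.
sep <- "/"
sep_hit <- tolower(new_tag) %in% tolower(strsplit(existing, sep, fixed = TRUE)[[1]])
sep_hit
# [1] FALSE
```

The separator-based check gives the right answer here, but only under the assumption that no tag ever contains the separator, which is the restriction the list-column avoids.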
If you're planning on using this as a "set" later, thinking that you'll strsplit(..., "/"), here's an alternative that keeps all tags separate by storing them in a list-column:
add_cat2 <- function(x, pat, cat) {
  # rows whose category is empty or contains only NA
  isna <- lengths(x$category) < 1 | sapply(x$category, function(a) all(is.na(a)))
  # rows that already have tags, but not this one (case-insensitive)
  match <- !isna & !sapply(x$category, function(a) tolower(cat) %in% tolower(a))
  x[ isna, category := list(cat) ][ match, category := lapply(category, c, cat) ][]
}
data2 <- data.table(search_term = c("Internal Search Terms"), category = list(NA_character_))
data2
# search_term category
# 1: Internal Search Terms NA
add_cat2(data2, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing
add_cat2(data2, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing
add_cat2(data2, "Internal Search Terms", "123")
# search_term category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "123")
# search_term category
# 1: Internal Search Terms Testing,123
add_cat2(data2, "Internal Search Terms", "123")
# search_term category
# 1: Internal Search Terms Testing,123
data2$category
# [[1]]
# [1] "Testing" "123"
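Since each row now holds a plain character vector, set-style membership tests need no splitting at all. A small sketch in base R, where cats stands in for data2$category and has_tag is a hypothetical helper that is not part of the answer's functions:

```r
cats <- list(c("Testing", "123"))   # stand-in for data2$category

# case-insensitive membership test on one row's tags
has_tag <- function(tags, tag) tolower(tag) %in% tolower(tags)

sapply(cats, has_tag, tag = "testing")
# [1] TRUE
sapply(cats, has_tag, tag = "Test")
# [1] FALSE
```

Note that "Test" is not reported as present, because whole tags are compared rather than substrings.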
(It's just as appropriate to initialize the category column with list() instead of list(NA_character_), as that will still match in the isna conditional by using lengths(.) < 1.)
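As a quick check of that conditional, here is the same expression applied to a plain list rather than to x$category. The list cats is made up for illustration: one empty element, one NA element, and one element that already has tags.

```r
cats <- list(character(0), NA_character_, c("Testing", "123"))

# same condition as in add_cat2(), on a plain list
isna <- lengths(cats) < 1 | sapply(cats, function(a) all(is.na(a)))
isna
# [1]  TRUE  TRUE FALSE
```

Both the empty and the NA element count as "no category yet", so either starting value leads to the same first assignment.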
Last point: the trailing [] in the functions is there merely so that the object prints correctly (the first time) on the console, per this comment and https://github.com/Rdatatable/data.table/blob/master/NEWS.0.md#bug-fixes-1.
Older word-boundaries method:
I think your pattern matching is based on pat when it should be based on cat. Here's effectively the same function, with a couple of changes: I move the conditionals outside of the data.table code, and I use word boundaries instead of including the sep in the pattern.
add_cat <- function(x, pat, cat, sep = "/") {
  isna <- is.na(x$category)
  match <- !isna & !grepl(paste0("\\b", cat, "\\b"), x$category, ignore.case = TRUE)
  x[ isna, category := cat ][ match, category := paste(category, cat, sep = sep) ][]
}
data <- data.table(search_term = c("Internal Search Terms"), category = NA_character_)
data
# search_term category
# 1: Internal Search Terms <NA>
add_cat(data, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing
add_cat(data, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing
add_cat(data, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing
add_cat(data, "Internal Search Terms", "123")
# search_term category
# 1: Internal Search Terms Testing/123
add_cat(data, "Internal Search Terms", "Testing")
# search_term category
# 1: Internal Search Terms Testing/123
add_cat(data, "Internal Search Terms", "123")
# search_term category
# 1: Internal Search Terms Testing/123