1

I am looking for a way to create a new data.table column within a piped sequence using grepl looking for any occurrence of a particular string.

I have looked here and here for help, and there seems to be many questions around this topic but it doesn't seem to directly address my issue.

Also, I may be misunderstanding the data.table syntax and am referencing the Reference semantics vignettes. I have the code below with two approaches that could be piped/chained but don't seem to work. The last option where the data.table column is created explicitly seems to work but I am wondering if it can be chained/piped.

To my understanding, using lapply within a data.table will apply a function to the entire column (ie sum, mean, na.approx which I found out from another posted question) but will not work on a row-wise basis. Also, I can apply a function to each row in a given column using the new_col := function(x). So I would have thought one of those to work.

I am (only somewhat) aware that the grepl is expecting a single value but a vector is being supplied and I am unsure of how to fix that.

Any help is appreciated, thanks.

> library(data.table)
> 
> a = c("housefly",
+       "house fly",
+       "HOUSEFLY",
+       "HOUSE FLY")
> 
> dt = data.table(insect = c("housefly",
+                            "house fly",
+                            "HOUSEFLY",
+                            "HOUSE FLY",
+                            "dragonfly",
+                            "dragon fly"))
> 
> # does not work but I could put this in chain/pipe
> dt[, fly_check := sapply(.SD, grepl, paste(a, collapse = "|")), .SDcols = "insect"]
Warning message:
In FUN(X[[i]], ...) :
  argument 'pattern' has length > 1 and only the first element will be used
> dt
       insect fly_check
1:   housefly      TRUE
2:  house fly      TRUE
3:   HOUSEFLY      TRUE
4:  HOUSE FLY      TRUE
5:  dragonfly      TRUE
6: dragon fly      TRUE
> 
> # does not work but I could put this in chain/pipe
> dt[, fly_check := ifelse(grepl(insect, paste(a, collapse = "|")), TRUE, FALSE)]
Warning message:
In grepl(insect, paste(a, collapse = "|")) :
  argument 'pattern' has length > 1 and only the first element will be used
> dt
       insect fly_check
1:   housefly      TRUE
2:  house fly      TRUE
3:   HOUSEFLY      TRUE
4:  HOUSE FLY      TRUE
5:  dragonfly      TRUE
6: dragon fly      TRUE
> 
> # works but can't be chained/piped
> dt$fly_check = sapply(dt$insect, grepl, pattern = paste(a, collapse = "|"))
> dt
       insect fly_check
1:   housefly      TRUE
2:  house fly      TRUE
3:   HOUSEFLY      TRUE
4:  HOUSE FLY      TRUE
5:  dragonfly     FALSE
6: dragon fly     FALSE
Prevost
  • 677
  • 5
  • 20
  • Sorry, should have made the desired outcome more clear. It is the last printed `dt` in the code output. I am looking for the fly_check column to return `TRUE` if the insect column contains the string "fly". – Prevost Oct 04 '18 at 01:11
  • 1
    I believe you are looking for: `dt[, fly_check := (Vectorize(grepl)(paste0(a, collapse = "|"),.SD)), .SDcols = c("insect")]` – PKumar Oct 04 '18 at 01:22
  • Sure am! That works. You post it I will mark it. Thank you for the quick response @chinsoon12...and now others. – Prevost Oct 04 '18 at 01:23
  • @chinsoon12 if you post your comment as a solution i will mark it as correct as you were the first to respond... – Prevost Oct 04 '18 at 12:05

2 Answers2

3

It seems you are looking for this, although a much simpler solution is given by @chinsoon12 (in the comments) which is just to pass the column name directly:

Logic: In data.table if you call it using .SD parameter , it means subset of data, which also suggest the column is passed not as a vector but as data.table object ( Hence you have to use Vectorize or other operations), On the other hand if you pass it directly as column, grepl has no problem working on vector like structure (@Chinsoon12 solution).

You may check this, its a very enlightening link.

dt[, fly_check := (Vectorize(grepl)(paste0(a, collapse = "|"),.SD)), .SDcols = c("insect")]

which leads to :

#       insect fly_check
#1:   housefly      TRUE
#2:  house fly      TRUE
#3:   HOUSEFLY      TRUE
#4:  HOUSE FLY      TRUE
#5:  dragonfly     FALSE
#6: dragon fly     FALSE
PKumar
  • 10,971
  • 6
  • 37
  • 52
2

I think it's just a simple application of %in% to the a values

dt[ , fly_check := insect %in% a]

Which seems simpler than the application of grepl to the collapsed a values as a pattern.

dt[ , fly_check := grepl( paste0(a, collapse="|") , insect)]

> dt
       insect fly_check
1:   housefly      TRUE
2:  house fly      TRUE
3:   HOUSEFLY      TRUE
4:  HOUSE FLY      TRUE
5:  dragonfly     FALSE
6: dragon fly     FALSE

I suppose you might still choose the grepl strategy if you gained generality with ignore.case=TRUE or needed facilities provided by the 'perl' or 'fixed' parameters.

IRTFM
  • 258,963
  • 21
  • 364
  • 487