0

So I am trying to convert my dplyr into DT for quicker processing time, but I am unable to convert my ifelse(any(startsWith... statement to DT. Whatever I try, it keeps doing one extreme or the other, or with the case of "Tag" it just says it doesn't exist. Maybe the problem is with rowwise but I can't figure it out. Thanks in advance!

Here's my dplyr code:

df <- df %>% 
  rowwise() %>%
  mutate(
    'Position' = coalesce( 
      ifelse(any(c_across(starts_with("Tag")) == "goalkeeper"), "Goalkeeper", NA),
      ifelse(any(c_across(starts_with("Tag")) == "striker"), "Striker", NA),
    ),
    Favorite = ifelse(any(c_across(starts_with("Tag")) == "favorite"), TRUE, FALSE),
    across(starts_with("Tag"), ~ifelse(. %in% c("goalkeeper", "striker", "favorite"), NA_character_, .))
)

my DT attempts

df[, `Position` := coalesce(
  ifelse(any(startsWith(Tag, "goalkeeper")), "Goalkeeper", NA_character_), #tried this
  ifelse(grepl("striker", "^Tag"), "Striker", NA_character_), #and this
)]

df[, Favorite := any(startsWith(Tag1, "favorite"))]

df[, (grep("Tag", names(df), value = TRUE)) :=
             lapply(.SD, function(x) ifelse(x %in% c("goalkeeper", "striker", "favorite"), NA_character_, x)),
           .SDcols = patterns("Tag")]

Data:

Name Tag1 Tag2 Tag3
A goalkeeper NA NA
B NA striker favorite

Expected output:

Name Position Favorite
A Goalkeeper FALSE
B Striker TRUE
  • 1
    Please share a few rows of sample data with the desired output. – Gregor Thomas Jun 10 '23 at 01:53
  • 1
    Since you seem to be aiming for efficiency, note that regex is not efficient for checking equality of strings. If you are checking for a whole string match you can use `==` or `%in%` directly which will be much faster. You only need `grepl` and the like if you are (a) using pattern terms, like `pattern = "(goal)?keeper"` to match "goalkeeper" or just "keeper" or (b) if you are looking for patterns within larger strings, like if your data had entries like `"team goalkeeper"` and `"goalkeeper for team 2"`. – Gregor Thomas Jun 10 '23 at 13:38
  • Your dplyr code does not generate your expected output, is that known? – r2evans Jun 10 '23 at 17:53
  • @r2evans I actually deleted the first part of the code since it was irrelevant... It does generate the expected output with the full code – user165426 Jun 10 '23 at 22:17

2 Answers2

1

Apply function tidyData which finds position/favorite to each row. To do it on rows use transpose. Second transpose is to get wanted data in form of 2 columns.

tidyData <- function(vec){
  vec <- vec[!is.na(vec)]
  c(position = vec[vec != "favorite"], favorite = any(vec == "favorite"))
}

dt[
  , 
  (c("position", "favorite")) := transpose(lapply(transpose(.SD), tidyData)),
  .SDcols = startsWith(names(dt), "Tag")
][, .(name, position, favorite)]

data:

dt <- data.table(
  name = c("A", "B"),
  Tag1 = c("golakeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)
det
  • 5,013
  • 1
  • 8
  • 16
0

Since you're doing multi-column snapshots row-wise, I don't know that there are awesome ways to go about it, but perhaps this is sufficient?

tags <- grep("Tag", names(df), value=TRUE)
tags
# [1] "Tag1" "Tag2" "Tag3"

df[, c("Position", "Favorite") := .(
  apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
  apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags]
df
#      Name       Tag1    Tag2     Tag3   Position Favorite
#    <char>     <char>  <char>   <char>     <char>   <lgcl>
# 1:      A goalkeeper    <NA>     <NA> goalkeeper    FALSE
# 2:      B       <NA> striker favorite    striker     TRUE

(And you can easily remove the tags.)

The use of apply is a little costly in that it causes the frame (.SD, which in this case is just the Tag# columns) to be converted to a matrix internally. It's because of this conversion that the use of apply in the context of frame rows can be expensive, rightfully so.

An alternative:

fun <- function(...) {
  dots <- unlist(list(...))
  list(Position = intersect(c("goalkeeper", "striker"), dots)[1], Favorite = "favorite" %in% dots)
}
df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags]

The two perform at somewhat the same speed (median, `itr/sec`) but the first has a lower mem_alloc, perhaps suggesting that it may be better for larger data. But don't be too hasty benchmarking on small data ...

bench::mark(
  a = df[, c("Position", "Favorite") := .(
    apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
    apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
  b = df[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags],
  min_iterations=10000)
# # A tibble: 2 × 13
#   expression     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 a            243µs  288µs     3262.    16.4KB     12.1  9963    37      3.05s <dt>   <Rprofmem> <bench_tm> <tibble>
# 2 b            253µs  293µs     3109.    48.7KB     10.6  9966    34      3.21s <dt>   <Rprofmem> <bench_tm> <tibble>

Expanding it to be a larger dataset,

dfbig <- rbindlist(replicate(10000, df, simplify=FALSE))

we get these benchmarking results:

bench::mark(
  a = dfbig[, c("Position", "Favorite") := .(
    apply(.SD, 1, function(z) intersect(c("goalkeeper", "striker"), z)[1]), 
    apply(.SD, 1, function(z) "favorite" %in% z)), .SDcols = tags],
  b = dfbig[, c("Position", "Favorite") := rbindlist(do.call(Map, c(list(f=fun), .SD))), .SDcols = tags], 
  min_iterations = 500)
# # A tibble: 2 × 13
#   expression     min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory     time       gc      
#   <bch:expr> <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>     <list>     <list>  
# 1 a            202ms  257ms      3.78    2.69MB    12.5    500  1655      2.21m <dt>   <Rprofmem> <bench_tm> <tibble>
# 2 b            218ms  398ms      2.56  908.43KB     6.19   500  1210      3.26m <dt>   <Rprofmem> <bench_tm> <tibble>

The mem_alloc is lower for the second (Map) implementation, though median and `itr/sec` are a little slower. I don't know which is better in your case.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thank you, it works! But how do I now capitalize the first letter in the result? In my code, I had, for example, "goalkeeper" that I mutated into "Goalkeeper". – user165426 Jun 11 '23 at 13:01
  • Use `tools::toTitleCase` in `fun`: change to `list(Position = tools::toTitleCase(intersect(c("goalkeeper", "striker"), dots)), ...)` – r2evans Jun 11 '23 at 14:09
  • But there are several other ways: translation table and `left_join` (or data.table's join mechanics); named vector (as a dictionary of sorts) for replacement; and `fcase`/`case_when` are three alternatives. – r2evans Jun 11 '23 at 14:11