
I have a data table "dt" like the following:

    a b1 b2 b3 b4     t    t1
 1: 1  4  1  9 NA FALSE  TRUE
 2: 2  5  1  9  2 FALSE  TRUE
 3: 3  6  1  9 NA FALSE FALSE

library(data.table)
dt = setDT(structure(list(a = 1:3, b1 = 4:6, b2 = c(1L, 1L, 1L), b3 = c(9L, 
9L, 9L), b4 = c(NA, 2L, NA), t = c(FALSE, FALSE, FALSE), t1 = c(TRUE, 
TRUE, FALSE)), .Names = c("a", "b1", "b2", "b3", "b4", "t", "t1"
), row.names = c(NA, -3L), class = "data.frame"))

I want to create a column t1 that is TRUE if, row by row, the value of "a" appears in any of the "b" columns, like so:

dt[,t1 := a %in% c(b1,b2,b3,b4)]

However, I'm using this code in a function and the number of b columns may vary, so I'd like to refer to them dynamically.

I've tried something like this:

dt[,t := a %in% paste0("b", 1:4)]

but it looks for the values of a in the strings "b1", "b2", "b3", or "b4" and returns FALSE. Is there a way to refer to the columns of a data table dynamically in j?

wellshux
  • It seems that your code does not respect the "row-by-row" condition. E.g. if `dt$a[3] = 2`, then `dt$t1[3]` is `TRUE`. – alexis_laz Jan 05 '17 at 22:18
  • Could also try something like `melt(dt, id = "a", measure.vars = patterns("b"))[!is.na(value), any(value == .BY[[1L]]), by = a]` – David Arenburg Jan 05 '17 at 22:52
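
For reference, a runnable sketch of David Arenburg's melt suggestion, joining the grouped result back into dt (this assumes, as in the example, that the values of a identify rows uniquely):

# long format: one row per (a, b-column) pair; t and t1 are dropped
long <- melt(dt, id = "a", measure.vars = patterns("b"))
# for each value of a, check whether any non-NA b value matches it
res <- long[!is.na(value), any(value == .BY[[1L]]), by = a]
# join the result back into dt by a
dt[res, t1 := V1, on = "a"]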

2 Answers


This does what you want. It uses the grepl function to grab the indices of the columns whose names start with b. Those column indexes are then used to test whether the value in the a column equals any of the values in the b columns.

Summing across the Boolean results of that comparison (ignoring NA values) and testing for a sum > 0 tells us that at least one comparison was TRUE; that result is assigned to the t1 column.

This also has the virtue (if it is one) that the b columns do not have to be contiguous.

# indices of columns whose names start with "b"
bcols = which(grepl('^b', names(dt)))
# row-wise: is any b value equal to a? (data.frame-style indexing; see comments below)
dt$t1 = apply(dt$a == dt[, bcols], 1, sum, na.rm=TRUE) > 0
Edward Carney
  • Why would you run an inefficient by row loop for such simple vectorized operation? Why not just `rowSums(dt[, "a"] == dt[, bcols], na.rm = TRUE) > 0`? – David Arenburg Jan 05 '17 at 22:41
  • Good point. I just "blurted" my first thought. I appreciate the refinement. – Edward Carney Jan 05 '17 at 22:44
  • dt[,bcols] doesn't seem to pull up the subtable. It just creates a vector of the column numbers – wellshux Jan 06 '17 at 00:51
  • @wellshux This answer uses data.frame syntax. You can use `dt[, bcols, with=FALSE]` there or change your data.table to data.frame first, with `setDF`. – Frank Jan 06 '17 at 02:10
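
Putting the comments together, a minimal data.table-native sketch of this answer (using the same dt as in the question) could be:

bcols <- grep("^b", names(dt))   # indices of the columns whose names start with "b"
# David Arenburg's vectorised comparison, assigned by reference
dt[, t1 := rowSums(.SD == a, na.rm = TRUE) > 0, .SDcols = bcols]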

Here's an attempt where you can use .SDcols to specify the b1:bn columns, and then compare their values to those in the a column:

dt[, Reduce(`+`, lapply(.SD, function(x) x==a & (!is.na(x)) )) > 0, .SDcols=b1:b4]
#[1]  TRUE  TRUE FALSE

Where dt was:

dt <- data.table(a = 1:3, b1 = 4:6, b2 = c(1L, 1L, 1L), b3 = c(9L, 
9L, 9L), b4 = c(NA, 2L, NA), t = c(FALSE, FALSE, FALSE), t1 = c(TRUE, 
TRUE, FALSE))
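
Since the question ultimately wants the result stored in t1, the same expression can also be assigned by reference; a sketch with the same columns:

dt[, t1 := Reduce(`+`, lapply(.SD, function(x) x == a & !is.na(x))) > 0, .SDcols = b1:b4]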
thelatemail
  • Or `dt[, rowSums(.SD == dt$a, na.rm = TRUE) > 0, .SDcols = b1:b4]` :) But I guess it's not so idiomatic – David Arenburg Jan 05 '17 at 22:47
  • @DavidArenburg - I doubt `rowSums` causes any huge slowdown but I always try to stick to `list` like objects in `data.table` if I can. I thought I had a benchmark of this somewhere but I can't find it for the life of me. – thelatemail Jan 05 '17 at 22:56
  • 1
    Instead of `Reduce`, I guess `pmax` fits: `dt[, as.logical(do.call(pmax,c(lapply(.SD, \`==\`, a), na.rm=TRUE))), .SDcols=b1:b4]` – Frank Jan 05 '17 at 23:27
  • @Frank - nice. I like it. – thelatemail Jan 05 '17 at 23:31