15

I am using a vector of column names to select a subset of columns of a data.table. I had the idea if it's possible to basically define conditions in i which are then applied to all the selected columns. For example using the mtcars dataset. I would like to select the columns cylinder and gear and then would like to filter on all cars which have four cylinders and four gears. Of course I would also need to define if it is and or or for the filter, but I am just interested if the idea can be applied somehow in the data.table context.

# working code
sel.col <- c("cyl", "gear")
dt <- data.table(mtcars[1:4,])

dt[, ..sel.col]
dt[cyl == 4 & gear == 4, ..sel.col]    


# Non-working code
dt[ sel.col == 4 , ..sel.col]
hannes101
  • 2,410
  • 1
  • 17
  • 40
  • Somewhat related question: [Usage of multiple keys in data.table to obtain a conditioned search](https://stackoverflow.com/q/48573318/3817004). – Uwe Feb 06 '18 at 13:46

2 Answers2

14

We could use get

sel.col <- "cyl"
dt[get(sel.col) == 4, ..sel.col]
#    cyl gear
#1:   4    4

or eval(as.name)

dt[eval(as.name(sel.col)) == 4, ..sel.col]
#    cyl gear
#1:   4    4

Earlier, we thought that there is only a single column to be evaluated. If we have more than one column, specify it in the .SDcols, loop through the Subset of Data.table (.SD) compare it with the value of interest ('4'), Reduce it to logical vector with | i.e. any TRUE in each of the rows and subset the rows based on this

dt[dt[, Reduce(`|`, lapply(.SD, `==`, 4)),.SDcols = sel.col], ..sel.col]
hannes101
  • 2,410
  • 1
  • 17
  • 40
akrun
  • 874,273
  • 37
  • 540
  • 662
  • But there's no way to change that to an `or` condition, right? – hannes101 Feb 06 '18 at 13:03
  • @hannes101 I am not getting your query. – akrun Feb 06 '18 at 13:04
  • At the moment all columns need to have the same values, the other possibility would be that only one of the conditions need to be fulfilled, for example only four gears or four cylinders. Then the result would have more possible combinations. – hannes101 Feb 06 '18 at 13:06
  • Yeah, seems to work. Although it's pretty difficult to read and I would probably struggle to grasp the stuff again, when looking at the code in a couple of months time. – hannes101 Feb 06 '18 at 15:19
  • 1
    @hannes101 If there are multiple elements in the vector, it is better to go through this route. Or else you can do `dt[get(sel.col[1]) == 4 | get(sel.col[2])==4, ..sel.col]` which will become an exercise when the length of 'sel.col' increases – akrun Feb 06 '18 at 15:21
  • 1
    I think it is a bit confusing the you kept the first part of the answer that is only about using `get`. While only the second part of the answer replies to the original problem, I would suggest removeing the first part – Max M Nov 07 '18 at 15:27
0

I know this is a rather old thread but I think the community would still benefit from a simpler solution.

We can use data.table join to deal with the case of multiple columns

# create test data
sel.col <- c("cyl", "gear")
dt <- data.table(mtcars)

# create key / index
mykey <- setNames(list(4,4), sel.col)

# select rows with values specified in mykey
dt <- dt[mykey, on = sel.col]
dt[, ..sel.col]

Or in one line:

dt[setNames(list(4,4), sel.col), ..sel.col, on = sel.col]

The subset conditions can be further generalized using non-equi join, eg cyl >= 4. This method has been described here.

abcoxyzide
  • 11
  • 2