How can I subset a data.table using a variable when the variable's name is identical to an existing column name in the data.table? It works with get("varname", pos = 1), but is there a more robust/flexible solution?

library(data.table)

my_data_frame <- data.frame(
"V1"=c("A","B","C","A"),
"V2"=c(1, 2, 3, 4),
stringsAsFactors = FALSE        
)

V1 <- "A"

my_data_table <- as.data.table(my_data_frame)

# Can I improve this a bit? I want rows where V1 == "A", but use V1 in the statement 
my_data_table[ my_data_table$V1 == get("V1", pos = 1), ]

Renaming V1 is not an option.

UPDATE: I do not consider this a 100% duplicate. The accepted answer for this question is not acceptable for my question, since it uses explicit get which I do not want to use, as stated in the comments.

nilsole
  • I don't get it, what's wrong with `my_data_table[,"V1"=="A"]` or `my_data_table[,"V1"==V1]`? – user2974951 Sep 24 '18 at 08:49
  • @user2974951 Thanks, but your solutions do not return the desired result, since you do not use data.table syntax. The desired result has two rows. – nilsole Sep 24 '18 at 08:53
  • I do not want to state the level of environment (pos = 1) explicitly as done in the example. Instead, I would like to make R look for an outer object called "V1" rather than using V1 as the column name. The above code works, but will not necessarily work when I copy the code into a different scope. – nilsole Sep 24 '18 at 08:56
  • Perhaps a bit unorthodox to do row subsetting in `j`, but then we can use the ['dot dot notation'](https://github.com/Rdatatable/data.table/blob/master/NEWS.md#changes-in-v1102--on-cran-31-jan-2017): `d[ , d[V1 == ..V1]]` – Henrik Sep 24 '18 at 09:22
  • Another option is to specify the environment: `my_data_table[V1 == get("V1", envir = .GlobalEnv)]` – Jaap Sep 24 '18 at 09:24
  • @Henrik Tried your example, but it gives me `Error in eval (expr, envir, enclos): object '..V1' not found` – nilsole Sep 24 '18 at 09:27
  • It works here (I just used the shorter "d" as name of the data set). Do you have `data.table` version >= v1.10.2? – Henrik Sep 24 '18 at 09:34
  • @Henrik Correct, very good "data.table-only" answer. – nilsole Sep 24 '18 at 09:40
  • Another alternative, similar to Henrik's: `d[d[, .I[V1 == ..V1]]]` – chinsoon12 Sep 24 '18 at 10:14
  • related: https://stackoverflow.com/q/32738499/4137985 – Cath Sep 24 '18 at 11:50
  • Possible duplicate of [data.table := assignments when variable has same name as a column](https://stackoverflow.com/questions/32738499/data-table-assignments-when-variable-has-same-name-as-a-column) – h3rm4n Sep 24 '18 at 12:28

3 Answers

Here is a solution using library(tidyverse):

library(data.table)
library(tidyverse)
my_data_frame <- data.frame(
  "V1"=c("A","B","C","A"),
  "V2"=c(1, 2, 3, 4),
  stringsAsFactors = FALSE        
)

V1 = "A"
my_data_table <- as.data.table(my_data_frame)
df = my_data_table %>% filter(V1 == !!get("V1")) # !! evaluates get("V1") in the calling environment, so pos = 1 is not needed

If you want to make R use the object named "V1", you can do this:

V1 = "A"
list_test = split(my_data_table, as.factor(my_data_table$V1)) #create a list for each factor level of the column V1.
df = list_test[[V1]] #extract the desired dataframe from the list using the object "V1"

Is this what you want?
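On the name collision itself, here is a minimal sketch of two ways to make `filter` compare the column `V1` against the outer variable `V1` without calling `get`. It assumes a dplyr/rlang version recent enough to provide the `.env` pronoun; the result names `res_env` and `res_bang` are only illustrative.

```r
library(data.table)
library(dplyr)

my_data_table <- data.table(
  V1 = c("A", "B", "C", "A"),
  V2 = c(1, 2, 3, 4)
)
V1 <- "A"

# The bare V1 on the left is the column; .env$V1 is looked up in the
# calling environment, not in the data (assumes .env is available).
res_env <- my_data_table %>% filter(V1 == .env$V1)

# Alternative: !! evaluates V1 in the calling environment before
# filter() looks at the columns.
res_bang <- my_data_table %>% filter(V1 == !!V1)
```

Both forms resolve the variable relative to wherever the code runs, so they should keep working when the statement is moved into a function.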

Axeman
Paul
  • Everything is correct here, but perhaps you could give the tidyverse solution for the name collision of `V1`, as that seems to be the gist of the problem here. – Axeman Sep 24 '18 at 12:32
  • Thanks for your suggestion. I am not sure I fully understand how the tidyverse can help deal with the name collision. `dplyr::filter` only uses the data frame column `V1`. It avoids the name collision by removing the need for the object `V1`. But I thought this object was needed, so I wrote the second solution with `base::split`, which allows using the object `V1` and the data frame column `V1` without ambiguity. Plus, the list_test object is very nice to work with using lapply() and custom functions. – Paul Sep 24 '18 at 12:56
  • In other words, can you write the `filter` statement such that it is guaranteed to use the global variable instead of the column name? (Like the data.table `..` notation.) For the general case? I know one can write `.data$` for the opposite, but I'm not sure how to force the scope to be outside of the data.frame. – Axeman Sep 24 '18 at 13:00
  • I found this: `V1 = "A"; my_data_table <- as.data.table(my_data_frame); df = my_data_table %>% filter(V1 == !!get("V1"))`, inspired by this post: https://stackoverflow.com/questions/34219912/how-to-use-a-variable-in-dplyrfilter – Paul Sep 24 '18 at 13:20

If you don't mind doing it in two steps, you can build the logical index outside the scope of your data.table, where V1 refers to the variable rather than the column (though that's usually not what you want to do when working with ...):

wh_v1 <- my_data_table[, V1]==V1
my_data_table[wh_v1]
#   V1 V2
#1:  A  1
#2:  A  4
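Because the comparison happens in ordinary R code rather than inside `[.data.table`, the same two-step idea carries over to other scopes. A minimal sketch, where `filter_v1` is a hypothetical helper name and its argument is deliberately called `V1` to reproduce the collision:

```r
library(data.table)

my_data_table <- data.table(
  V1 = c("A", "B", "C", "A"),
  V2 = c(1, 2, 3, 4)
)

# Hypothetical helper: the argument V1 collides with the column name on purpose.
filter_v1 <- function(dt, V1) {
  # Evaluated outside dt[...]: dt[["V1"]] is the column,
  # the bare V1 is the function argument.
  keep <- dt[["V1"]] == V1
  # `keep` is a plain logical vector, so no column lookup is involved here.
  dt[keep]
}

res <- filter_v1(my_data_table, "A")
```

The key design point is that the comparison must not be written directly inside `dt[...]`, where a bare `V1` would resolve to the column.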
Cath

For equality conditions, you can use a join:

mDT = data.table(V1)
my_data_table[mDT, on=.(V1), nomatch=0]
#    V1 V2
# 1:  A  1
# 2:  A  4

Implicitly, the join condition in x[i, on=.(V1)] is

V1 == V1

where the LHS comes from x and the RHS from i. It works like a lookup of each row of i in x. Setting nomatch=0 drops any value found in i but not in x from the output. For example:

mDT2 = data.table(V1 = c("A", "D"))
my_data_table[mDT2, on=.(V1)]
#    V1 V2
# 1:  A  1
# 2:  A  4
# 3:  D NA

my_data_table[mDT2, on=.(V1), nomatch=0]
#    V1 V2
# 1:  A  1
# 2:  A  4
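To make the LHS/RHS roles explicit, here is a small sketch in which the lookup table's column gets a different name, so the join condition no longer reads as V1 == V1. The names `lookup` and `wanted` are illustrative assumptions, not part of the original example.

```r
library(data.table)

my_data_table <- data.table(
  V1 = c("A", "B", "C", "A"),
  V2 = c(1, 2, 3, 4)
)
V1 <- "A"

# Built outside my_data_table[...], so V1 here is the variable, not the column.
lookup <- data.table(wanted = V1)

# on = .(V1 = wanted): V1 is taken from my_data_table (x), wanted from lookup (i).
res <- my_data_table[lookup, on = .(V1 = wanted), nomatch = 0]
```

Naming the lookup column differently is optional, but it may make the intent of the join easier to read when the variable and the column share a name.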
Frank