Selecting Data Using Conditions Stored in a Variable

Question

Pretend I have this table on a server:

library(dplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

iris$id = 1:nrow(iris)
dbWriteTable(con, "iris", iris)

I want to select some some random rows from this dataset - suppose I create an R variable that contains the random rows that I want to select:

rows_to_select = sample.int(10, 5, replace = TRUE)
[1] 1 1 8 8 7

I then tried to select these rows from my table - but this "rows_to_select" variable is not being recognized for some reason:

DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (rows_to_select) limit 100;")

Error: no such column: rows_to_select

This code works fine if I manually specify which rows I want (e.g. I want the first row, and the fifth row selected twice):

#works - but does not return the 5th row twice
DBI::dbGetQuery(con, "select a.* from (select *, row_number() over (order by id) as rnum from iris)a where a.rnum in (1,5,5) limit 100;")

Does anyone know how to fix this?

Thank you!

r2evans · Accepted Answer · 2022-06-22T18:41:42.960

In general, just including rows_to_select in a query is not going to know to reach out of the SQLite environment and "invade" the R environment (completely different!) and look for a variable. (For that matter, why doesn't select a.* ... find dplyr::select?) This is the case both for pragmatic reasons and security (though mostly pragmatic).

You may want to consider parameterized queries vice constructing query strings manually. In addition to security concerns about malicious SQL injection (e.g., XKCD's Exploits of a Mom aka "Little Bobby Tables"), it is also a concern for malformed strings or Unicode-vs-ANSI mistakes, even if it's one data analyst running the query. DBI supports parameterized queries.

Long story short, try this:

set.seed(42)
rows_to_select = sample.int(10, 5, replace = TRUE)
rows_to_select
# [1]  1  5  1  9 10
qmarks <- paste(rep("?", length(rows_to_select)), collapse = ",")
qmarks
# [1] "?,?,?,?,?"

DBI::dbGetQuery(con, paste(
  "select a.*
   from (select *, row_number() over (order by id) as rnum from iris) a
   where a.rnum in (", qmarks, ") limit 100;"),
  params = as.list(rows_to_select))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id rnum
# 1          5.1         3.5          1.4         0.2  setosa  1    1
# 2          5.0         3.6          1.4         0.2  setosa  5    5
# 3          4.4         2.9          1.4         0.2  setosa  9    9
# 4          4.9         3.1          1.5         0.1  setosa 10   10

In this case it is rather trivial, but if you have a more complicated query where you use question marks ("bindings") at different places in the query, the order must align perfectly with the elements of the list assigned to the params= argument of dbGetQuery.

Alternative: insert a temp table with your candidate values, then left-join against it.

dbWriteTable(con, "mytemp", data.frame(rnum = rows_to_select), temporary = TRUE)

DBI::dbGetQuery(con, 
  "select i.* from mytemp m left join iris i on i.id=m.rnum")
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species id
# 1          5.1         3.5          1.4         0.2  setosa  1
# 2          5.0         3.6          1.4         0.2  setosa  5
# 3          5.1         3.5          1.4         0.2  setosa  1
# 4          4.4         2.9          1.4         0.2  setosa  9
# 5          4.9         3.1          1.5         0.1  setosa 10

DBI::dbExecute(con, "drop table mytemp")
# [1] 0

@ r2evans: thank you so much for this answer! This is a very clever use of the question mark! — stats_noob, Jun 22 '22 at 18:32
A question - in your example, "rows_to_select" had the number "1" appear twice - but in the final result, there is only one row with "rnum = 1". Is there a way to make sure that the final result as two rows with "rnum = 1"? Thank you so much! — stats_noob, Jun 22 '22 at 18:34
That's how set membership works. Logically: *"is `1` in `1,1,1,1,1`"* returns a single `true`. I know you're trying to do sampling with replacement, you won't get that using `where a.rnum in (...)`, whether you use `?`s or not. — r2evans, Jun 22 '22 at 18:35
see my edit, it mimics sampling with replacement, albeit not the most efficiently — r2evans, Jun 22 '22 at 18:43
@ r2evans: thank you so much for all your help! if you have time - can you please check out this question? https://stackoverflow.com/questions/72725108/defining-cte-common-table-expressions-in-r it's about defining CTE's in R. Thank you so much! — stats_noob, Jun 24 '22 at 02:21

Selecting Data Using Conditions Stored in a Variable

1 Answers1

Linked