1

I've been working on this for hours and can't find a solution that works. For simplicity let us say the setup is that we have a DataFrame in R, let's call it df, with a single column of values and let's say it has values 1,2,3,4 and 5.

I want to embed the following query in R:

 SELECT DISTINCT column1,
                 column 2 
 FROM database 
 WHERE value IN (1,2,3,4,5)

If I embed the query in R via the following

df5 <- dbGetQuery(db,paste0("

SELECT DISTINCT column1,
                 column 2 
 FROM database 
 WHERE value IN (1,2,3,4,5)"))

then my query works, but I want to reference this DataFrame which I am pulling in from an Excel file. The natural thing to do would be to convert it to a list

val_list=list(df$'values')

and do the following

df5 <- dbGetQuery(db,paste0("

SELECT DISTINCT column1,
                 column 2 
 FROM database 
 WHERE value IN '",vals,"))

This is, however, not working. How can I get it to work as I want?

PDS
  • 61
  • 5

2 Answers2

1

One should never interpolate data directly into the query, lest accidental sql-injection (or query-poisoning) occurs. It's better to use bound parameters or similar, see https://db.rstudio.com/best-practices/run-queries-safely/.

For this code, assuming you have a set of values you want to check the IN set membership:

#?# vec <- df$values
qmarks <- paste(rep("?", length(vec)), collapse = ",")
df5 <- dbGetQuery(db, paste("
    SELECT DISTINCT column1, column 2 
    FROM database 
    WHERE value IN (", qmarks, ")"),
  params = as.list(vec))
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • The advice not to use string interpolation is for situations interfacing to publicly facing sites only. In typical R applications which are more internal computational this advice does not apply. – G. Grothendieck Jan 04 '22 at 03:03
  • 1
    I disagree. The rationale may be for more practical reasons initially, but the end result is the same. Further, "internal" may feel safe enough until a quick hack turns into a small shiny app shared with a coworker that inadvertently or eventually migrates to "public" (or public-enough). I've seen it happen several times, and when this type of bug comes up, it carries some of the same risks. Doing parameter binding is not hard, why not teach it from day one? – r2evans Jan 04 '22 at 09:34
  • IMO in *some* cases of *adHoc* `R` scripts the parameter concatenation may be considered *tolerable* but is *must* be always considered as a **potential security and performance** thread. BTW (in `RJDBC`) I had to use `do.call` to pass mutiple bind parameters, e.g. `do.call(dbSendUpdate, c(jdbcConnection, sql, df$col1, df$col2) ) ` – Marmite Bomber Jan 17 '22 at 17:19
  • @MarmiteBomber, thanks for the comment. I don't know when "performance" (speed of execution) would be the prime motivator, but I won't argue that inefficient query-formation could easily slow down the query. As for the required use of `do.call`, I don't see how it would be required using the method in my answer here. If you see problems using bound parameters (even with `RJDBC`, though I admit to not using `JDBC`), I'd be curious to know. – r2evans Jan 17 '22 at 17:23
  • 1
    You are welcome. The performance problem is basically in the need of parsing on each execution. The concatenated parameters disable the reuse of parsed cursor (in Oracle). The problem with `d.call` - I got `missing IN- or OUT-parameter at index:: 2` while running `dbSendUpdate(conn, "update test set val = '0' where col1 in (?,?)", params = as.list(c(1,2)) )` your syntax works for me with only one bind paramater – Marmite Bomber Jan 17 '22 at 17:54
  • BTW using `list` instead of `params` worked for me: `, list = as.list(c(1,2) )` – Marmite Bomber Jan 18 '22 at 11:53
0

Suppose we wish to insert the Time column from the built in data frame BOD into an sql query.

sprintf("select * from X where Y in (%s)", toString(BOD$Time))
## [1] "select * from X where Y in (1, 2, 3, 4, 5, 7)"
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341