4

I have a dataframe in R which contains the output of previous queries. Unfortunately, I cannot do this directly in SQL since it is too slow so I am using the data.table package. The output from the data.table package is a data frame of 50,000 ids. I need to pull all records from the database for each id.

# x is a dataframe containing 50,000 ids. 

Usually, I would do something like,

dbGetQuery(con, "Select * from data where id in x") 

but that won't work. An alternative is to do 50,000 queries in a for loop, but I am thinking that there must be a more efficient method to do this.

What is the most efficient way to do this?

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • I don't know what MySQL's limit on the number of items in an `IN` clause are, but suspect (??) that it's pretty large. Are you sure you can't just put all or most of them in one `IN` clause? (Another option of course is to push the ids to a temporary table in the db and do a join.) – joran Oct 28 '15 at 21:42
  • what do you mean `but that wont work` ? – Drew Oct 28 '15 at 21:44
  • how about if you step back, describe the tables, and what you want to achieve, so we don't go down the [XY Problem](http://meta.stackexchange.com/questions/66377) path – Drew Oct 28 '15 at 21:46
  • dbGetQuery(con, "select * from data where order_id in x"). I get the following error: Error in .local(conn, statement, ...) : could not run statement: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'x' at line 1 – quant actuary Oct 28 '15 at 21:47
  • You mean something like `subset(data, id == x`)? – ChrKoenig Oct 28 '15 at 21:48
  • ok, it is a **syntax error**. how about select * from data where order_id in (select order_id from camels where humps=2). Or a join – Drew Oct 28 '15 at 21:48
  • For example, x = c(0,1,2,3) – quant actuary Oct 28 '15 at 21:48
  • 2
    In order to do `IN` clause queries like this from R, you'll need to explicitly build the query via string concatenation using `paste()` or `sprintf()`. – joran Oct 28 '15 at 21:51
  • joran, can you provide an example? I would be using paste, but running that 50,000 times, once for each id. The main question would be how to pull data for all ids in x at the same time. – quant actuary Oct 28 '15 at 21:54
  • schema published would be helpful. Else telepathy – Drew Oct 28 '15 at 21:56

1 Answers1

8

For example,

x <- 0:3
> q <- "select * from table where id in (%s)"
> sprintf(q,paste(x,collapse = ","))
[1] "select * from table where id in (0,1,2,3)"

As I mentioned in my comment, some databases have limits on the number of items you can put in the IN clause. I'm not familiar enough with MySQL to know what that is, but I'd be willing to bet it's large enough that you could do this in only a handful of queries.

And in many cases this will be less efficient (slower) than having the IDs in a table in the database and doing a join, but sometimes people don't have the access to the database required to accomplish that.

joran
  • 169,992
  • 32
  • 429
  • 468
  • @quantactuary Out of curiosity, did it accept all 50k in one IN clause, or did you have to split it up? – joran Oct 28 '15 at 23:53
  • Good question - it ran all at once, literally in seconds. – quant actuary Oct 29 '15 at 21:47
  • @joran How can the same thing be done with string type values rather than numeric? – Vijay Barve Sep 18 '17 at 14:51
  • 1
    @VijayBarve e.g. `paste(paste0("'",letters[1:3],"'"),collapse = ",")` just to add the single quotes; again with the usual warnings that this sort of thing should only be done if your db setup is such that SQL injection is not much of a concern. – joran Sep 18 '17 at 15:04