8

I'm doing some analysis on a large volume of data stored in a PostgreSQL database. For speed and memory reasons I'm using the data.table package. Currently I'm doing this to read the data.

library(RPostgreSQL)
library(data.table)
...
query <- "SELECT * FROM eqtl"
data <- as.data.table(dbGetQuery(con, query))

I'm wondering if there is a better way to do this that doesn't involve reading the whole thing into a data.frame and then copying it into a data.table.

rmccloskey
  • 482
  • 5
  • 14
  • 6
    Perhaps not an answer you're looking for (?), but if you've already read it as a `data.frame`, you can use `setDT()` to convert it to `data.table` *by reference* (the conversion should be instant from data.frame to data.table) - i.e., `data <- dbGetQuery(...); setDT(data)`. – Arun Feb 06 '15 at 21:43
  • @Arun thanks, this is probably what I'll end up doing, although I would have liked to remove `data.frame`s from the equation altogether. – rmccloskey Feb 06 '15 at 22:12

1 Answers1

2

As Arun pointed out in the comment you can just use setDT on dbGetQuery results.

Additionally there is a helper function available in my package dwtools which extends this feature for auto setkey when needed. This was designed to be useful when chaining. It also unifies the interface to other database vendors so you can chain data.table using different databases.
The simple select usage will looks like:

my_dt = db("SELECT * FROM eqtl")
# to setkey use
db("SELECT * FROM eqtl", key="mykeycol")

Heavily extended example from package manual:

jj_aggr = quote(list(amount=sum(amount), value=sum(value)))
r <- db("sales",key="geog_code" # read fact table from db
        )[,eval(jj_aggr),keyby=c("geog_code","time_code") # aggr by geog_code and time_code
          ][,db(.SD) # write to db, auto.table.name
            ][,db("geography",key="geog_code" # read lookup geography dim from db
                  )[.SD # left join geography
                    ][,eval(jj_aggr), keyby=c("time_code","geog_region_name")] # aggr
              ][,db(.SD) # write to db, auto.table.name
                ][,db("time",key="time_code" # read lookup time dim from db
                      )[.SD # left join time
                        ][, eval(jj_aggr), keyby=c("geog_region_name","time_month_code","time_month_name")] # aggr
                  ][,db(.SD) # write to db, auto.table.name
                    ]

It would read data from multiple databases, joins, aggregates, save intermediate results to multiple databases.

jangorecki
  • 16,384
  • 4
  • 79
  • 160