
The table reg_data is a PostgreSQL table. It turns out to be faster to run the regressions in PostgreSQL, but because I am running them for hundreds of thousands of data sets, I want to do it data set by data set and append the results of each to a table.

Is there a way to append PostgreSQL data to a PostgreSQL table using native dplyr verbs? I'm not sure there's a huge cost to bringing the data into R and then sending them back to PostgreSQL (it's just six numbers and a couple of identifying fields), but it does seem inelegant.

library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
    reg_data %>%
    summarize(r_squared=regr_r2(y, x),
              num_obs=regr_count(y, x),
              constant=regr_intercept(y, x),
              slope=regr_slope(y, x),
              mean_analyst_fog=regr_avgx(y, x),
              mean_manager_fog=regr_avgy(y, x)) %>%
    collect() %>%
    as.data.frame()

# Push to database.
dbWriteTable(pg$con, c("bgt", "within_call_data"), reg_results,
             append=TRUE, row.names=FALSE)

2 Answers


dplyr does not include commands to insert or update records in a database, so there is not a complete native dplyr solution for this. But you could combine dplyr with regular SQL statements to avoid bringing the data to R.

Let's start by reproducing your steps up to the collect() statement:

library(dplyr)

pg <- src_postgres()

reg_data <- tbl(pg, "reg_data")

reg_results <-
    reg_data %>%
    summarize(r_squared=regr_r2(y, x),
              num_obs=regr_count(y, x),
              constant=regr_intercept(y, x),
              slope=regr_slope(y, x),
              mean_analyst_fog=regr_avgx(y, x),
              mean_manager_fog=regr_avgy(y, x))

Now, you could use compute() instead of collect() to create a temporary table in the database.

temp.table.name <- paste0(sample(letters, 10, replace = TRUE), collapse = "")

reg_results <- reg_results %>% compute(name=temp.table.name)

Here temp.table.name is a random table name. Passing name = temp.table.name to compute() assigns this random name to the temporary table it creates.
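
As an optional sanity check (not required for the workflow), the temporary table should now be queryable under that random name through the same connection:

# The freshly computed temporary table is visible on the same connection
tbl(pg, temp.table.name)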

Now we will use the RPostgreSQL package to create an insert query that uses the results stored in the temporary table. As the temporary table only lives in the connection created by src_postgres(), we need to reuse that connection.

library(RPostgreSQL)
copyconn <- pg$con
class(copyconn) <- "PostgreSQLConnection" # I get an error if I don't fix the class

Finally, the insert query:

sql <- paste0("INSERT INTO destination_table SELECT * FROM ", temp.table.name, ";")

dbSendQuery(copyconn, sql)

So, everything is happening in the database and the data is not brought into R.
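
If this is run repeatedly (for example, once per data set), it may also be worth dropping the temporary table once its rows have been copied, rather than waiting for the connection to close. A minimal sketch, reusing the same connection:

# Clean up the temporary table after its contents have been inserted
dbSendQuery(copyconn, paste0("DROP TABLE ", temp.table.name, ";"))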

EDIT

Previous versions of this post broke encapsulation by extracting the temporary table name from reg_results. This is avoided by using the name = argument in compute().
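
Since the question mentions running this for hundreds of thousands of data sets, the whole pattern can be wrapped in a function and called once per data set. The following is only a sketch under assumptions that are not in the original post: it reuses reg_data and copyconn from above, it presumes reg_data has an identifying column, here called dataset_id (a hypothetical name), that destination_table already exists with matching columns, and that a dplyr version supporting !! (0.7 or later) is in use.

run_one_dataset <- function(id) {
    # Random name for the temporary table holding this data set's results
    temp.table.name <- paste0(sample(letters, 10, replace = TRUE), collapse = "")

    # Summarise a single data set entirely in the database;
    # !!id forces the local value of id into the generated SQL
    reg_data %>%
        filter(dataset_id == !!id) %>%
        summarize(r_squared = regr_r2(y, x),
                  num_obs = regr_count(y, x),
                  constant = regr_intercept(y, x),
                  slope = regr_slope(y, x),
                  mean_analyst_fog = regr_avgx(y, x),
                  mean_manager_fog = regr_avgy(y, x)) %>%
        compute(name = temp.table.name)

    # Append the one-row result to the destination table, then clean up
    dbSendQuery(copyconn, paste0("INSERT INTO destination_table SELECT * FROM ",
                                 temp.table.name, ";"))
    dbSendQuery(copyconn, paste0("DROP TABLE ", temp.table.name, ";"))
}

Looping over the data-set identifiers (with a for loop or purrr::walk()) then keeps all of the heavy lifting in PostgreSQL; only the identifiers themselves need to live in R.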


Another option is to use sql_render() to create each SQL statement, db_save_query() to create the results table from the first statement, and a manual INSERT statement to append the remaining results. To loop through the queries, the purrr commands map() and walk() are used. Ideally a command like compute() would do this, but in lieu of that, the following is a fully reproducible example:

library(dplyr)
library(dbplyr)
library(purrr)
library(DBI)     # needed for dbExecute() and dbDisconnect()

# Setting up a SQLite db with 3 tables
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, filter(mtcars, cyl == 4), "mtcars1")
copy_to(con, filter(mtcars, cyl == 6), "mtcars2")
copy_to(con, filter(mtcars, cyl == 8), "mtcars3")



# Pre-process the SQL statements
tables <- c("mtcars1","mtcars2","mtcars3")
all_results <- tables %>%
  map(~{
    tbl(con, .x) %>%
      summarise(avg_mpg = mean(mpg),
                records = n()) %>%
      sql_render() 
  })

# Execute the SQL statements; the first one creates the table,
# subsequent queries are inserted into that table
first_query <- TRUE
all_results %>%
  walk(~{
    if(first_query == TRUE){
      first_query <<- FALSE
      db_save_query(con, ., "results")
    } else {
      dbExecute(con, build_sql("INSERT INTO results ", ., con = con))
    }
  })


tbl(con, "results")

dbDisconnect(con)
  • It would be nice if `dbWriteTable(pg, "results", ., append = TRUE)` when `.` represents a server-side table or query, simply called either `db_save_query(con, ., "results")` or `dbExecute(con, build_sql("INSERT INTO results ", .))` as applicable. I can see work flows evolving like this: `df %>% collect() %>% some_processing() %>% dbWriteTable()` to this `df %>% some_processing() %>% collect() %>% dbWriteTable()` where it would be nice to omit the `collect()` in the final version. – Ian Gow Apr 05 '21 at 12:14