
Please have a look at the simple script at the end of the post. I have a database containing two tables, which I combine with `union_all()`. Is there a way to add the result back to the database without collecting the data, i.e. without loading it into memory? Many thanks!




library(tidyverse)
library(DBI) # main DB interface
library(dbplyr) # dplyr back-end for DBs
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(RSQLite)


## create the example tables

df1 <- tibble(x=1:20,y=rep(c("a", "b"), 10))


df2 <- tibble(x=101:120,y=rep(c("d", "e"), 10))




con <- dbConnect(drv=RSQLite::SQLite(), dbname="db.sqlite")

dbWriteTable(con, "mydata1", df1, overwrite = TRUE)
dbWriteTable(con, "mydata2", df2, overwrite = TRUE)

dbDisconnect(con) # closes our DB connection


con <- dbConnect(drv=RSQLite::SQLite(), dbname="db.sqlite")

mydb1 <- tbl(con, "mydata1")
mydb2 <- tbl(con, "mydata2")


mydb12 <- union_all(mydb1,mydb2)

#is there a way to add the union of mydb1 and mydb2 to the database without explicitly collecting the data?

Created on 2020-12-24 by the reprex package (v0.3.0)

1 Answer


Since you're dealing with SQL, just use SQL.

collect(mydb1) %>%
  nrow()
# [1] 20
DBI::dbExecute(con, "insert into mydata1 select * from mydata2")
# [1] 20
collect(mydb1) %>%
  nrow()
# [1] 40
collect(mydb1) %>%
  tail()
# # A tibble: 6 x 2
#       x y    
#   <int> <chr>
# 1   115 d    
# 2   116 e    
# 3   117 d    
# 4   118 e    
# 5   119 d    
# 6   120 e    

If you want the combined data in a new table, then here's an alternative.

DBI::dbExecute(con, "
  create table mydata12 as
    select * from mydata2 union all select * from mydata1")
  • Thanks. About your suggestion to create a new table with the combined data (it is what I am really after): does it require to load the data in memory? I will be working with large tables and I absolutely need to avoid that. I believe I should be safe since it is sql that we are talking about, but can you confirm that? – larry77 Dec 24 '20 at 22:54
  • It does not load the data into R; the only data returned to R on the first `dbExecute` command is `20`, an integer, the number of rows affected by that command; in the second (create) call, it returns an integer again (though not the number of rows returned). Neither returns a frame, neither loads the combined dataset into R. – r2evans Dec 25 '20 at 01:17
  • I can't help but wonder, are you really dealing with "large tables" in SQLite? While I know it can deal with large data, there may be a point where it would make sense for you to work with more of a clustered DBMS. – r2evans Dec 25 '20 at 01:19
  • Well, maybe gigabytes of data is not big data these days, but it is for my seasoned workstation. – larry77 Dec 25 '20 at 08:34
  • The discussion of DBMS/size aside, does this work for you? – r2evans Dec 25 '20 at 14:36
  • Thanks! I was hoping for a pure dbplyr solution, but yours works and reminds me it is time to learn some sql for real. – larry77 Dec 25 '20 at 20:43
  • Yes, I understood that, but unfortunately I don't know if/how to do that in `dbplyr`-native verbs. – r2evans Dec 25 '20 at 21:01
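
For later readers hoping for the dbplyr-native route discussed above: `dplyr::compute()` appears to do exactly this, materializing a lazy query as a table on the database side without ever pulling the rows into R. A minimal self-contained sketch (in-memory SQLite; the table name `mydata12` is just an example):

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mydata1", data.frame(x = 1:20,    y = rep(c("a", "b"), 10)))
dbWriteTable(con, "mydata2", data.frame(x = 101:120, y = rep(c("d", "e"), 10)))

# Build the lazy union -- no data is pulled into R at this point
mydb12 <- union_all(tbl(con, "mydata1"), tbl(con, "mydata2"))

# Materialize it as a permanent table inside the database;
# compute() sends the equivalent of CREATE TABLE ... AS to the server
mydb12 %>% compute(name = "mydata12", temporary = FALSE)

dbListTables(con)  # "mydata12" now exists alongside the source tables
```

As with the `dbExecute()` approach in the answer, the only things returned to R are metadata; the combined rows stay in the database.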