I have the following two data frames:
library(tidyverse)
library(RSQLite)
df1 <- data.frame(user_id = c("A", "B", "C"),
                  transaction_date = c("2019-01-01", "2019-01-01", "2019-01-01"))
df2 <- data.frame(user_id = c("C", "D", "E"),
                  transaction_date = c("2019-01-03", "2019-01-03", "2019-01-03"))
df1
# user_id transaction_date
# <fct>   <fct>
# A       2019-01-01
# B       2019-01-01
# C       2019-01-01

df2
# user_id transaction_date
# <fct>   <fct>
# C       2019-01-03
# D       2019-01-03
# E       2019-01-03
I would like to find the minimum transaction date for each user_id. I could do it like so:
rbind(df1, df2) %>%
  group_by(user_id) %>%
  summarise(min_dt = min(as.Date(transaction_date)))
# user_id min_dt
# <fct> <date>
# A 2019-01-01
# B 2019-01-01
# C 2019-01-01
# D 2019-01-03
# E 2019-01-03
The problem is that I have hundreds of data frames (one per day), each with millions of rows. The list of user_ids also grows every time I introduce new user_ids and recalculate min_dt, so the whole process gets slower over time. My questions: 1) Would it be faster to run the computation in SQLite? 2) If so, how can I accomplish this without downloading the data locally each time?
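For context, this is roughly the brute-force recompute I'm trying to avoid. The transactions/ folder and file names below are made up, but the shape is accurate: every run re-reads the full history.

files <- list.files("transactions", pattern = "\\.csv$", full.names = TRUE)

first_appeared <- files %>%
  map(read_csv) %>%   # re-reads every daily file, every run
  bind_rows() %>%
  group_by(user_id) %>%
  summarise(min_dt = min(as.Date(transaction_date)))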
Here's what I've tried.
Step 1: Create database from df1:
db <- dbConnect(SQLite(), dbname = "user_db.sqlite")
dbWriteTable(conn = db, name = "first_appeared", value = df1, append = TRUE)
tbl(db, "first_appeared")
# Source:   table<first_appeared> [?? x 2]
# Database: sqlite 3.29.0 [user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
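Since everything here groups on user_id, I'm assuming an index on that column will matter once the table holds millions of rows; DBI's dbExecute can create one (the index name idx_user_id is my own):

# Assumed optimization: index user_id so grouped MIN() queries scan less data.
dbExecute(db, "CREATE INDEX IF NOT EXISTS idx_user_id ON first_appeared (user_id)")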
Step 2: Append df2:
dbWriteTable(conn = db, name = "first_appeared", value = df2, append = TRUE)
tbl(db, "first_appeared")
# Source:   table<first_appeared> [?? x 2]
# Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
#   mapper/user_db.sqlite]
#   user_id transaction_date
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
# 4 C       2019-01-03
# 5 D       2019-01-03
# 6 E       2019-01-03
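In the real pipeline this append runs once per day. A sketch of the loop I have in mind, with daily_frames standing in for however each day's data actually arrives:

daily_frames <- list(df1, df2)  # placeholder: one data frame per day
for (day_df in daily_frames) {
  dbWriteTable(conn = db, name = "first_appeared", value = day_df, append = TRUE)
}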
Step 3: Calculate min_dt in SQLite:
tbl(db, "first_appeared") %>%
  group_by(user_id) %>%
  summarise(first_appeared = min(transaction_date))
# Source:   lazy query [?? x 2]
# Database: sqlite 3.29.0 [/Volumes/GoogleDrive/My Drive/Ad hoc/201908 v2
#   mapper/user_db.sqlite]
#   user_id first_appeared
#   <chr>   <chr>
# 1 A       2019-01-01
# 2 B       2019-01-01
# 3 C       2019-01-01
# 4 D       2019-01-03
# 5 E       2019-01-03

dbDisconnect(db)  # Close connection
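For reference, piping the same lazy tbl into show_query() (before closing the connection) prints the SQL dbplyr generates, which is how I convinced myself the aggregation runs inside SQLite rather than in R:

# Run before dbDisconnect(): inspect the SQL dbplyr sends to SQLite.
tbl(db, "first_appeared") %>%
  group_by(user_id) %>%
  summarise(first_appeared = min(transaction_date)) %>%
  show_query()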
Step 4: How do I transfer these results directly back into the database (overwriting the existing data) without first downloading them locally?
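The closest thing I've found is dplyr's compute(), which as I understand it materializes a lazy query as a table inside the database (via CREATE TABLE ... AS SELECT), so no rows travel to R. A tentative sketch; the table name first_appeared_min is made up:

# Tentative: write the summary back into SQLite without collecting it into R.
tbl(db, "first_appeared") %>%
  group_by(user_id) %>%
  summarise(first_appeared = min(transaction_date)) %>%
  compute(name = "first_appeared_min", temporary = FALSE)

Is that the right approach, and is there a clean way to have the result replace first_appeared itself?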