
I have several (8) large files (1M rows each) with the same variables/format, saved individually by year. I would like to save them to a single table using the duckdb database format in R. The duckdb_read_csv() command does this nicely.

The problem: there is no variable indicating "year" using this method, so the trend for repeated measurements is lost. My DBI and SQL knowledge is limited, so there's probably an easy way to do this, but I'm stuck here (simple example demonstrating the issue below):

library(duckdb)
con <- dbConnect(duckdb())

# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path <- tempfile(fileext = ".csv")
write.csv(data, path, row.names = FALSE)

# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)

duckdb_read_csv(con, "data", files = c(path, path2)) # data is appended

which yields the following: data is appended by row, but I need a variable to indicate "year":

dbReadTable(con, "data")
  id b
1  1 a
2  2 b
3  3 c
4  1 d
5  2 e
6  3 f

Is there a way to create a new variable during this process, or is it better to create a separate table for each year and then combine them?

Matt L.

2 Answers


I modified Jon's answer to solve my issue (thanks Jon). The files were too large to read into memory all at once (~20M rows each), so I iterated over them one at a time to transfer them into the database (~198M rows total).

I also extracted the "year" from each filename with gsub into a separate vector yrs.
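A minimal sketch of that filename step (the directory, file-name pattern, and regex here are hypothetical; in the self-contained example below, paths and yrs are simply hard-coded):

# hypothetical: yearly files named like "yearly_data_2015.csv" in a folder "csv_dir"
paths <- list.files("csv_dir", pattern = "\\.csv$", full.names = TRUE)
# pull the four-digit year out of each file name, e.g. "yearly_data_2015.csv" -> "2015"
yrs <- gsub(".*_(\\d{4})\\.csv$", "\\1", basename(paths))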

library(dplyr)
library(duckdb)
con <- dbConnect(duckdb())

# example year 1 data
data <- data.frame(id = 1:3, b = letters[1:3]) # year = 1
path1 <- tempfile(fileext = ".csv")
write.csv(data, path1, row.names = FALSE)

# example year 2 data
data2 <- data.frame(id = 1:3, b = letters[4:6]) # year = 2
path2 <- tempfile(fileext = ".csv")
write.csv(data2, path2, row.names = FALSE)

paths <- c(path1, path2)
yrs <- c("year1", "year2")

purrr::walk2(paths, yrs,
      ~{
        readr::read_csv(.x) %>%     # read one file into memory at a time
          mutate(year = .y) %>%
          rename_all(tolower) %>%   # I really prefer lower-case variable names
          duckdb::dbWriteTable(conn = con, name = "data", value = ., append = TRUE)
      })
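
To confirm the append worked, a quick sanity check (just an illustration, run against the toy data above) counts the rows loaded per year:

dbGetQuery(con, "SELECT year, COUNT(*) AS n FROM data GROUP BY year ORDER BY year")
#    year n
# 1 year1 3
# 2 year2 3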


Matt L.
  • I have submitted a feature request that may relate to this: https://github.com/duckdb/duckdb/discussions/7561 I think passing `id = "file"` to my proposed function (and in turn to `readr::read_delim`) might allow you to get what you need. – Ian Gow May 17 '23 at 17:29

One approach could be to use purrr::map_dfr + readr::read_csv for the reading, which lets you assign an "id" column based on the names given to the file paths, and then register the result as a duckdb table:

library(dplyr)
purrr::map_dfr(c(year01 = path, 
                 year02 = path2),
               ~readr::read_csv(.x), 
               .id = "year") %>%
  duckdb_register(con, "data", .)

Result

tbl(con, "data") %>%
  collect()
# A tibble: 6 × 3
  year      id b    
  <chr>  <dbl> <chr>
1 year01     1 a    
2 year01     2 b    
3 year01     3 c    
4 year02     1 d    
5 year02     2 e    
6 year02     3 f 
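
As noted in the comments below, duckdb_register() only registers the in-memory data frame as a virtual table; it does not copy the rows into the database itself. A minimal variation that should persist the combined data (assuming the same con, path, and path2 as above) swaps the last step for dbWriteTable():

purrr::map_dfr(c(year01 = path,
                 year02 = path2),
               ~readr::read_csv(.x),
               .id = "year") %>%
  DBI::dbWriteTable(con, "data", ., overwrite = TRUE)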
Jon Spring
  • If I understand the `duckdb_read_csv` function correctly, it writes directly to disk, which may be important for my real data, which is much larger. I'm concerned that introducing the `readr` step (which is what I typically use) may be too slow... I'll give it a try though! Also, it looks like this solution doesn't save to disk, it just creates a virtual table. – Matt L. Sep 07 '22 at 05:28
  • I'm not sure. Potentially another path could be to add an intermediate step where you augment each CSV with a column denoting its source year, and then use your original process (see the sketch after these comments). – Jon Spring Sep 07 '22 at 16:27
  • Thanks, the `readr` method you suggest works with the data size I'm using (8 files with ~1M rows each), but I'm now running into an unrelated issue and have filed a `duckdb` issue on GitHub (it will not write for nrow > ~150k when using `duckdb::dbWriteTable()` instead of `duckdb_register()` in order to write to disk instead of a virtual/in-memory table). – Matt L. Sep 07 '22 at 16:58
  • The error was due to unrecognized Latin characters/Unicode in the data file. I was able to load it using a modified version of your answer, mapping over large files (>20M rows each). Will post that code for others. – Matt L. Sep 16 '22 at 16:57
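
A rough sketch of the intermediate-step idea from the comment above, assuming paths, yrs, and con as defined in the first answer (the table name "data_with_year" is made up for illustration):

# read each CSV, tag it with its year, write an augmented copy to a temp file,
# then load the augmented copies with the original duckdb_read_csv() process
aug_paths <- purrr::map2_chr(paths, yrs, function(p, y) {
  out <- tempfile(fileext = ".csv")
  df <- readr::read_csv(p)
  df$year <- y
  readr::write_csv(df, out)
  out
})
duckdb_read_csv(con, "data_with_year", files = aug_paths)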