
I have multiple huge CSV files that I have to export to Apache Parquet format and split into smaller files based on multiple criteria/keys (= column values). As I understand it, arrow (Apache Arrow) is the R package that allows working with Apache Parquet files.

I work in a shared lab environment and, given the limited RAM (compared to the number of users working simultaneously in this same environment), we are advised to create our data frames in local SQLite databases rather than importing them in-memory (into RAM).

The following pseudo-code shows how I import my CSV files into my local SQLite database. It uses the sqldf and tidyverse packages.

library(sqldf)      # provides sqldf() and read.csv.sql()
library(tidyverse)  # str_c() comes from stringr

input_file_path <- "D:/tmp/mydata.csv"
db_file_path <- "D:/tmp/db_tmp_sqlite.db"
unlink(db_file_path)
sqldf(str_c("attach '", db_file_path, "' as new"))
sqldf(read.csv.sql(
    file = input_file_path,
    sql = "
        create table mytable as
        select
            . . .
        from
            file
    ",
    `field.types` = list(
      . . .
    ),
    ##
    header = TRUE,
    sep = ",",
    eol = "\n",
    dbname = db_file_path,
    drv = "SQLite"
))

This works as expected: my table is created and I can run all required SQL queries, in particular to add supplementary variables (columns in my table) which will be used later as keys to split the export. However, based on the Apache Arrow for R cheatsheet, write_dataset, the function that exports data to Apache Parquet format, requires a data frame.

And that is precisely my problem, because a data frame in R is in-memory, whereas my data, as explained earlier, are in a local SQLite database. This means that I would first have to do a SELECT to pull the whole data into RAM, something like

df <- sqldf("select * from mytable", dbname = ...)

and only then would I be able to use write_dataset with the created df data frame as its first argument in order to export and split my data into Apache Parquet format, as in the sketch below. But this is not what I want to do. The whole point was to put the data in SQLite and not in-memory (RAM), given the existing resource limitations (lack of memory) in our shared environment.
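For reference, a rough sketch of the in-memory step I am trying to avoid (key1/key2 are placeholders, not my real key columns):

arrow::write_dataset(
    df,                                  # the in-memory data frame from the SELECT above
    path         = "D:/tmp/parquet_out", # placeholder output directory
    format       = "parquet",
    partitioning = c("key1", "key2"),    # placeholder key columns
    hive_style   = TRUE
)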

Is there any way to convert to Apache Parquet directly from SQLite within an R program, without first putting the whole data in a data frame before the export, or am I trying to do something which is simply not possible?

user17911

1 Answer


DuckDB has several great features, including the ability to both import and export CSV and parquet formats natively without affecting R memory.

TL;DR

con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")

And that is all. The data is never imported into R. (Now, whether duckdb can do it itself without exhausting memory is another issue I did not validate locally ...)
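(If duckdb's own memory use becomes a concern in a constrained environment, it can be capped and allowed to spill to disk. A minimal sketch, not something I tested above; the 2GB figure and spill path are arbitrary:)

DBI::dbExecute(con, "PRAGMA memory_limit='2GB'")                  # cap duckdb's working memory
DBI::dbExecute(con, "PRAGMA temp_directory='D:/tmp/duck_spill'")  # where it may spill to disk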

Caveat emptor: before you trust this blindly, however, I strongly urge you to do some validation of classes. Most of it can be done easily in a "lazy" fashion using duckdb without having to load the whole frame into R. I encourage you to read more of its documentation for querying CSV/parquet files natively (without loading into R).
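For example, a quick lazy spot-check of the generated file using duckdb's read_parquet table function (a sketch along the same lines as the validation further below):

str(DBI::dbGetQuery(con, "select * from read_parquet('quux3.pq') limit 5"))  # column classes of a few rows
DBI::dbGetQuery(con, "select count(*) as n from read_parquet('quux3.pq')")   # row count, never imported into R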

Methodology

To compare the two methods (via a data.frame, which you don't want to do, and via duckdb), we'll use "RSS" (from ps::ps_memory_info()) to indicate the current R process memory usage. From ?ps::ps_memory_info:

        * 'rss': "Resident Set Size", this is the non-swapped physical
          memory a process has used (bytes). On UNIX it matches "top"'s
          'RES' column (see doc). On Windows this is an alias for
          'wset' field and it matches "Memory" column of 'taskmgr.exe'.

Though an imperfect measure of the true impact to R, it does indicate a significantly smaller impact on R when using DuckDB.
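(For clarity, the snapshots taken below amount to a small helper like this; the code itself just uses the inline calls:)

rss_mb <- function() unname(ps::ps_memory_info()["rss"]) / (1024^2)  # RSS of this R process, in MiB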

Also, each method is done in a fresh instance of R --vanilla. No .Rprofile or site-init files are loaded. The code you see is the code that is executed, nothing more.

In R via data.frame

Sys.getpid()
# [1] 20860

file.info("quux.csv")["size"] / (1024^2) # MBs
#              size
# quux.csv 299.3079
mem1 <- ps::ps_memory_info()["rss"]
dat <- read.csv("quux.csv")
mem2 <- ps::ps_memory_info()["rss"]
arrow::write_parquet(dat, "quux1.pq")
mem3 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, mem3, diff = mem3 - mem1) / (1024^2)
#        rss        rss        rss   diff.rss 
#   57.70703 1218.55859 1548.54688 1490.83984 

This indicates R is 1490MB larger after reading in the full data. (FYI, data.table::fread instead of read.csv results in only a 408MB memory gain under the same austere conditions, sketched below; I'm not trying to optimize this part, though.)

(FYI, these numbers vary for me from run to run and are likely to differ based on other factors outside the scope of this answer. My laptop has 64GB of RAM, so it might not be comparable to exactly what you see.)
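For completeness, the fread variant mentioned above would look roughly like this (a sketch, not rerun for this answer; 'quux1b.pq' is a hypothetical file name):

mem1 <- ps::ps_memory_info()["rss"]
dat  <- data.table::fread("quux.csv")   # faster reader, smaller in-memory footprint
mem2 <- ps::ps_memory_info()["rss"]
(mem2 - mem1) / (1024^2)                # ~408 MB gain in my run; still a full in-R copy of the data
arrow::write_parquet(dat, "quux1b.pq")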

DuckDB, read from CSV, write to parquet

Sys.getpid()
# [1] 32485

mem1 <- ps::ps_memory_info()["rss"]
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")
DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv')) to 'quux2.pq' (format parquet)")
# [1] 1000207
mem2 <- ps::ps_memory_info()["rss"]
c(mem1, mem2, diff=mem2 - mem1) / (1024^2)
#      rss      rss diff.rss 
# 63.23828 86.35938 23.12109 

showing an increase of only 23MB in this process.

Comparing the resulting files.

file.info(list.files(pattern = "quux.*"))["size"] /  (1024^2)
#               size
# quux.csv 299.30786
# quux1.pq  51.69008
# quux2.pq  66.84857

The larger file is due to the class differences noted below. My guess is that if we force some of the character columns to be logical, its file size might be reduced.
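If that matters, the cast can be pushed into the COPY itself. A sketch (V5 and V8 are two of the logical-vs-character columns from the table below; 'quux4.pq' is a hypothetical output name):

DBI::dbExecute(con, "
  copy (
    select * replace (V5::boolean as V5, V8::boolean as V8)
    from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)
  ) to 'quux4.pq' (format parquet)
")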

A little more in-depth look at the contents:

ds1 <- arrow::open_dataset("quux1.pq")
ds2 <- arrow::open_dataset("quux2.pq")
identical(names(ds1), names(ds2))
# [1] TRUE

data.frame(
  ds1 = sapply(head(ds1, 1), function(z) class(z)[1]),
  ds2 = sapply(head(ds2, 1), function(z) class(z)[1])
)
#           ds1       ds2
# V1  character character
# V2    integer   integer
# V3  character character
# V4    integer   integer
# V5    logical character
# V6    integer   integer
# V7  character   POSIXct
# V8    logical character
# V9    numeric   numeric
# V10   numeric   numeric
# V11   numeric   integer
# V12   integer   integer
# V13   integer   integer
# V14   integer   integer
# V15   numeric   numeric
# V16   integer   integer
# V17   integer   integer
# V18   numeric   numeric
# V19   numeric   numeric
# V20   logical character
# V21   numeric   numeric
# V22   numeric   numeric
# V23   numeric   numeric
# V24   integer   integer
# V25   logical character
# V26   integer   integer
# V27   integer   integer
# V28   integer   integer
# V29   integer   integer
# V30   logical character
# V31   logical character
# V32   numeric   numeric
# V33   logical character
# V34   logical character
# V35   logical character
# V36   logical character
# V37   logical character
# V38   logical character
# V39 character   POSIXct
# V40   logical character
# V41   logical character
# V42   numeric   integer
# V43   logical character
# V44   logical character
# V45   logical character
# V46   logical character
# V47   numeric   numeric
# V48   logical character
# V49   logical character
# V50   logical character
# V51   logical character
# V52   logical character
# V53   logical character
# V54   logical character
# V55   logical character
# V56   logical character
# V57   logical character

Some interesting things to deduce from this:

  • two fields are timestamps, and the duckdb method correctly identified them, parsed them, and stored them as numeric timestamps; since I didn't tell R the column classes explicitly, it defaulted to character for them;
  • all of the columns that are logical in ds1 and character in ds2 are all-null (sorry, it was the data I had); the fact that they are different classes indicates that duckdb defaults to string-like nulls instead of "bit", which may or may not be a factor for you;
  • only two columns differ as numeric-vs-integer: V11 was truly integer, so it's fine; the second one, V42, shows that the heuristic used for differentiating between numeric and integer missed something. The first row of V42 that contained any fractional component was row 37159 (found lazily, see the sketch below).
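One way to locate such a row without pulling the CSV into R (a sketch, not the exact code I used; it assumes read_csv_auto preserves the file's row order):

DBI::dbGetQuery(con, "
  select rn, V42
  from (
    select row_number() over () as rn, V42
    from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)
  ) t
  where V42 <> floor(V42)
  limit 1
")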

Fixing data discrepancies

Column V42 indicates that we need to be very cognizant of what is going into and out of that parquet generator. My guess is that the problem is in the "CSV Import" step, and looking at duckdb's CSV Loading documentation suggests the need to change SAMPLE_SIZE. While relatively inefficient, I'll use -1, indicating that duckdb needs to look at all values in a column to determine its class. Slower, yes, but also safer.

Validation of this assumption:

> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv') limit 5")[c("V11","V42")])
'data.frame':   5 obs. of  2 variables:
 $ V11: int  4407 4408 4408 4407 4408
 $ V42: int  26 25 23 21 3
> str(DBI::dbGetQuery(con, "select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) limit 5")[c("V11","V42")])
'data.frame':   5 obs. of  2 variables:
 $ V11: int  4407 4408 4408 4407 4408
 $ V42: num  26 25 23 21 3

Indeed, V11 is still good, and V42 changes from int to num.

After rerunning with this new parameter,

DBI::dbExecute(con, "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)) to 'quux3.pq' (format parquet)")

offline validation confirmed that all values are correct.
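(One way to do such an offline check without loading either file into R, a sketch using a numeric column such as V42 as a checksum:)

DBI::dbGetQuery(con, "select count(*) as n, sum(V42) as s from read_parquet('quux1.pq')")
DBI::dbGetQuery(con, "select count(*) as n, sum(V42) as s from read_parquet('quux3.pq')")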

r2evans
  • Thank you very much for your time and your help, and for such a nice, detailed and complete presentation. I have never tried DuckDB so far, though while I was Googling about this topic I saw multiple pages suggesting this database for Parquet import/export. Obviously, from tomorrow I'm going to read its documentation and, inspired by your presentation, I will change my code accordingly. Thank you very much for your time. – user17911 Feb 08 '23 at 22:05
  • I'm checking the documentation of the COPY command which you suggested here: https://duckdb.org/docs/sql/statements/copy.html. Reading this page, something seems to be missing: there is no way to specify keys (specific variables) based on which the data is split while building the Parquet structure. The arrow package in R contains the 'write_dataset' function with two relevant parameters: I set hive_style = TRUE, which then allows me to specify the desired keys as a vector of strings (variable names) in the 'partitioning' parameter. Does DuckDB allow such a thing? – user17911 Feb 10 '23 at 19:31
  • I looked for that when I wrote the answer, and didn't find something that was clear. I've tested various options that can be used with duckdb's `scan_parquet` (e.g., `HIVE_PARTITIONING=1`), but that does not appear to work when exporting data. I think you have two options: (1) create the hive structure manually, and create files iteratively within the `SELECT ...` clause; (2) save to the monolithic parquet file, then using R you can iteratively load the _subset_ (one group at a time) and then use `write_parquet` to save using its native hive-partitioning. Not great, perhaps a feature-request? – r2evans Feb 10 '23 at 19:44
  • What I find strange is that many examples in the Arrow documentation are about "Manipulate Larger-than-Memory Datasets" and use the collect() method of dplyr to read huge amounts of data in streaming mode. Apparently nobody has ever thought about writing in streaming mode. – user17911 Feb 10 '23 at 22:49
  • The parquet format is immutable: once a file is saved, it cannot be appended to. However, `arrow::open_dataset` can read a directory (nested, even) of parquet files, combining them into one virtual data source. So while one cannot append to a file, one can add another file and have it appear as if appended. The only downside is that doing so with hundreds of files slows it down a bit. – r2evans Feb 11 '23 at 03:27
  • It may seem a rather strange question, but is it really mandatory to create a multi-file dataset? The fact that I went this way was because of the examples provided in the documentation stating that multi-file was the way to go when the size is bigger than the RAM. But if the collect() method of dplyr allows one to control that only what is required is read and put into the data.frame, does it really matter whether the dataset is a single file or multi-file? If upon reading the user can specify partitioning as conditions on a single dataset and the result is the same, why bother with a multi-file dataset? – user17911 Feb 11 '23 at 12:41
  • It's never mandatory to do a multi-file dataset. It comes in handy when the data is not final and at some point you get a few more rows. The options are to read in all data, concatenate, and save to only one file (removing the previous), or to save the new data to a new file and put it in the same directory. Many use-cases (i.e., stable, unchanging data) will not benefit from multi-file datasets. – r2evans Feb 11 '23 at 14:35
  • Good news! Apparently hive-style is going to be available in the next release: https://github.com/duckdb/duckdb/pull/5964 – user17911 Feb 12 '23 at 13:55
  • After a week of unsuccessful tests with the new version of DuckDB supporting Hive style, I still have out-of-memory problems. Consequently, I think I have to try what you suggested: "... create the hive structure manually, and create files iteratively within the SELECT ... clause ..." (see the sketch after these comments). So in your opinion, if I create the nested directory structure iteratively inside loops, with the same naming convention (key=value), and put within each directory only a single Parquet file, will this work the same way for the reader as any other multi-file dataset? (Because I thought the directories were interdependent.) – user17911 Feb 18 '23 at 19:44
  • yes, that should work – r2evans Feb 18 '23 at 23:09
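For reference, a minimal sketch of the "manual hive structure" option discussed in the comments, assuming a single partition key column named grp (a placeholder, not a column of the data above):

keys <- DBI::dbGetQuery(con,
  "select distinct grp from read_csv_auto('quux.csv', SAMPLE_SIZE=-1)")$grp
for (k in keys) {
  dir <- file.path("dataset", paste0("grp=", k))          # hive-style key=value directory
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  DBI::dbExecute(con, sprintf(
    "copy (select * from read_csv_auto('quux.csv', SAMPLE_SIZE=-1) where grp = '%s')
     to '%s' (format parquet)",
    k, file.path(dir, "part-0.parquet")))
}
arrow::open_dataset("dataset")   # should then see the key=value directories as hive partitions

Rescanning the CSV once per key is not efficient; loading it once into a duckdb table (CREATE TABLE ... AS SELECT ...) and looping over that table would avoid the repeated parses.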