I have a dataset called df where I have year, month and day variables. I would like to use the write_dataset function to output a folder with the standard arrow dataset syntax as in the following image:

[image: screenshot of the output folder layout]

Within each folder there will be month=1, month=2, and so on.

Now, in order to create this I have used the following code:

library(dplyr)

# group_by() stores the grouping columns on the data frame itself
df <- df %>% group_by(year, month, day)
output_folder <- "my/path"
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")

However, my dataset size is too big, and I would like to use data.table to take advantage of fast grouping. My approach to do the same has been the following:

library(data.table)

grouping_cols <- c("year", "month", "day")
setkeyv(df, grouping_cols)  # sorts and keys df; assumes df is already a data.table

arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")

However, now the output is not partitioned: a single .parquet file is written, which does not take full advantage of arrow::write_dataset.

Is there any way to have the same dataset grouped by specified columns but based on data.table instead of dplyr groupings?

  • FYI I removed the [tag:arrow-functions] tag because it's meant to refer to a javascript syntax – camille Apr 05 '23 at 16:00
  • Can you elaborate on "my dataset size is too big"? Are you getting an error, is it taking longer than you’d like, something else? – zephryl Apr 05 '23 at 16:09
  • FYI, the title's `data.table grouping` is very different from `group_by(year, month, day)`. Grouping applied by `data.table` tends to be ephemeral (each `[`-calc using its `by=` argument), and is not stored in the data.table object itself. This is in contrast to `group_by` which attaches an attribute to the frame that follow-on verbs can use. You can perhaps define `grouping_cols` dynamically by using `intersect(names(attr(df, "groups")), names(df))` – r2evans Apr 05 '23 at 16:36

1 Answer

If you look at the docs, the default value of the partitioning parameter is dplyr::group_vars(dataset), i.e. whatever dplyr grouping variables the dataset carries. data.table has no stored equivalent of that grouping (setting a key does not count), so you have to supply the parameter yourself whenever the input is not a grouped dplyr object.

arrow::write_dataset(df,
                     output_folder,
                     partitioning = grouping_cols,
                     format = "parquet")
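
To check the result end to end, here is a minimal sketch on toy data. The values in df, the value column, and the temporary output path are made up for illustration and are not from the question; it assumes the data.table and arrow packages are installed.

```r
library(data.table)
library(arrow)

# Toy stand-in for the asker's `df` (made-up values, hypothetical `value` column)
df <- data.table(
  year  = c(2022L, 2022L, 2023L, 2023L),
  month = c(1L, 2L, 1L, 1L),
  day   = c(5L, 6L, 7L, 8L),
  value = c(1.5, 2.5, 3.5, 4.5)
)

grouping_cols <- c("year", "month", "day")
setkeyv(df, grouping_cols)  # optional: the key alone does not partition the output

# Hypothetical output location inside the session's temp directory
output_folder <- file.path(tempdir(), "partitioned_demo")

# Pass the partition columns explicitly, since there is no dplyr grouping to pick up
write_dataset(df,
              output_folder,
              format = "parquet",
              partitioning = grouping_cols)

# List what was written: one file per year/month/day combination,
# under hive-style folders such as year=.../month=.../day=...
written <- list.files(output_folder, recursive = TRUE)
print(written)

# The partitioned folder can be opened again as a single dataset
ds <- open_dataset(output_folder)
print(ds)
```

Since setkeyv() stores the key columns on the data.table, partitioning = key(df) should work equally well here if you would rather not keep a separate grouping_cols vector around.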
Dean MacGregor