I have a dataset called df with year, month and day variables. I would like to use the write_dataset function to output a folder with the standard arrow dataset layout, as in the following image:
Within each folder there will be month=1, month=2, and so on.
Now, in order to create this I have used the following code:
library(dplyr)

# Group by the partition columns; write_dataset picks up the dplyr groups
df <- df %>% group_by(year, month, day)
output_folder <- "my/path"
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")
However, my dataset is too big, and I would like to use data.table to take advantage of its fast grouping. My approach to do the same has been the following:
library(data.table)

# Key (sort) the data.table by the intended partition columns
grouping_cols <- c("year", "month", "day")
setkeyv(df, grouping_cols)
arrow::write_dataset(df,
                     output_folder,
                     format = "parquet")
However, now the result is not grouped and a single .parquet file is returned (not fully utilizing the potential of arrow::write_dataset).
Is there any way to have the same dataset grouped by the specified columns, but based on data.table instead of dplyr groupings?
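
One thing that may be relevant: as far as I can tell from the documentation, write_dataset has a partitioning argument whose default is dplyr::group_vars(dataset). That would explain why the dplyr version is partitioned while the keyed data.table is not, since a data.table key is not a dplyr grouping. Below is a minimal sketch of passing the columns explicitly. The toy data and the temporary output path are made up for illustration and are not my real dataset.

```r
library(data.table)
library(arrow)

# Toy data standing in for the real `df` (hypothetical values)
df <- data.table(
  year  = c(2020L, 2020L, 2021L),
  month = c(1L, 2L, 1L),
  day   = c(5L, 6L, 7L),
  value = c(1.5, 2.5, 3.5)
)

grouping_cols <- c("year", "month", "day")

# Temporary folder used only for this illustration
output_folder <- file.path(tempdir(), "partitioned_dataset")

# Pass the partition columns by name instead of relying on dplyr groups
write_dataset(df,
              output_folder,
              format = "parquet",
              partitioning = grouping_cols)

# List the files that were written, relative to the output folder
written_files <- list.files(output_folder, recursive = TRUE)
print(written_files)
```

If this works the way I read the documentation, setkeyv would not be needed for the partitioning itself, only for whatever sorting or joins are done beforehand.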