I have a large dataframe containing multiple variables (categorical and numerical), which I want to split into smaller ones such that each smaller file that I export (as CSV) is not larger than 1 MB. I would also like to do this in the tidy style, without looping.
Here's an example:
df <- data.frame(matrix(runif(42000*11), ncol = 11)) %>%
mutate(ID = rep(LETTERS, length.out = 42000)) %>%
arrange(ID)
write_csv(df, "df.csv") # 9.0 MB file
Here, I would ideally split df
into 9 or 10 different files, each not more than 1MB. Also, since there are grouping variables like ID
, the split should not happen in the middle of a group.
I came across split()
and group_split()
in other threads here on SO, but none of them mention splitting based on file size. I guess this will require a custom function? Or maybe there is some rows:filesize ratio that I can use to decide the maximum number of rows any smaller dataframe should have and proceed that way?