
I have a large dataframe containing multiple variables (categorical and numerical), which I want to split into smaller dataframes so that each file I export (as CSV) is no larger than 1 MB. I would also like to do this in the tidy style, without explicit loops.

Here's an example:

library(dplyr)
library(readr)

df <- data.frame(matrix(runif(42000 * 11), ncol = 11)) %>% 
  mutate(ID = rep(LETTERS, length.out = 42000)) %>% 
  arrange(ID)

write_csv(df, "df.csv") # ~9.0 MB file

Here, I would ideally split df into 9 or 10 different files, each no larger than 1 MB. Also, since there are grouping variables like ID, a split should never fall in the middle of a group.

I came across split() and group_split() in other threads here on SO, but none of them mention splitting based on file size. I guess this will require a custom function? Or maybe there is some rows-to-filesize ratio I can use to decide the maximum number of rows any smaller dataframe should have and proceed that way?

  • There are no built-in functions that will tell you how large a data.frame will be when written to a CSV file. You'd have to estimate based on the number of columns, how long average string values are, and how many digits are required to maintain precision for numeric values. CSV files are just text, so everything has to be recoded into a string. Are you able to make strong assumptions about your data? Or do you really need to infer everything from the data frame itself? How close to 1 MB does each chunk have to be? – MrFlick Feb 15 '22 at 06:19
  • This post might be helpful: https://stackoverflow.com/questions/35917731/r-how-to-estimate-csv-file-size-prior-to-writing-it-to-disk – Ritchie Sacramento Feb 15 '22 at 06:37
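
One way to make the rows-to-filesize idea concrete (a sketch along the lines the comments suggest, not a tested answer): write the CSV once to measure average bytes per row, then greedily pack whole ID groups into chunks that stay under the budget. Everything below assumes the df from the question; target_bytes, assign_chunks, and the df_%02d.csv file names are illustrative choices, not established API.

    library(dplyr)
    library(purrr)
    library(readr)

    target_bytes <- 1e6  # 1 MB budget per output file

    # Estimate bytes per row by writing the full CSV once (since df.csv was
    # already written above, file.size("df.csv") would work just as well).
    tmp <- tempfile(fileext = ".csv")
    write_csv(df, tmp)
    bytes_per_row <- file.size(tmp) / nrow(df)
    unlink(tmp)

    # Greedily pack whole groups into chunks: start a new chunk whenever
    # adding the next group would push the current one past the budget.
    # Assumes no single group is itself larger than target_bytes.
    assign_chunks <- function(group_bytes, target) {
      states <- accumulate(
        group_bytes,
        function(st, s) {          # st = c(chunk_id, bytes_used_so_far)
          if (st[2] + s > target) c(st[1] + 1, s) else c(st[1], st[2] + s)
        },
        .init = c(1, 0)
      )
      map_dbl(states[-1], 1)       # drop .init, keep the chunk ids
    }

    chunks <- df %>%
      count(ID, name = "rows") %>%
      mutate(chunk = assign_chunks(rows * bytes_per_row, target_bytes))

    df %>%
      left_join(select(chunks, ID, chunk), by = "ID") %>%
      group_split(chunk) %>%
      iwalk(~ write_csv(select(.x, -chunk), sprintf("df_%02d.csv", .y)))

Because whole groups are packed, files will typically come in under the budget rather than at it: with 26 roughly equal groups of about 0.35 MB each, two groups fit per chunk but three would exceed 1 MB, so this should produce around 13 files of roughly 0.7 MB rather than the ideal 9 or 10.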
