Alternatively, with the tidyverse, use readr::read_csv_chunked together with readr::write_csv, like so:
library(tidyverse)

inputfile = "./commodity_trade_statistics_data.csv"
chunk.size = 100 * 1000 * 5

proc.chunk = function(df, pos) {
  # pos is the row position in the original file where this chunk starts,
  # so floor(pos / chunk.size) numbers the output files 0, 1, 2, ...
  df %>%
    mutate(row_idx = seq(pos, length.out = nrow(df))) %>%  # optional: preserve the row number within the original csv
    write_csv(
      paste0("chunk_", floor(pos / chunk.size), ".csv")
    )
}

read_csv_chunked(
  inputfile,
  callback = SideEffectChunkCallback$new(proc.chunk),
  chunk_size = chunk.size,
  progress = TRUE  # optional: show a progress bar
)
Within readr::read_csv_chunked, one can also set the parameter locale via readr::locale, which lets you define the datetime format and the time zone.
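For example (a minimal sketch; the date/time format strings and the time zone below are assumptions, adjust them to match the actual file):

read_csv_chunked(
  inputfile,
  callback = SideEffectChunkCallback$new(proc.chunk),
  chunk_size = chunk.size,
  locale = locale(
    date_format = "%d/%m/%Y",   # assumed: day-first dates
    time_format = "%H:%M:%S",   # assumed: 24-hour clock
    tz = "UTC"                  # time zone used when parsing datetimes
  )
)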
To "lazify" the estimation of chunck.size
, I wrote a crude function based on simple statistics:
calc.chunk.size = function(
  file,
  limit.mb.per.file = 100,    # limit in MB per output file
  floor.to.nrow = 10000,      # round the chunk size down to a multiple of this, for easier numbers
  size.estimate.sigma = 5,    # assume row size is roughly normally distributed and the first n rows are representative;
                              # 3 sigma leaves ~0.1% chance of a file exceeding the size limit,
                              # 5 sigma is, well, closer to Six Sigma :)
  sample.first.nrow = 100L
) {
  # requires package "R.utils"
  tot.nrow = R.utils::countLines(file)  # informational; not needed for the estimate below
  # alternatively, on a POSIX machine (Linux or macOS):
  # tot.nrow = system("wc -l ./commodity_trade_statistics_data.csv", intern = TRUE) %>%
  #   str_extract("\\d+") %>%
  #   as.numeric()

  bytes.per.row = read_lines(file,
                             skip = 1,                  # skip the header, or the 1st row anyway ...
                             n_max = sample.first.nrow  # take a sample of at most the first N rows
                            ) %>%
    nchar(type = "bytes") %>%
    { list(mean = mean(.), sd = sd(.)) }

  est.bytes.per.row = bytes.per.row$mean + size.estimate.sigma * bytes.per.row$sd

  est.chunk.size = limit.mb.per.file * 1000 * 1000 / est.bytes.per.row
  est.chunk.size = max(c(1, floor(est.chunk.size / floor.to.nrow))) * floor.to.nrow
  est.chunk.size
}
chunk.size = calc.chunk.size(inputfile)
# chunk.size = 540000 for this particular file
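As a quick sanity check (a sketch only; it assumes the chunk_<n>.csv files were written to the working directory as above), one can count the rows in each generated chunk:

chunk.files = list.files(pattern = "^chunk_\\d+\\.csv$")
sapply(chunk.files, function(f) nrow(read_csv(f, col_types = cols())))
# each count should be at most chunk.size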