Large file processing - error using chunked::read_csv_chunked with dplyr::filter

Question

When I use chunked::read_chunkwise together with dplyr::filter in a pipe, I get an error whenever the filter returns an empty dataset for any of the chunks, that is, whenever all the rows of a given chunk are filtered out.

Here is a modified example, drawn from the package chunked help file:

library(chunked); library(dplyr)

# create csv file for demo purposes
in_file <- file.path(tempdir(), "in.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

# reading chunkwise and filtering
women_chunked <-
  read_chunkwise(in_file, chunk_size = 3) %>%  # read only a few lines per chunk for the purpose of this example
  filter(height > 150)  # women$height only ranges from 58 to 72, so this filters out every row:
                        # each chunk, including the first one (first 3 rows), returns an empty table

# Trying to print the output returns an error message
women_chunked
# > Error in UseMethod("groups") : 
# > no applicable method for 'groups' applied to an object of class "NULL"

# As does, of course, trying to write the output to a file
out_file <- file.path(tempdir(), "processed.csv")
women_chunked %>%
  write_chunkwise(file = out_file)
# > Error in read.table(con, nrows = nrows, sep = sep, dec = dec, header = header,  : 
# > first five rows are empty: giving up

I am working on many csv files, each with 50 million rows, so I will often end up in a similar situation, where the filtering returns an empty table for at least some chunks.

I couldn't find a solution or any post related to this problem. Any suggestions? I do not think the sessionInfo output is useful in this case, but please let me know if I should post it anyway. Thanks a lot for any help!

It makes sense that it returns an error, if no data would be read. You could use `try()` to catch the error and then handle the errors as you wish. — ek-g, Jul 01 '20 at 09:09
Thanks for your quick reply! If you apply a filter to a data frame (without using read_chunkwise), you do not get an error, just an empty data frame (but still with a header), which is very easy to handle. Getting an error on the first chunk, for example, prevents the function from continuing to the next chunks. Finally, try() does not seem to work in this situation (or I did not understand how I should apply it). — arnelton, Jul 01 '20 at 09:20
I wonder if `tryCatch` would work in a non-piping solution... — Roman Luštrik, Jul 01 '20 at 09:30
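
A possible workaround along the lines of the non-piping idea above, using base R only. This is a minimal sketch, not a fix for chunked itself: `filter_csv_in_chunks` and its `keep` argument are hypothetical names (not part of chunked or dplyr), and it assumes a plain comma-separated file with a single header line and no quoted fields containing commas or line breaks. Empty chunks are simply skipped, and the output always keeps its header.

```r
# Hypothetical helper (not part of chunked or dplyr): filter a CSV file
# chunk by chunk with base R, skipping chunks that end up empty.
filter_csv_in_chunks <- function(in_file, out_file, keep, chunk_size = 3) {
  con <- file(in_file, open = "r")
  on.exit(close(con))

  # Read the header once; assumes simple comma-separated column names.
  col_names <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
  # Start the output file with the header only.
  writeLines(paste(col_names, collapse = ","), out_file)

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break  # end of file reached

    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    # which() drops NA conditions; a zero-row result is fine here.
    kept <- chunk[which(keep(chunk)), , drop = FALSE]

    if (nrow(kept) > 0) {
      write.table(kept, out_file, append = TRUE, sep = ",",
                  row.names = FALSE, col.names = FALSE, quote = FALSE)
    }
  }
  invisible(out_file)
}

# Usage with the same demo data as in the question
in_file <- file.path(tempdir(), "in.csv")
out_file <- file.path(tempdir(), "processed.csv")
write.csv(women, in_file, row.names = FALSE, quote = FALSE)

filter_csv_in_chunks(in_file, out_file, keep = function(d) d$height > 150)
read.csv(out_file)  # no rows, but the header is still there
```

Because column types are guessed separately for each chunk, this is only meant for row filtering; any per-column processing would need explicit `colClasses`.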
