
I have 10 very large CSV files (which may or may not have the same headers) that I am reading in and processing consecutively in chunks using the "readr" package's read_csv_chunked(). Currently, I am able to read the 10 files in parallel using 10 cores, but the process still takes an hour. I have 128 cores. Can I split each of the 10 files into 10 chunks and process those chunks in parallel as well, hence utilizing 100 cores? Here is what I currently have (creating two sample files only for testing):

library(doParallel)
library(foreach)
library(plyr)                    # rbind.fill() is used as .combine below
registerDoParallel(cores = 10)   # register a backend so %dopar% actually runs in parallel

# Create a list of two sample CSV files and a filter by file
df_1 <- data.frame(matrix(sample(1:300), ncol = 3))
df_2 <- data.frame(matrix(sample(1:200), ncol = 4))
filter_by_df <- data.frame(X1 = 1:100)

write.csv(df_1, "df_1.csv", row.names = FALSE)
write.csv(df_2, "df_2.csv", row.names = FALSE)

files <- c("df_1.csv", "df_2.csv")

# Create a function to read and filter each file in chunks
my_function <-
  function(file) {
    library(dplyr)
    library(plyr)
    library(readr)
    
    filter_df <-
      function(x, pos) {
        subset(x, X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1)
      }
    
    readr_df <-
      read_csv_chunked(file,
                       callback = DataFrameCallback$new(filter_df),
                       progress = FALSE,
                       chunk_size = 50) %>%
      as.data.frame() %>%
      distinct()
    return(readr_df)
  }

# Apply the custom function created above to all files in parallel and combine them
df_foreach <-
  foreach(i = files, .combine = rbind.fill, .packages = c("plyr")) %dopar% my_function(i)

I am researching nested foreach() but am not sure how to pass different functions in a nested manner (read_csv_chunked() with .combine = rbind and my custom function filter_df() with .combine = rbind.fill()). I also looked into the "future" package. Any advice is greatly appreciated. Thank you.
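
To show the kind of nesting I have in mind, here is a rough, untested sketch using the `%:%` nesting operator from foreach. `n_chunks` and `rows_per_chunk` are placeholders I would size from the real row counts, and `skip` still has to scan past earlier lines, so this may not help if the read itself is the bottleneck:

library(doParallel)
library(foreach)
library(plyr)

registerDoParallel(cores = 100)

files          <- c("df_1.csv", "df_2.csv")
n_chunks       <- 10   # placeholder: pieces per file
rows_per_chunk <- 50   # placeholder: rows per piece, sized from the real row counts

df_nested <-
  foreach(file = files, .combine = rbind.fill) %:%
    foreach(k = seq_len(n_chunks), .combine = rbind, .packages = "readr") %dopar% {
      # re-read only the header line to get the column names
      hdr <- names(read_csv(file, n_max = 0))
      chunk <- read_csv(file,
                        col_names = hdr,
                        skip  = (k - 1) * rows_per_chunk + 1,  # +1 skips the header line
                        n_max = rows_per_chunk)
      subset(chunk, X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1)
    }

# a distinct() across df_nested would still be needed afterwards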

Annabanana
  • The delay may not be related to the number of processors, but limited by the speed of the IO connections. Having more open IO connections may make things worse, not better. – Dave2e Feb 28 '23 at 23:09
  • Each chunk of a csv file still reads from the beginning to find its chunk start. For parallel speed, you're better off either splitting the csv file into many files or converting to a binary format where chunk starts can be computed without reading. – George Ostrouchov Mar 01 '23 at 04:37
  • @Dave2e, Thank you for your response. The largest of the files takes about an hour to run by itself. So that is not the issue. That is why I was hoping that processing each file in chunks in parallel could speed up the process. – Annabanana Mar 01 '23 at 15:30
  • @GeorgeOstrouchov, I will look into both options. Thank you. – Annabanana Mar 01 '23 at 15:32
  • Do you really have 128 cores or is this a virtual machine? Some specific details about your computer and the files (size on disk, ncol, nrow) might be useful for such a question. – Hugh Mar 09 '23 at 13:24
  • @Hugh, it's a dedicated RStudio server I virtually sign into. `library(parallel)` `detectCores()` gives me 128. There are 10 files with 62 to 63 cols having an average of 160000000 rows each. – Annabanana Mar 09 '23 at 21:22
  • Do you share this server (i.e. do you actually have 128 cores just for you and your disk)? On my 8 year old computer with 10 cores a file with more rows and columns is read in about 90s. I suspect this is a hardware or server optimization issue rather than something solvable at the R level. – Hugh Mar 10 '23 at 01:24
  • @Hugh, yes. It takes about 10 minutes to process them in parallel in bash. I don't understand why R takes an hour. Any suggestions on how to resolve this? – Annabanana Mar 10 '23 at 15:43
  • Just a small comment: you can also try [futures](https://cran.r-project.org/web/packages/doFuture/vignettes/doFuture.html) for asynchronous parallelism, which might reduce the wait time somewhat. – runr Mar 11 '23 at 00:51
  • Do you have a specific code example you could post as a solution @runr? – Annabanana Mar 12 '23 at 04:40
  • @Annabanana maybe it would suffice to apply `doFuture` straightforwardly on your best working code, unless you've already tried that? I.e., following your example, `library(doFuture); registerDoFuture(); plan(multisession); df_foreach <- foreach(i = files, .combine = rbind.fill, .packages = c("plyr")) %dopar% my_function(i)`? If that works, the next step could be to re-optimize the chunk size for futures. – runr Mar 12 '23 at 23:24
  • Thank you @runr. I have tried `future` but I will try your specific code and keep you posted on performance. – Annabanana Mar 13 '23 at 15:12
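
To make the file-splitting idea from the comments above concrete, here is a rough, untested sketch. The file name `big.csv` is hypothetical, GNU `split` is assumed to be on the PATH, and since only the first piece would keep the header, the original column names are re-applied to every piece:

library(readr)
library(dplyr)

# Strip the header, then cut the data rows into numbered pieces of 10M lines each
system("tail -n +2 big.csv | split -d -l 10000000 - big_part_")

# Re-apply the original header to every piece
hdr    <- names(read_csv("big.csv", n_max = 0))
pieces <- list.files(pattern = "^big_part_")

filtered <- parallel::mclapply(pieces, function(p) {
  read_csv(p, col_names = hdr) |>
    filter(X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1)
}, mc.cores = 10) |>
  bind_rows() |>
  distinct()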

3 Answers


As mentioned by Sirius, data.table::fread() is hands down the fastest csv reader for R, and the built-in multi-threading should make good use of the resources you have at your disposal.
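
For example, that built-in threading can be checked and capped explicitly (the thread count below is arbitrary, just to illustrate):

library(data.table)

setDTthreads(32)   # cap data.table at, say, 32 of the 128 cores
getDTthreads()     # confirm the current setting

# nThread defaults to getDTthreads(); spelled out here only for clarity
dt <- fread("df_1.csv", nThread = getDTthreads())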

One other thought though - since you only want a subset of the rows in your final result, the arrow package is a great option here. Instead of reading the entire file into memory, you can use arrow's "push-down" capabilities to scan a multi-file dataset and only read the rows matching your criteria into memory.

arrow partially implements the functionality of dplyr, so depending on your experience, the syntax may already be familiar.

library(arrow)
library(dplyr)

DS <- arrow::open_dataset(sources = c("df_1.csv", "df_2.csv"),
                          format = "csv")

DS |> 
  filter(X1 %in% filter_by_df[["X1"]]| X2 %in% filter_by_df[["X1"]] ) |> 
  distinct() |> 
  collect() 

One note - this answer assumes an R version of 4.1 or greater. For earlier R versions, the magrittr pipe %>% can be used in lieu of the base R pipe, |>, which was introduced in 4.1.

Matt Summersgill
  • I have never used arrow and may need some clarification. Are you suggesting to incorporate arrow with fread() or is your solution above the full code? – Annabanana Mar 08 '23 at 21:53
  • I am proposing to use functions from the arrow package _instead_ of `fread()` for this use case. The code in the answer covers all the steps you outlined in your question. – Matt Summersgill Mar 08 '23 at 22:02
  • unfortunately the environment I have at my disposal does not permit me to install new packages for security reasons. I do not currently have arrow library. I can try to request and test your solution but I am not sure when and if my request will be granted. I tried sqldf to filter before loading as well as embedded "awk" statement. I also tried mclapply() from parallel. All were slower than foreach() from foreach. I was hoping there was a way to do nested foreach() as it is fast and available to me. – Annabanana Mar 08 '23 at 22:34
  • That's too bad - if your work involves working with large data sets (10-100GB) on a regular basis, I'd really recommend doing some reading on the Apache Arrow project and the associated R package and pushing to get it installed. Outside of that, `fread()` coupled with command line tools like `awk` to avoid loading the entire dataset into memory are your best bet. If you want to continue on the path of splitting and processing in parallel, the command line tool `split` is probably worth looking into. – Matt Summersgill Mar 09 '23 at 14:17
  • Thanks, @MattSummersgill. I talked my admin into installing it but got a "Cannot install arrow: a C++17 compiler is required..." – Annabanana Mar 09 '23 at 21:03
  • Hmm. What operating system are you on? Installing pre-built binaries might be a workaround to avoid issues with the build toolchain - some of the options are outlined here: https://arrow.apache.org/docs/r/articles/install.html – Matt Summersgill Mar 09 '23 at 21:23
  • Red Hat Version 7.6. – Annabanana Mar 09 '23 at 21:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/252425/discussion-between-matt-summersgill-and-annabanana). – Matt Summersgill Mar 09 '23 at 21:34
  • Perhaps your system allows you to work with containers (eg. docker, podman or singularity)? (Then you can freely install anything you like without tampering with the host system) – Sirius Mar 09 '23 at 23:37
  • @Sirius, no, cannot install anything unfortunately. – Annabanana Mar 10 '23 at 19:49
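
Following up on the pre-built binaries mentioned above: the installation article linked in the comments describes an environment variable that asks for a pre-compiled libarrow rather than a source build, which might sidestep the C++17 requirement. A sketch based on that article (an assumption on my part, untested on Red Hat 7.6):

# Ask the arrow install script to download a pre-built libarrow C++ binary
# rather than compiling it locally (the step that needs a C++17 compiler)
Sys.setenv(LIBARROW_BINARY = "true")
install.packages("arrow")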

For processing large csv files I would only ever use fread() from data.table.

One way to achieve what you want would be this:

library(data.table)
library(purrr)

# Create a list of two sample CSV files and a filter by file
set.seed(100)
df_1 <- data.frame(matrix(sample(1:300), ncol = 3))
df_2 <- data.frame(matrix(sample(1:200), ncol = 4))
filter_by_df <- data.frame(X1 = 1:100)

write.csv(df_1, "df_1.csv", row.names = FALSE)
write.csv(df_2, "df_2.csv", row.names = FALSE)

files <- c("df_1.csv", "df_2.csv")

filedata <- map(
    files,
    ~{
        d <- fread(.x)
        d[ X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1 ]
    }
)

df_final <- rbindlist( filedata, fill=TRUE )

head( df_final )

It produces this:

> head( df_final )
    X1  X2  X3 X4
1: 206 100  54 NA
2:   4 276 144 NA
3:  98  27 205 NA
4:   7 159 134 NA
5: 146  71 180 NA
6: 258  39 176 NA

As the others have alluded to, your system most likely can parse csv faster than the storage backend can serve the csv data anyway, so parallelizing over multiple cores might not help, or would help only marginally.

Also, as @Roland pointed out (before I edited out my first parallelized approach), fread is already parallelized, so there is no need to parallelize it again.

You should go ahead and see how fast fread() can read your file, and then see whether reading the same file at the same time from two different R sessions runs at the same speed or slower.
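
Something along these lines would separate raw disk throughput from parsing cost and time both readers on a single file (the path, byte cap and chunk size are placeholders; filter_df() is the callback from the question):

library(data.table)
library(readr)

one_file <- "df_1.csv"   # placeholder: point this at one of the real 30-56 GB files

# Raw read of the first ~2 GB: a rough estimate of what the storage can deliver
n_bytes <- min(file.size(one_file), 2e9)
system.time(raw <- readBin(one_file, what = "raw", n = n_bytes))

# Single-file parse with fread (multi-threaded)
system.time(dt <- fread(one_file))

# Single-file parse in chunks with readr, as in the original approach
system.time(
  chunked <- read_csv_chunked(one_file,
                              callback = DataFrameCallback$new(filter_df),
                              chunk_size = 1e6,
                              progress = FALSE)
)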

Sirius
  • Don't do this. `fread` is parallelized internally. (Use `data.table::setDTthreads` to set the number of CPUs.) Read the files sequentially. – Roland Mar 07 '23 at 07:48
  • @Sirius, thank you for the recommendation. I tested your solution and instead of 50 minutes with my original code that uses foreach() and read_csv_chunked(), it's been running for the past 5 hours. – Annabanana Mar 08 '23 at 21:21
  • That's.. interesting. How large are these files exactly, and how much RAM does your system have? – Sirius Mar 08 '23 at 22:08
  • About 264 gigs. Files vary between 30 and 56 GB. – Annabanana Mar 09 '23 at 21:36
  • Could you time reading a single file with fread? And then time the same file with read_chunked? – Sirius Mar 10 '23 at 16:18
  • @Sirius, as I mentioned in another comment, `read_csv_chunked()` takes 25 to 50 minutes depending on file size (making the parallel process as long as the longest file, 50 min) while `fread()` ran for eight hours before I killed it. – Annabanana Mar 10 '23 at 19:54
  • Yes I saw, but was that fread on a single file or all in parallel (which was my less-than-optimal first suggestion)? What if you pick one of the files, try to read it with fread, then with chunked? If that's what you did and it's 8 hours, then fine! – Sirius Mar 11 '23 at 17:09
  • @Sirius 8+ hours with `fread()` was just for one file. One file with `read_csv_chunked()` took 25-50 minutes per file. – Annabanana Mar 12 '23 at 17:47
  • thanks, seems fread is not the way to go – Sirius Mar 12 '23 at 20:06

Your situation in terms of hardware and size of data seems like it is not easy to replicate for the average user here. There were already some good insights shared by others, but ultimately, without access to your hardware and data, it all stays speculative.

So what you could do is run some benchmarks to optimise your approach, as different tools will work better depending on the bottleneck of your setup. Here are some tips I found useful when benchmarking.

test data

Choose a representative set of data. In your original question, the datasets/files were rather small, which creates unique challenges (e.g., parallelization creates some overhead, which might kill any gains in speed for smaller files, but on larger files you might see huge improvements as the overhead becomes negligible compared to the total run time). I use the test data and approach from the question but increase the number of data points.

df_1 <- data.frame(matrix(sample(1:30000), ncol = 3))
df_2 <- data.frame(matrix(sample(1:20000), ncol = 4))
filter_by_df <- data.frame(X1 = 1:100)

write.csv(df_1, "df_1.csv", row.names = FALSE)
write.csv(df_2, "df_2.csv", row.names = FALSE)

the functions

The benchmarking tool I recommend (bench) wants to compare different functions, so it makes sense to wrap your code into a function per approach (there are workarounds but it’s easier this way).

The original approach with furrr for parallelisation:

library(dplyr)
library(readr)
library(furrr)

filter_df <- function(x, pos) {
  subset(x, X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1)
}

my_function <- function(files, cores) {
  plan(multisession, workers = cores)
  future_map_dfr(files, function(file) {
    suppressMessages(read_csv_chunked(file,
                                      callback = DataFrameCallback$new(filter_df),
                                      progress = FALSE,
                                      chunk_size = 50)) %>%
      as.data.frame() %>%
      distinct()
  })
}

The arrow function by Matt Summersgill:

library(arrow)

my_arrow <- function(files, cores) { # cores are ignored, just for comparability
  
  DS <- arrow::open_dataset(sources = files,
                            format = "csv")
  
  DS |> 
    filter(X1 %in% filter_by_df[["X1"]]| X2 %in% filter_by_df[["X1"]] ) |> 
    distinct() |> 
    collect()
}

Two slightly altered versions of Sirius' data.table approach:

library(data.table)
my_data_table_native <- function(files, cores) {
  setDTthreads(cores)
  data.table::rbindlist(lapply(files, data.table::fread), fill = TRUE)[
    X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1
  ]
}


my_data_table_furrr <- function(files, cores) {
  setDTthreads(cores)
  plan(multisession, workers = cores)
  future_map_dfr(files, function(file) {
    data.table::fread(file)[ X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1 ]
  })
}

benchmarks

I recommend the excellent bench package for benchmarking as it can run your functions in a grid, produces very detailed data and makes some really nice plots for comparison. I set it up to test two things:

  • what happens with different numbers of files
  • what difference does parallelisation make
results <- bench::press(
  n_files = c(2, 5, 10, 25, 50, 100, 200, 1000), # test different resamplings
  n_cores = 1:4, # test different numbers of cores
  {
    bench_sample <- sample(c("df_1.csv", "df_2.csv"), n_files, replace = TRUE)
    bench::mark(
      my_function = my_function(bench_sample, n_cores),
      my_arrow = my_arrow(bench_sample, n_cores),
      my_data_table_native = my_data_table_native(bench_sample, n_cores),
      my_data_table_furrr = my_data_table_furrr(bench_sample, n_cores),
      iterations = 5L, # rerun each combination 5 times
      check = FALSE # usually bench checks if the outcome is 100% identical, but the approaches differ in insubstantial ways
    )
  }
)

If you search in a grid, the easiest way to compare results is to use a plot:

ggplot2::autoplot(results)

benchmark plot

But you can also have a close look at the individual runs:

summary(results)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.

# A tibble: 128 × 8
   expression           n_files n_cores      min   median `itr/sec` mem_alloc
   <bch:expr>             <dbl>   <int> <bch:tm> <bch:tm>     <dbl> <bch:byt>
 1 my_function                2       1 125.47ms 125.71ms      7.46   11.79MB
 2 my_arrow                   2       1  47.39ms  50.93ms     18.3    15.85MB
 3 my_data_table_native       2       1   2.61ms   2.68ms    197.      3.21MB
 4 my_data_table_furrr        2       1  17.95ms  18.27ms     51.2     1.28MB
 5 my_function                5       1 256.08ms 256.81ms      3.87    8.11MB
 6 my_arrow                   5       1  42.06ms   43.8ms     22.2   167.66KB
 7 my_data_table_native       5       1   5.76ms   6.19ms    166.      2.51MB
 8 my_data_table_furrr        5       1  22.44ms   22.6ms     42.4     2.37MB
 9 my_function               10       1 402.88ms 409.96ms      2.44   12.06MB
10 my_arrow                  10       1  47.22ms  47.88ms     20.0   168.19KB
# … with 118 more rows, and 1 more variable: `gc/sec` <dbl>
summary(results, relative = TRUE)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.

# A tibble: 128 × 8
   expression           n_files n_cores    min median `itr/sec` mem_al…¹ gc/se…²
   <bch:expr>             <dbl>   <int>  <dbl>  <dbl>     <dbl>    <dbl>   <dbl>
 1 my_function                2       1  53.5   53.0       323.    72.2      Inf
 2 my_arrow                   2       1  20.2   21.5       791.    97.1      Inf
 3 my_data_table_native       2       1   1.11   1.13     8505.    19.7      Inf
 4 my_data_table_furrr        2       1   7.66   7.71     2216.     7.84     NaN
 5 my_function                5       1 109.   108.        168.    49.7      Inf
 6 my_arrow                   5       1  17.9   18.5       960.     1.00     Inf
 7 my_data_table_native       5       1   2.46   2.61     7166.    15.4      NaN
 8 my_data_table_furrr        5       1   9.57   9.53     1833.    14.5      Inf
 9 my_function               10       1 172.   173.        105.    73.9      Inf
10 my_arrow                  10       1  20.1   20.2       863.     1.01     Inf
# … with 118 more rows, and abbreviated variable names ¹​mem_alloc, ²​`gc/sec`

I like to look at the second one since you can quickly see, for example, that my_data_table_native with two files is roughly 7 times faster than my_data_table_furrr.

What sticks out is that parallelisation only improves the original function when there are a lot of files involved. But parallelisation always hurts performance with data.table, likely because the workers are standing in each other’s way and the bottleneck is the disk read speed. So my conclusion would be to use the normal data.table approach for up to about 50 files of the size used here and otherwise use arrow.

However, your hardware is very different to mine. In the comments you said that parallelisation actually improves the processing speeds, which could point to a CPU bottleneck, while I seem mostly limited by the SSD read speed of my machine. Matt Summersgill also pointed out that data.table is heavy on memory usage for larger files, which could explain why your machine slows down with that generally very fast package. Ultimately you will need to run the benchmark yourself to find out.
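
As a starting point, pointing the same grid at the real files could look like the sketch below. The path is a placeholder, and with 30-56 GB files a single iteration per grid cell is probably all you want; I left out the approaches that are unavailable or known to be slow on your system:

# Placeholder path: list the ten production CSVs
real_files <- list.files("/path/to/data", pattern = "\\.csv$", full.names = TRUE)

results_prod <- bench::press(
  n_cores = c(1, 10, 25, 50),            # worker counts to try
  {
    bench::mark(
      my_function         = my_function(real_files, n_cores),
      my_data_table_furrr = my_data_table_furrr(real_files, n_cores),
      iterations = 1L,                   # each run is expensive at this file size
      check = FALSE
    )
  }
)

ggplot2::autoplot(results_prod)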

JBGruber
  • A note on file size here - in some follow up comments, the original poster indicated that the CSV files are 30-56 GB for a total of 264 GB on disk. That would make the smallest of their files **22,727x** larger than `df_1` used here in this benchmark. To make this relevant to their question, the file size and quantity need to be more aligned. – Matt Summersgill Mar 12 '23 at 14:44
  • @MattSummersgill, correct. Using methods available to me (`sqldf`, `pipe` with `awk`, `readr`, `parallel`, `foreach`, etc...) with production data `readr` was BY FAR the fastest method. `data.table` was probably the worst method. I was unable to test `arrow` on production data. Thank you so much for your contribution. – Annabanana Mar 12 '23 at 17:27
  • @JBGruber, thank you for your thorough analysis. I have ten huge files and processing them in parallel works great. The process runs as long as it takes to process the largest file (which is 50 minutes.) I was hoping to find a method that speeds up the processing of each file by further parallelizing individual file processing. I cannot install `arrow` but `readr` `read_csv_chunked` was by far the fastest method of those available in production. For some reason `data.table` was the slowest for me. – Annabanana Mar 12 '23 at 17:35
  • How much RAM does your system have, and did you monitor RAM usage with data.table? Memory usage during a full read operation typically exceeds the file size on disk, so it's likely that the memory allocation by fread exceeded your RAM available, causing the process to use swap storage instead, which is incredibly slow. – Matt Summersgill Mar 12 '23 at 19:30
  • I see now that I missed some of the discussion on the topic. I don't have similar high performance hardware available and it's starting to sound to me like your bottleneck is quite specific. In my case it seems that the storage is the bottleneck as throwing more cpu on it does not help (rather the opposite). If your files are stored on different physical disks (e.g., in a raid configuration) I can see that processing the files could become the bottleneck. Matt points out that for your file size, memory can be an issue too. I don't think there is a generalisable solution as I first assumed. – JBGruber Mar 12 '23 at 19:37
  • But maybe the benchmarking approach I showed could help? furrr should still be a valid option and bench::press can take more variables in the grid (you could test, for example, how many cores work best). If I have time tomorrow I'll try to turn this into a benchmarking guide instead. – JBGruber Mar 12 '23 at 19:39
  • I re-wrote the answer and added more explanation on how to do your own benchmarking, which I think would be the best option in your situation. – JBGruber Mar 13 '23 at 14:25