Your situation in terms of hardware and data size seems hard to replicate
for the average user here. Others have already shared some good insights,
but ultimately, without access to your hardware and data, it all stays
speculative. What you can do is run some benchmarks to optimise your
approach, since different tools will work better depending on the
bottleneck of your setup. Here are some tips I found useful when
benchmarking.
test data
Choose a representative set of data. In your original question, the
datasets/files were rather small, which creates unique challenges:
parallelisation adds some overhead, which might kill any gains in speed
for smaller files, while on larger files you might see huge improvements
as the overhead becomes negligible compared to the total run time. I use
the approach from your question to create the test data, but increase the
number of data points.
df_1 <- data.frame(matrix(sample(1:30000), ncol = 3))
df_2 <- data.frame(matrix(sample(1:20000), ncol = 4))
filter_by_df <- data.frame(X1 = 1:100)
write.csv(df_1, "df_1.csv", row.names = FALSE)
write.csv(df_2, "df_2.csv", row.names = FALSE)
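If you want to probe the small-file vs. large-file behaviour described above, you could also parameterise the data generation. This is just a sketch of one possible way; the make_test_files() helper and its scale argument are my own additions, not part of the original question:
make_test_files <- function(scale = 1) {
  # scale = 1 reproduces the file sizes used above
  df_1 <- data.frame(matrix(sample(seq_len(30000 * scale)), ncol = 3))
  df_2 <- data.frame(matrix(sample(seq_len(20000 * scale)), ncol = 4))
  write.csv(df_1, "df_1.csv", row.names = FALSE)
  write.csv(df_2, "df_2.csv", row.names = FALSE)
}
make_test_files(1)   # same sizes as above
make_test_files(100) # roughly 100x larger files, to see the overhead become negligible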
the functions
The benchmarking tool I recommend (bench) compares different functions,
so it makes sense to wrap your code into one function per approach (there
are workarounds, but it's easier this way).
The original approach with furrr for parallelisation:
library(dplyr)
library(readr)
library(furrr)

filter_df <- function(x, pos) {
  # keep rows whose X1 or X2 appears in filter_by_df
  subset(x, X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1)
}

my_function <- function(files, cores) {
  plan(multisession, workers = cores)
  future_map_dfr(files, function(file) {
    suppressMessages(read_csv_chunked(file,
                                      callback = DataFrameCallback$new(filter_df),
                                      progress = FALSE,
                                      chunk_size = 50)) %>%
      as.data.frame() %>%
      distinct()
  })
}
The arrow function by Matt Summersgill:
library(arrow)

my_arrow <- function(files, cores) { # cores are ignored, just for comparability
  DS <- arrow::open_dataset(sources = files, format = "csv")
  DS |>
    filter(X1 %in% filter_by_df[["X1"]] | X2 %in% filter_by_df[["X1"]]) |>
    distinct() |>
    collect()
}
Two slightly altered versions of Sirius' data.table approach:
library(data.table)

my_data_table_native <- function(files, cores) {
  setDTthreads(cores)
  data.table::rbindlist(lapply(files, data.table::fread), fill = TRUE)[
    X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1
  ]
}

my_data_table_furrr <- function(files, cores) {
  setDTthreads(cores)
  plan(multisession, workers = cores)
  future_map_dfr(files, function(file) {
    data.table::fread(file)[X1 %in% filter_by_df$X1 | X2 %in% filter_by_df$X1]
  })
}
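Before benchmarking, it can be worth a quick sanity check that the approaches return comparable results on a small set of files (the benchmark below uses check = FALSE because the return classes and row order differ). A minimal sketch that just eyeballs the row counts:
files <- c("df_1.csv", "df_2.csv")
# the counts should agree, since all approaches filter the same rows
nrow(my_function(files, cores = 2))
nrow(my_arrow(files, cores = 2))
nrow(my_data_table_native(files, cores = 2))
nrow(my_data_table_furrr(files, cores = 2))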
benchmarks
I recommend the excellent bench package for benchmarking, as it can run
your functions in a grid, produces very detailed data, and makes some
really nice plots for comparison. I set it up to test two things:
- what happens with different numbers of files
- what difference parallelisation makes
results <- bench::press(
  n_files = c(2, 5, 10, 25, 50, 100, 200, 1000), # test different numbers of files
  n_cores = 1:4,                                 # test different numbers of cores
  {
    bench_sample <- sample(c("df_1.csv", "df_2.csv"), n_files, replace = TRUE)
    bench::mark(
      my_function = my_function(bench_sample, n_cores),
      my_arrow = my_arrow(bench_sample, n_cores),
      my_data_table_native = my_data_table_native(bench_sample, n_cores),
      my_data_table_furrr = my_data_table_furrr(bench_sample, n_cores),
      iterations = 5L, # rerun each combination 5 times
      check = FALSE    # bench usually checks that outcomes are 100% identical, but the approaches differ in insubstantial ways
    )
  }
)
When you test across a grid like this, the easiest way to compare results
is a plot:
ggplot2::autoplot(results)

But you can also have a close look at the individual runs:
summary(results)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 128 × 8
expression n_files n_cores min median `itr/sec` mem_alloc
<bch:expr> <dbl> <int> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 my_function 2 1 125.47ms 125.71ms 7.46 11.79MB
2 my_arrow 2 1 47.39ms 50.93ms 18.3 15.85MB
3 my_data_table_native 2 1 2.61ms 2.68ms 197. 3.21MB
4 my_data_table_furrr 2 1 17.95ms 18.27ms 51.2 1.28MB
5 my_function 5 1 256.08ms 256.81ms 3.87 8.11MB
6 my_arrow 5 1 42.06ms 43.8ms 22.2 167.66KB
7 my_data_table_native 5 1 5.76ms 6.19ms 166. 2.51MB
8 my_data_table_furrr 5 1 22.44ms 22.6ms 42.4 2.37MB
9 my_function 10 1 402.88ms 409.96ms 2.44 12.06MB
10 my_arrow 10 1 47.22ms 47.88ms 20.0 168.19KB
# … with 118 more rows, and 1 more variable: `gc/sec` <dbl>
summary(results, relative = TRUE)
Warning: Some expressions had a GC in every iteration; so filtering is
disabled.
# A tibble: 128 × 8
expression n_files n_cores min median `itr/sec` mem_al…¹ gc/se…²
<bch:expr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 my_function 2 1 53.5 53.0 323. 72.2 Inf
2 my_arrow 2 1 20.2 21.5 791. 97.1 Inf
3 my_data_table_native 2 1 1.11 1.13 8505. 19.7 Inf
4 my_data_table_furrr 2 1 7.66 7.71 2216. 7.84 NaN
5 my_function 5 1 109. 108. 168. 49.7 Inf
6 my_arrow 5 1 17.9 18.5 960. 1.00 Inf
7 my_data_table_native 5 1 2.46 2.61 7166. 15.4 NaN
8 my_data_table_furrr 5 1 9.57 9.53 1833. 14.5 Inf
9 my_function 10 1 172. 173. 105. 73.9 Inf
10 my_arrow 10 1 20.1 20.2 863. 1.01 Inf
# … with 118 more rows, and abbreviated variable names ¹mem_alloc, ²`gc/sec`
I like to look at the second, relative summary since you can quickly see, for example, that my_data_table_native with two files is roughly 7 times faster than my_data_table_furrr.
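If you prefer numbers over the plot, you can also summarise the grid programmatically, e.g. by picking the fastest approach per combination. A small sketch, assuming the column names shown in the summary output above:
library(dplyr)
summary(results) |>
  mutate(median_s = as.numeric(median)) |> # bench_time is stored as seconds
  group_by(n_files, n_cores) |>
  slice_min(median_s, n = 1) |>            # fastest approach per grid cell
  select(n_files, n_cores, expression, median)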
What sticks out is that parallelisation only improves the original
function when a lot of files are involved. But parallelisation always
hurts performance with data.table, likely because the workers stand in
each other's way and the bottleneck is the disk read speed. So my
conclusion would be to use the plain data.table approach for up to about
50 files of the size used here, and otherwise use arrow.
However, your hardware is very different from mine. In the comments you
said that parallelisation actually improves your processing speeds, which
could point to a CPU bottleneck, while I seem mostly limited by the SSD
read speed of my machine. Matt Summersgill also pointed out that
data.table is heavy on memory usage for larger files, which could explain
why your machine slows down with that generally very fast package.
Ultimately you will need to run the benchmark yourself to find out.