I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:
library(dplyr)
set.seed(-1)
mtcars %>% slice_sample(n = 3)
#                mpg cyl  disp  hp drat    wt qsec vs am gear carb
# AMC Javelin   15.2   8 304.0 150 3.15 3.435 17.3  0  0    3    2
# Ferrari Dino  19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
# Merc 240D     24.4   4 146.7  62 3.69 3.190 20.0  1  0    4    2
But my dataset is stored as a Parquet file. As an example, I'll create a Parquet dataset from mtcars:
library(arrow)
# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")
open_dataset("~/mtcars") %>%
  slice_sample(n = 3) %>%
  collect()
# Error in UseMethod("slice_sample") :
# no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Clearly, slice_sample isn't implemented for Arrow Dataset objects, and neither is slice:
n <- open_dataset("~/mtcars") %>% nrow()
subsample <- sample(1:n, 3)
open_dataset("~/mtcars") %>%
  slice(subsample) %>%
  collect()
# Error in UseMethod("slice") :
# no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
Now, I know filter is implemented, so I tried that:
open_dataset("~/mtcars") %>%
  filter(row_number() %in% subsample) %>%
  collect()
# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.
(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE, and use that in filter.)
This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is enormous, so much so that collecting it crashes my session.
Question: is there a way to randomly subsample a Parquet file before loading it with collect?
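To make concrete what I'm after, here is a minimal sketch of one direction that might work. It is untested on my real data and is not a confirmed solution. It assumes the dataset can be streamed one record batch at a time through arrow's Scanner and RecordBatchReader, so that only a single batch is ever converted to an R data frame. The helper name sample_rows_streaming is hypothetical (it is not part of arrow), and the approach still reads every batch once, so it bounds memory rather than I/O.

```r
library(arrow)

# Hypothetical helper (not part of arrow): draw n rows uniformly at random
# while holding only one record batch in memory at a time.
sample_rows_streaming <- function(path, n) {
  ds <- open_dataset(path)
  total <- nrow(ds)

  # Pick the global (1-based) row positions up front, as in the attempts above.
  idx <- sort(sample.int(total, n))

  # Stream the dataset batch by batch instead of calling collect().
  reader <- Scanner$create(ds)$ToRecordBatchReader()
  offset <- 0
  pieces <- list()

  # read_next_batch() returns NULL once the reader is exhausted.
  while (!is.null(batch <- reader$read_next_batch())) {
    # Translate global positions that fall inside this batch to local ones.
    in_batch <- idx > offset & idx <= offset + batch$num_rows
    local <- idx[in_batch] - offset
    if (length(local) > 0) {
      pieces[[length(pieces) + 1]] <- as.data.frame(batch)[local, , drop = FALSE]
    }
    offset <- offset + batch$num_rows
  }

  # Combine the sampled rows from all batches into one data frame.
  do.call(rbind, pieces)
}
```

Under these assumptions, sample_rows_streaming("~/mtcars", 3) should return three rows. Because the positions are drawn at random before scanning, the sample should stay uniform even if the scanner delivers batches in a different order than they are stored on disk.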