
I want to randomly sample a dataset. If I already have that dataset loaded, I can do something like this:

library(dplyr)

set.seed(-1)

mtcars %>% slice_sample(n = 3)
#               mpg cyl  disp  hp drat    wt qsec vs am gear carb
# AMC Javelin  15.2   8 304.0 150 3.15 3.435 17.3  0  0    3    2
# Ferrari Dino 19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
# Merc 240D    24.4   4 146.7  62 3.69 3.190 20.0  1  0    4    2

But my dataset is stored as a parquet file. As an example, I'll create a parquet from mtcars:

library(arrow)

# Create parquet file
write_dataset(mtcars, "~/mtcars", format = "parquet")

open_dataset("~/mtcars") %>% 
  slice_sample(n = 3) %>% 
  collect()
  
# Error in UseMethod("slice_sample") : 
#   no applicable method for 'slice_sample' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"

Clearly, slice_sample isn't implemented for parquet files and neither is slice:

open_dataset("~/mtcars") %>% nrow() -> n

subsample <- sample(1:n, 3)

open_dataset("~/mtcars") %>% 
  slice(subsample) %>% 
  collect()

# Error in UseMethod("slice") : 
#   no applicable method for 'slice' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"

Now, I know filter is implemented, so I tried that:

open_dataset("~/mtcars") %>% 
  filter(row_number() %in% subsample) %>% 
  collect()

# Error: Filter expression not supported for Arrow Datasets: row_number() %in% subsample
# Call collect() first to pull data into R.

(This also doesn't work if I create a filtering vector first, e.g., foo <- rep(FALSE, n); foo[subsample] <- TRUE and use that in filter.)

This error offers some helpful advice, though: collect the data and then subsample. The issue is that the file is ginormous. So much so that it crashes my session.
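For completeness, this is what that suggested approach looks like. It's only a sketch: it works on the toy mtcars example, but it is exactly the thing that runs out of memory on the real file.

# Pull everything into R first, then sample -- fine for mtcars,
# not an option for a file that doesn't fit in memory
open_dataset("~/mtcars") %>%
  collect() %>%
  slice_sample(n = 3)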


Question: is there a way to randomly subsample a parquet file before loading it with collect?

  • I just added the [tag:parquet] tag, hoping it'll bring in relevant non-R users. I'm learning parquet as well, not sure I can answer but I've run into similar areas where parquet filtering is not implemented in some way. – r2evans Sep 09 '22 at 15:06
  • Are you able to modify the method importing data into parquet such that the row number is within the data? If so, then `filter(rownum %in% subsample)` should work. – r2evans Sep 09 '22 at 15:10
  • I'm not really an R user enough to formulate the answer real quick but a common hack is to add a column of random data (mutate can add a column but I don't recall if r-arrow has a `random` function added) and then filter like `rand_col < 0.05` (to get 5% of the data) but you will have an inexact sample size. Another possibility is to modulo the row number (e.g. rownum modulo 20 == 1) but that won't exactly be random. – Pace Sep 09 '22 at 15:57
  • Thanks for the replies & suggestions! Based on them, I tried a couple of approaches: add a column of random data (e.g., `mutate(row = runif(n))`) and a column of row numbers (e.g., `mutate(row =1:n)`) before the `collect`. Both wouldn't even allow me to add the column for some reason. It seems that I could only add columns when it altered an existing column (e.g., `mutate(foo = hp * 2)`) or was a constant value (e.g., `mutate(foo = 6)`). – Dan Sep 09 '22 at 16:24
  • @Pace I also tried `filter(row_number() %% 20 == 1)`, but got `Error: Filter expression not supported for Arrow Datasets`. – Dan Sep 09 '22 at 16:29
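Following up on these comments, here is a rough sketch of r2evans's first suggestion. It assumes the parquet can be regenerated; the path ~/mtcars_rownum and the column name rownum are just illustrative choices.

# Bake a row-number column into the data before writing the parquet,
# then filter on it; %in% filters are supported on Arrow datasets,
# so the subsetting happens before the data is pulled into R
mtcars %>%
  mutate(rownum = row_number()) %>%
  write_dataset("~/mtcars_rownum", format = "parquet")

subsample <- sample(nrow(mtcars), 3)

open_dataset("~/mtcars_rownum") %>%
  filter(rownum %in% subsample) %>%
  collect()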

1 Answer


It turns out that there is an example in the documentation that pretty much fulfils my goal. That example is a smidge dated, as it uses sample_frac (which has since been superseded) rather than slice_sample, but the general principle holds, so I've updated it here. As I don't know how many batches there will be, I show how it can be done with proportions, as Pace suggested, instead of pulling a fixed number of rows.

One issue with this approach is that (as far as I understand) it still requires the entire dataset to be read in; it just does so in batches rather than in one go.

open_dataset("~/mtcars") %>%
  map_batches(~ as_record_batch(slice_sample(as.data.frame(.), prop = 0.1))) %>%
  collect()

#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 2 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
# 3 15.8   8  351 264 4.22 3.170 14.50  0  1    5    4
  • Yes. That will require loading all data from the disk. Loading a sampling of rows from disk is a bit of a challenging problem because columns are often persisted in indivisible chunks (usually due to compression, encodings, etc). Depending on your use case you can often get away with a random sample of batches instead of a random sample of rows. Many file formats (e.g. parquet, arrow) support being stored and retrieved in batches (row groups, record batches). – Pace Sep 09 '22 at 18:51
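A rough sketch of that batch-level idea with arrow in R (my own assumption about how one might apply it, not something from the answer above): treat each file in the dataset as a batch and open a random subset of the files. The sample size is then only approximate, and it is only as random as the way rows were split across files.

# List the parquet files that make up the dataset and open ~10% of them at random.
# With the toy mtcars example there is only one file, so this only pays off when
# the dataset was written as many files / row groups.
files <- list.files("~/mtcars", pattern = "\\.parquet$",
                    recursive = TRUE, full.names = TRUE)
open_dataset(sample(files, size = max(1, length(files) %/% 10))) %>%
  collect()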