How to add a column with an index to an Apache Arrow dataset in R?

Question

I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be:

library(tidyverse)
df = mtcars
df |>
  mutate(row_id = 1:nrow(cyl)) # any column name in the df

The dplyr backend for Arrow doesn't support this operation. How else can I do it?

Do you need to use `nrow(column name)`? `mutate(row_id = row_number())` or `mutate(row_id = 1:n())` will also work. — Park, Jun 27 '22 at 07:28
It won't work until I call collect(), and I want to avoid reading the data into memory since it's too big. Your example fails: `dat |> mutate(index = row_number())` gives "Error: Expression row_number() not supported in Arrow. Call collect() first to pull data into R." — David Budzynski, Jun 27 '22 at 07:33
Out of interest, why do you need an index? I don't think this is possible, as there are no bindings for `row_number()` at the moment, but there may be an alternative way to achieve the thing you're trying to accomplish. — thisisnic, Jun 27 '22 at 15:38
I want to use an index as a unique ID for each row. I have around 55 CSV files which are over 80 GB in total size, and I want a unique ID across all of those 55 files without reading them all into memory. — David Budzynski, Jun 28 '22 at 10:17
Can't you just do it one file at a time? Also, why are you keeping them as CSV instead of Parquet? — Dean MacGregor, Aug 04 '22 at 11:13
Because I received these files as csv... Happy to see a reproducible example of your suggested solution! — David Budzynski, Aug 04 '22 at 18:58
David, I read through the `arrow` documentation, the cheatsheet, and the cookbook. They all point the same way: you use the `dplyr`-like functions and always end with `collect()`. To quote the docs, "No actual computations are performed until collect() (or the related compute() function) is called." — Mark, Jun 21 '23 at 10:36
So two options may be worth trying. One is compute(): "compute() stores results in a remote temporary table" (https://dplyr.tidyverse.org/reference/compute.html), so it may not load all the data into memory, though I'm not sure. The other is to break your data into smaller chunks and work on those. — Mark, Jun 21 '23 at 10:47
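
The "one file at a time" and chunking suggestions from the comments could look roughly like the sketch below. It is not a confirmed answer. It assumes each individual CSV fits in memory even though the full set does not, and it converts the files to Parquet along the way. The directory names, file names, and the mtcars stand-in data are hypothetical. A running row offset keeps the IDs unique across files.

```r
library(arrow)
library(dplyr)

# Hypothetical demo setup: three small CSV files stand in for the real 55 files.
in_dir  <- file.path(tempdir(), "csv_in")
out_dir <- file.path(tempdir(), "parquet_out")
unlink(c(in_dir, out_dir), recursive = TRUE)
dir.create(in_dir)
dir.create(out_dir)
parts <- split(mtcars, rep(1:3, length.out = nrow(mtcars)))
for (i in seq_along(parts)) {
  write.csv(parts[[i]], file.path(in_dir, sprintf("part-%02d.csv", i)),
            row.names = FALSE)
}

# Process the files in a fixed order so the IDs are reproducible.
csv_files <- sort(list.files(in_dir, pattern = "\\.csv$", full.names = TRUE))
offset <- 0  # rows written so far; kept as a double to avoid integer overflow

for (f in csv_files) {
  chunk <- read_csv_arrow(f)                      # only this file is in memory
  chunk$row_id <- offset + seq_len(nrow(chunk))   # IDs continue from earlier files
  out_file <- sub("\\.csv$", ".parquet", basename(f))
  write_parquet(chunk, file.path(out_dir, out_file))
  offset <- offset + nrow(chunk)
}

# The Parquet files can now be opened lazily as one dataset that has row_id.
ds  <- open_dataset(out_dir)
ids <- (ds |> select(row_id) |> collect())$row_id
```

The IDs depend on the file order, so the sorted file list matters if the run ever has to be repeated. If a single file is itself too large for memory, this approach would need to read that file in smaller batches instead.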


0 Answers