I'm trying to add an index to a dataset which is too large to fit in RAM. The tidyverse way of adding an index would be:

library(tidyverse)
df <- mtcars
df |>
  mutate(row_id = seq_along(cyl)) # any column name in the df

The dplyr backend for Arrow doesn't support this operation. How else can I do it?

Mark
David Budzynski
  • Do you need to use `nrow(column name)`? `mutate(row_id = row_number())` or `mutate(row_id = 1:n())` would also work. – Park Jun 27 '22 at 07:28
  • It won't work until I call collect(), and I want to avoid reading the data into memory because it's too big. Your example fails: `dat |> mutate(index = row_number())` gives `Error: Expression row_number() not supported in Arrow. Call collect() first to pull data into R.` – David Budzynski Jun 27 '22 at 07:33
  • Out of interest, why do you need an index? I don't think this is possible, as there are no bindings for `row_number()` at the moment, but there may be an alternative way to achieve the thing you're trying to accomplish. – thisisnic Jun 27 '22 at 15:38
  • I want to use an index as a unique ID for each row. I have around 55 CSV files which are over 80 GB in total size, and I want a unique ID across all of those 55 files without reading them all into memory. – David Budzynski Jun 28 '22 at 10:17
  • Can't you just do it one file at a time? Also, why are you keeping them as CSV instead of Parquet? – Dean MacGregor Aug 04 '22 at 11:13
  • Because I received these files as csv... Happy to see a reproducible example of your suggested solution! – David Budzynski Aug 04 '22 at 18:58
  • David, I had a read through the documentation of `arrow`, the cheatsheet, and the cookbook. They all point in the same direction: you use the `dplyr`-like functions, then always end with `collect()`. To quote the docs, "No actual computations are performed until collect() (or the related compute() function) is called" – Mark Jun 21 '23 at 10:36
  • So there are two possibilities that may help. One is compute(): "compute() stores results in a remote temporary table" (https://dplyr.tidyverse.org/reference/compute.html), so it seems like it maybe doesn't load all the data into memory. The other is to break your data into further chunks and work on those. – Mark Jun 21 '23 at 10:47
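
A minimal sketch of the one-file-at-a-time idea raised in the comments, not a tested solution for the real 80 GB of data. It assumes each individual CSV fits in memory even though the full set does not. The directory names, file names, and the three small CSVs built from `mtcars` are hypothetical stand-ins for the 55 real files. A running `offset` carries the row count from one file to the next, so `row_id` stays unique across all files:

```r
library(arrow)
library(dplyr)

# Hypothetical setup: three small CSVs built from mtcars stand in for the real files.
csv_dir <- file.path(tempdir(), "csv_in")
out_dir <- file.path(tempdir(), "parquet_out")
dir.create(csv_dir, showWarnings = FALSE)
dir.create(out_dir, showWarnings = FALSE)
parts <- split(mtcars, rep(1:3, length.out = nrow(mtcars)))
for (i in seq_along(parts)) {
  write.csv(parts[[i]], file.path(csv_dir, sprintf("part_%02d.csv", i)),
            row.names = FALSE)
}

csv_files <- sort(list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE))

# Number of rows that already have an ID. Kept as a double because R integers
# overflow past about 2.1 billion, while doubles are exact up to 2^53.
offset <- 0

for (f in csv_files) {
  chunk <- read_csv_arrow(f)  # reads only this one file into memory
  chunk <- mutate(chunk, row_id = offset + row_number())
  write_parquet(chunk, file.path(out_dir, sub("\\.csv$", ".parquet", basename(f))))
  offset <- offset + nrow(chunk)
}

# The Parquet files can now be queried lazily as a single dataset.
ds <- open_dataset(out_dir)
ids <- ds |> select(row_id) |> collect()
```

Once the files are written, `row_id` is an ordinary column, so `filter()` and `select()` on `ds` work on it without calling `collect()` first. The IDs follow the sorted file order, so the same set of files should give the same IDs on a re-run.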

0 Answers