
I have a dataset of about 17 GB. Because of its size, data manipulations sometimes run out of RAM and break down. I am trying to use arrow, a package developed for manipulating data larger than available memory. After reading the CSV file with read_csv_arrow, 24 GB of memory is in use. In comparison, when I read the same CSV file the regular way it requires only 17 GB and runs much faster.

library(dplyr)
library(arrow)

Sales_Arrow <- read_csv_arrow("Budget_2023 Sales All.csv", as_data_frame = FALSE)

The memory usage is:

| Statistics          | Memory     |
|:--------------------|------------|
| Used by session     | 15,288 MiB |
| Used by system      | 8,351 MiB  |
| Free system memory  | 8,924 MiB  |

The next step, filtering the data and doing some manipulations in RAM, takes all the free memory

Current_Price <- Sales_Arrow %>%
  filter(SBPHYP >= 202101) %>%
  collect() %>%
  group_by(`Cust-Item-Loc`) %>%
  arrange(desc(SBINDT)) %>%
  slice(1)

and fails with the following error message

Error: cannot allocate vector of size 64.0 Mb
Error during wrapup: cannot allocate vector of size 64.0 Mb
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

Please advise

grislepak
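For what it's worth, here is a minimal sketch of how this reduction could stay inside Arrow, so that `collect()` only materialises the final result. It is not the code from the question: it assumes the goal is "the most recent SBINDT row per `Cust-Item-Loc`", and, because `slice()` is not supported on Arrow queries, it substitutes a `summarise()` plus a join (ties on SBINDT would return more than one row per group, where `slice(1)` kept exactly one).

library(dplyr)
library(arrow)

# Latest SBINDT per Cust-Item-Loc, computed by the Arrow engine
latest <- Sales_Arrow %>%
  filter(SBPHYP >= 202101) %>%
  group_by(`Cust-Item-Loc`) %>%
  summarise(SBINDT = max(SBINDT))

# Join back to pick up the matching rows; only this reduced result
# is pulled into R's memory
Current_Price <- Sales_Arrow %>%
  filter(SBPHYP >= 202101) %>%
  inner_join(latest, by = c("Cust-Item-Loc", "SBINDT")) %>%
  collect()

Combined with `open_dataset()` instead of `read_csv_arrow()` (see the comments below), the full file never has to sit in RAM at all.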
  • https://francoismichonneau.net/2022/10/import-big-csv/. I think you want to use `open_dataset()` so that the file is not loaded into memory. – Jon Spring Oct 14 '22 at 18:59
  • Thank you. But when I use `open_dataset()` and try to do anything (filter, write_dataset, etc.) I get the error "Invalid: In CSV column #4: Row #444155: CSV conversion error to int64: invalid value '5e+06'" – grislepak Oct 14 '22 at 19:40
  • It sounds like row 444,155 of the source CSV has that number in scientific notation. Can you fix it upstream? Is the error raised at the `open_dataset` step or subsequently? I wonder if it is fixed if you try `open_dataset(YOURFILE, partitioning = schema(YOUR_COLUMN_4 = string()))` so that it coerces that column into string first, for your conversion to numeric downstream, keeping in mind some values look like "5e+06". – Jon Spring Oct 14 '22 at 19:44
  • Related, an answer about coercing data types in `arrow::open_dataset`: https://stackoverflow.com/a/71305598/6851825 – Jon Spring Oct 14 '22 at 19:50
  • I got this error for `Sales_Arrow %>% group_by(SBPHYP) %>% write_dataset(path)` – grislepak Oct 14 '22 at 19:59
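To make the suggestion in the comments concrete, here is a sketch under two assumptions: the problem column is called `Price` here (a hypothetical stand-in for whatever CSV column #4 is actually named), and your arrow version accepts `col_types` in `open_dataset()`. The idea is to read that column as string so values like '5e+06' survive ingestion, then cast it back to numeric lazily:

library(dplyr)
library(arrow)

# 'Price' is a hypothetical name; use the real name of CSV column #4
Sales_DS <- open_dataset(
  "Budget_2023 Sales All.csv",
  format = "csv",
  col_types = schema(Price = string())  # override the guessed int64 type
)

# Cast back to numeric inside the Arrow query ('5e+06' parses as a double),
# then write a partitioned dataset without collecting anything into RAM
Sales_DS %>%
  mutate(Price = as.numeric(Price)) %>%
  group_by(SBPHYP) %>%
  write_dataset("sales_by_SBPHYP")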
