UPDATE: cross-posted as an arrow bug on JIRA (fingers crossed for arrow developer help soon)
I am having trouble using arrow in R. First, I saved some data.tables (d) that were about 50-60 GB in memory to a parquet file using:
d %>% write_dataset(f, format='parquet') # f is the directory name
Then I try to open the dataset, select the relevant variables, and collect the result:
tic()
d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect #myvars is a vector of variable names
toc()
I did this conversion for 3 data.tables, let's say A, B, C (unfortunately, the data is confidential, so I can't include it in the example). For set A, I was able to open > select > collect the desired table in about 60s, obtaining a 10 GB object (after variable selection).
For B and C, the command caused a memory leak. tic()-toc() returned after 80s, but the object name (d2) never appeared in RStudio's "Environment" panel, and memory use kept creeping up until it occupied most of the server's available RAM, at which point R crashed. Note that the original dataset, without subsetting columns, was smaller than 60 GB and the server had 512 GB.
Any ideas on what could be going on here?
UPDATES (from discussion in the comments):
- All files were written and read on the same Windows server
- All files have similar types of variables: chr, num, int, integer64 and Date(y-m-d)
- Errors are independent of the order in which files are opened (A first, B first, etc). Opening A twice in a row does not cause errors.
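
Since the real data is confidential, here is a minimal sketch of the same write > open > select > collect workflow on synthetic data with the column types listed above. The table size, the column names, the directory name and the contents of myvars are all made up for illustration, and I have not confirmed that such a small table reproduces the problem.

```r
library(arrow)
library(dplyr)
library(data.table)
library(bit64)

set.seed(1)
n <- 1e5  # hypothetical size; the real tables are far larger

# Synthetic table with the same column types as the real data:
# integer64, chr, num, int and Date (all column names are made up)
d <- data.table(
  id   = as.integer64(seq_len(n)),
  name = sample(letters, n, replace = TRUE),
  x    = rnorm(n),
  k    = sample.int(100L, n, replace = TRUE),
  day  = as.Date("2020-01-01") + sample.int(365L, n, replace = TRUE)
)

f <- file.path(tempdir(), "d_parquet")  # hypothetical directory name
d %>% write_dataset(f, format = "parquet")

myvars <- c("id", "x", "day")  # hypothetical variable selection
d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()

# Size of the collected result, to compare against the memory R actually uses
print(dim(d2))
print(object.size(d2), units = "MB")
```

Scaling n up, or adding columns one type at a time, might help show whether a particular column type is involved.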