
UPDATE: cross-posted as an arrow bug on JIRA (fingers crossed that an arrow developer can help soon).

I am having trouble using arrow in R. First, I saved some data.tables (d), each about 50-60 GB in memory, to parquet files using:

library(arrow); library(dplyr)
d %>% write_dataset(f, format = 'parquet')  # f is the directory name

Then I try to open the dataset, select the relevant variables, and collect the result:

library(tictoc)
tic()
d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()  # myvars is a vector of variable names
toc()

I did this conversion for 3 data.tables, let's say A, B, C (unfortunately, the data is confidential so I can't include it in the example). For set A, I was able to open > select > collect the desired table in about 60 s, obtaining a roughly 10 GB object (after variable selection).

For B and C, the command caused what looks like a memory leak. tic()-toc() returned after 80 s, but the object name (d2) never appeared in RStudio's Environment panel, memory use kept creeping up until it occupied most of the server's available RAM, and then R crashed. Note that the original dataset, without subsetting columns, was smaller than 60 GB and the server has 512 GB of RAM.
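
To check whether that memory is being held by arrow's C++ allocator rather than by R objects, I am planning to run something like the following rough sketch (I am assuming default_memory_pool() reports arrow's allocator usage, with bytes_allocated and backend_name as the fields it exposes):

library(arrow)

pool <- default_memory_pool()
pool$backend_name       # which allocator arrow is using (e.g. mimalloc/jemalloc/system)
pool$bytes_allocated    # bytes currently held by arrow's allocator, not visible to R's gc()

gc(full = TRUE)         # force a full R garbage collection
pool$bytes_allocated    # check whether arrow released anything afterwards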

Any ideas on what could be going on here?

UPDATES (from discussion in the comments):

  1. All files were written/read by the same Windows server.
  2. All files have similar variable types: chr, num, int, integer64, and Date (y-m-d); see the schema sketch below.
  3. Errors are independent of the order in which the files are opened (A first, B first, etc.). Opening A twice in a row does not cause errors.
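
For reference, this is roughly how I am comparing the schemas of the three datasets (a sketch: fA, fB and fC stand in for the three dataset directories, and I am assuming schema objects expose an Equals() method for the comparison):

library(arrow)

sA <- open_dataset(fA)$schema
sB <- open_dataset(fB)$schema
sC <- open_dataset(fC)$schema

print(sA)      # column names and types (chr, num, int, integer64, Date)
print(sB)
print(sC)

sA$Equals(sB)  # TRUE if A and B have identical schemas
sA$Equals(sC)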
  • Not sure, but in case you weren't aware ... 50-60 GB on disk is not the same as 50-60 GB in memory; its footprint in memory tends to explode based on several factors. This is not necessarily "the" issue since you say the host has 512 GB of memory. – r2evans Oct 27 '22 at 14:18
  • Sorry, I misread it three times and then commented. It sounds like there might be a leak, but I don't know how to find/troubleshoot/improve it, sorry. – r2evans Oct 27 '22 at 16:57
  • Some thoughts: (1) does the order in which you read them matter? (2) do the objects' columns have the same (or mostly-the-same) classes? Is any of them "large" (long strings) or complex (non-atomic, e.g., `POSIXlt`)? – r2evans Oct 27 '22 at 17:00
  • So again, does order-of-load matter? How about repeat loading? For instance, load the first one, then reload (from scratch) the first one again. Does this produce the same symptoms? Or is it one dataset directory in particular that causes the problem? Is there anything known about one that is not present in the other, for instance presence of `NaN`, wider range of data, written by a different writer/machine/OS, etc. Which brings up another question ... are they all written by the same version of `arrow`? I know you said you wrote it here, wondering if you simplified the question a little. – r2evans Oct 28 '22 at 11:27
  • (I'm peppering with questions because I'm embarking on a new branch of a project where we're using the parquet directory dataset method as a larger-scale datamart/cache. I've done testing but not long-running or repeat-loading of large-ish tables, so I would not have yet seen any leakage.) – r2evans Oct 28 '22 at 11:29
  • Does (1b) "loading works" mean that repeat loading does or does not leak memory? If not, that suggests that whatever is causing the problem is specific to _that one file_, where the first one you load is not affected by this problem. Is that a correct determination? – r2evans Oct 28 '22 at 20:34
  • @r2evans, I updated the question to include the clarifications and tests you suggested. Perhaps we can delete the comments for now. – LucasMation Oct 29 '22 at 12:03
  • What are the properties of each of the three files? Specifically file size and rows/columns. Can you create `B` with a subset of rows and retest? What I'm wondering is if there's a point with `B` where it switches from "leaks" to "does not leak" by reducing the rows. – r2evans Oct 29 '22 at 14:01
  • A, B, C are all around 66-70 million obs, 100-120 cols. I was testing recreating "B" today (Saturday) when server use is much lower. I was able to write B for a year, then run "open_dataset > select > collect". I am still trying to investigate what caused the problem in the original B file. – LucasMation Oct 29 '22 at 14:06
  • I deleted the comments above related to topics already included in the OP update. One leftover test was: (4) whether the fact that the original data was a data.table and not a tibble could be causing these errors. – LucasMation Oct 29 '22 at 14:08
  • I'm very interested to see if you find causative issues there. I've done my own testing (with smaller data, admittedly) and found no difference. I think the reason `data.table` and `tibble` are even "preserved" is because of a relatively recent change to `arrow::write_parquet` that preserves a frame's attributes in the parquet file. To me this means that the data is the data and it is written identically, the only change was adding a little more metadata to the parquet itself. If it is causing problems, that suggests the _reading_ side of things is broken in `arrow` (or into `data.table`). – r2evans Oct 29 '22 at 14:44
  • More tests. Saved the file again. For a while I thought the problem was solved; I was able to read B. I tried reading only 3 variables this time. Collecting this smaller object, the collect command returned, the Environment panel showed the object, and the console became responsive. But memory use still kept going up. I was able to reduce that by issuing a gc() command. – LucasMation Oct 31 '22 at 01:49
  • Even after removing all the objects created by open_dataset > collect and issuing gc(), R is still using 69 GB of RAM. – LucasMation Oct 31 '22 at 01:51
  • @r2evans, check the JIRA link I added to the OP, and the answer there, with some hints. At this point, I don't know if there is much we can do besides waiting for them to fix these bugs... – LucasMation Oct 31 '22 at 15:09
  • Interesting. Some concerning comments from the dupe JIRA issue 17541: *"R is holding onto memory when it isn't clear to me it should even be able to see the memory"* followed a bit later with *"this only seems to happen with datasets with a particular schema"*. Odd indeed. At least it's been documented and is reproducible, fingers crossed for resolving the gc problem. – r2evans Oct 31 '22 at 15:55

0 Answers