R with package arrow sums results in NA at random

Question

When using package arrow with R and dplyr, summarizing a variable results in NA at random, while there is no NA in the data.

Example:

library(arrow)
library(dplyr)

td <- tempdir()
tzip <- file.path(td, "reprex.zip")
download.file("https://drive.google.com/uc?export=download&id=1-KefpiALDtUg0PrCUgpMAaE0903jVWWm", destfile = tzip)
unzip(tzip, exdir = td)
tlink <- file.path(td, "co_ano_mes=1997-01-01")

arrow_dset <- arrow::open_dataset(
     tlink,
     format = "parquet"
)

arrow_dset%>%
count(fluxo, vl_frete_miss=is.na(vl_frete)) %>% 
  collect()

fluxo vl_frete_miss     n   
<chr> <lgl>         <int>  
exp   TRUE          35546  
imp   FALSE         42332

replicate(10, arrow_dset %>% 
  group_by(fluxo) %>% 
  summarise(vl_frete = sum(vl_frete)) %>% 
  collect %>% 
  filter(fluxo=="imp") %>%
  pull(vl_frete))

 [1]        NA        NA 154149785        NA 154149785        NA 154149785 154149785
 [9] 154149785 154149785

I can't replicate, I get 10 NA values each time. The variable `vl_frete` appears to have > 35k NA values. If you want to `sum` without taking NA values into account, use `sum(vl_frete, na.rm = TRUE)`. — Axeman, Jun 13 '23 at 16:26
Did you create the parquet files yourself? I'd check that the original data is of the expected data type. That might be the reason you end up with NA — fmic_, Jun 13 '23 at 16:33
@Axeman included a count of missing to show there is no missing vl_frete when fluxo=="imp" — Eduardo Leoni, Jun 13 '23 at 16:41
@fmic_ Yes, I did. The weird part is it giving NAs at random, i.e. each run gives a different result! — Eduardo Leoni, Jun 13 '23 at 16:42
I don't know why, but results depend on when you collect. Compare `arrow_dset %>% group_by(fluxo) %>% summarise(n_NA = sum(is.na(vl_frete)), mean = mean(vl_frete)) %>% collect()`, with `arrow_dset %>% collect() %>% group_by(fluxo) %>% summarise(n_NA = sum(is.na(vl_frete)), mean = mean(vl_frete))` — Axeman, Jun 13 '23 at 16:51
@Axeman "I don't know why, but results depend on when you collect. " yup! — Eduardo Leoni, Jun 13 '23 at 16:53
I don't know much about arrow, but this seems like a bug to me. — Axeman, Jun 13 '23 at 17:02
@Axeman I filed a bug report https://github.com/apache/arrow/issues/36053. Let´s see what happens. Thank you for your help! — Eduardo Leoni, Jun 13 '23 at 17:39
after setting the seed with ```set.seed(123)```, the results are always the same. Does any of this involve random number generation Eduardo? Because the documentation says the ```replicate``` function is "usually involve[s] random number generation" — Mark, Jun 16 '23 at 00:13
also, swapping ```summarise(vl_frete=sum(vl_frete)) %>%``` and ```collect %>%``` results in it giving the same answer every time too. — Mark, Jun 16 '23 at 00:34
Not an answer to the bug, but notably if you switch the order of operations here to be more efficient, you don't run into the bug. Specifically, `filter` first so as to work with as few observations as possible from the files, then `group_by` and `summarise`, then `collect` last so all the previous happens via Arrow and not R. — alistaire, Jun 20 '23 at 19:29
@alistaire the context was summarizing by "fluxo", so it wouldn't make sense to filter first. I wasn't sure it was a bug, so I posted here. It was helpful to get some confirmation. Thanks to all that looked into the problem! — Eduardo Leoni, Jun 21 '23 at 02:08
If you're filtering to one value of `fluxo` later, there's no point in calculating the summaries for all the other groups. — alistaire, Jun 22 '23 at 15:51

score -1 · Answer 1 · answered Jun 16 '23 at 10:52

-1

Like mentioned in the comments, the order of collect does somehow matter. It seems that you should first collect and then use your calculations like this:

library(arrow)
library(dplyr)

arrow_dset%>%
  count(fluxo, vl_frete_miss=is.na(vl_frete))%>%collect()
#> # A tibble: 2 × 3
#>   fluxo vl_frete_miss     n
#>   <chr> <lgl>         <int>
#> 1 exp   TRUE          35546
#> 2 imp   FALSE         42332
  
set.seed(123)
replicate(10,  arrow_dset%>% 
            group_by(fluxo) %>% 
            summarise(vl_frete=sum(vl_frete)) %>% 
            collect %>% 
            filter(fluxo=="imp") %>%
            pull(vl_frete))
#>  [1] 154149785 154149785        NA 154149785 154149785        NA 154149785
#>  [8] 154149785 154149785 154149785

set.seed(123)
replicate(10,  arrow_dset %>% 
            collect %>% 
            group_by(fluxo) %>% 
            summarise(vl_frete=sum(vl_frete)) %>%
            filter(fluxo=="imp") %>%
            pull(vl_frete))
#>  [1] 154149785 154149785 154149785 154149785 154149785 154149785 154149785
#>  [8] 154149785 154149785 154149785

^{Created on 2023-06-16 with reprex v2.0.2}

answered Jun 16 '23 at 10:52

Quinten

35,235
5
20
53

1

I was unsure if I should have answered the question. I didn't because I thought asking them to set a seed or move around the collect() statement wasn't really answering their question- namely, why is happening in the first place? Or is it okay to simply find a way around their problem? New here, so idk! – Mark Jun 16 '23 at 22:49
This doesn't mean much; if you `collect` then you're just operating on a normal data.frame. This is obviously an Arrow issue. – alistaire Jun 20 '23 at 19:31
Hi @alistaire, Thank you! May I ask you why this doesn’t mean much. It seems that the order of collecting the data does effect the output, right? – Quinten Jun 20 '23 at 19:35
1

Right, but collecting immediately means you're just reading the parquet files and using dplyr on the resulting data.frame, not building an executing an Arrow query (using Acero) on the files. This is an Acero bug, not a parquet reading or dplyr one. – alistaire Jun 20 '23 at 23:42

R with package arrow sums results in NA at random

1 Answers1