2

When using package arrow with R and dplyr, summarizing a variable results in NA at random, while there is no NA in the data.

Example:

library(arrow)
library(dplyr)

td <- tempdir()
tzip <- file.path(td, "reprex.zip")
download.file("https://drive.google.com/uc?export=download&id=1-KefpiALDtUg0PrCUgpMAaE0903jVWWm", destfile = tzip)
unzip(tzip, exdir = td)
tlink <- file.path(td, "co_ano_mes=1997-01-01")

arrow_dset <- arrow::open_dataset(
     tlink,
     format = "parquet"
)

arrow_dset%>%
count(fluxo, vl_frete_miss=is.na(vl_frete)) %>% 
  collect()
fluxo vl_frete_miss     n   
<chr> <lgl>         <int>  
exp   TRUE          35546  
imp   FALSE         42332
replicate(10, arrow_dset %>% 
  group_by(fluxo) %>% 
  summarise(vl_frete = sum(vl_frete)) %>% 
  collect %>% 
  filter(fluxo=="imp") %>%
  pull(vl_frete))
 [1]        NA        NA 154149785        NA 154149785        NA 154149785 154149785
 [9] 154149785 154149785
alistaire
  • 42,459
  • 4
  • 77
  • 117
Eduardo Leoni
  • 8,991
  • 6
  • 42
  • 49
  • 1
    I can't replicate, I get 10 NA values each time. The variable `vl_frete` appears to have > 35k NA values. If you want to `sum` without taking NA values into account, use `sum(vl_frete, na.rm = TRUE)`. – Axeman Jun 13 '23 at 16:26
  • Did you create the parquet files yourself? I'd check that the original data is of the expected data type. That might be the reason you end up with NA – fmic_ Jun 13 '23 at 16:33
  • @Axeman included a count of missing to show there is no missing vl_frete when fluxo=="imp" – Eduardo Leoni Jun 13 '23 at 16:41
  • @fmic_ Yes, I did. The weird part is it giving NAs at random, i.e. each run gives a different result! – Eduardo Leoni Jun 13 '23 at 16:42
  • 1
    I don't know why, but results depend on when you collect. Compare `arrow_dset %>% group_by(fluxo) %>% summarise(n_NA = sum(is.na(vl_frete)), mean = mean(vl_frete)) %>% collect()`, with `arrow_dset %>% collect() %>% group_by(fluxo) %>% summarise(n_NA = sum(is.na(vl_frete)), mean = mean(vl_frete))` – Axeman Jun 13 '23 at 16:51
  • @Axeman "I don't know why, but results depend on when you collect. " yup! – Eduardo Leoni Jun 13 '23 at 16:53
  • 1
    I don't know much about arrow, but this seems like a bug to me. – Axeman Jun 13 '23 at 17:02
  • 1
    @Axeman I filed a bug report https://github.com/apache/arrow/issues/36053. Let´s see what happens. Thank you for your help! – Eduardo Leoni Jun 13 '23 at 17:39
  • 1
    after setting the seed with ```set.seed(123)```, the results are always the same. Does any of this involve random number generation Eduardo? Because the documentation says the ```replicate``` function is "usually involve[s] random number generation" – Mark Jun 16 '23 at 00:13
  • 1
    also, swapping ```summarise(vl_frete=sum(vl_frete)) %>%``` and ```collect %>%``` results in it giving the same answer every time too. – Mark Jun 16 '23 at 00:34
  • Not an answer to the bug, but notably if you switch the order of operations here to be more efficient, you don't run into the bug. Specifically, `filter` first so as to work with as few observations as possible from the files, then `group_by` and `summarise`, then `collect` last so all the previous happens via Arrow and not R. – alistaire Jun 20 '23 at 19:29
  • @alistaire the context was summarizing by "fluxo", so it wouldn't make sense to filter first. I wasn't sure it was a bug, so I posted here. It was helpful to get some confirmation. Thanks to all that looked into the problem! – Eduardo Leoni Jun 21 '23 at 02:08
  • If you're filtering to one value of `fluxo` later, there's no point in calculating the summaries for all the other groups. – alistaire Jun 22 '23 at 15:51

1 Answers1

-1

Like mentioned in the comments, the order of collect does somehow matter. It seems that you should first collect and then use your calculations like this:

library(arrow)
library(dplyr)

arrow_dset%>%
  count(fluxo, vl_frete_miss=is.na(vl_frete))%>%collect()
#> # A tibble: 2 × 3
#>   fluxo vl_frete_miss     n
#>   <chr> <lgl>         <int>
#> 1 exp   TRUE          35546
#> 2 imp   FALSE         42332
  
set.seed(123)
replicate(10,  arrow_dset%>% 
            group_by(fluxo) %>% 
            summarise(vl_frete=sum(vl_frete)) %>% 
            collect %>% 
            filter(fluxo=="imp") %>%
            pull(vl_frete))
#>  [1] 154149785 154149785        NA 154149785 154149785        NA 154149785
#>  [8] 154149785 154149785 154149785

set.seed(123)
replicate(10,  arrow_dset %>% 
            collect %>% 
            group_by(fluxo) %>% 
            summarise(vl_frete=sum(vl_frete)) %>%
            filter(fluxo=="imp") %>%
            pull(vl_frete))
#>  [1] 154149785 154149785 154149785 154149785 154149785 154149785 154149785
#>  [8] 154149785 154149785 154149785

Created on 2023-06-16 with reprex v2.0.2

Quinten
  • 35,235
  • 5
  • 20
  • 53
  • 1
    I was unsure if I should have answered the question. I didn't because I thought asking them to set a seed or move around the collect() statement wasn't really answering their question- namely, why is happening in the first place? Or is it okay to simply find a way around their problem? New here, so idk! – Mark Jun 16 '23 at 22:49
  • This doesn't mean much; if you `collect` then you're just operating on a normal data.frame. This is obviously an Arrow issue. – alistaire Jun 20 '23 at 19:31
  • Hi @alistaire, Thank you! May I ask you why this doesn’t mean much. It seems that the order of collecting the data does effect the output, right? – Quinten Jun 20 '23 at 19:35
  • 1
    Right, but collecting immediately means you're just reading the parquet files and using dplyr on the resulting data.frame, not building an executing an Arrow query (using Acero) on the files. This is an Acero bug, not a parquet reading or dplyr one. – alistaire Jun 20 '23 at 23:42