I get different results when using `full_join` on a tibble and on an `arrow_table`. Can somebody help me understand what is going on?

library(arrow)
library(dplyr)

xa1 <- arrow_table(x = 1L)
xa2 <- arrow_table(x = 2L)

x1 <- tibble(x = 1L)
x2 <- tibble(x = 2L)

full_join(xa1,xa2,on = c("x")) %>%  collect() %>% compute()
full_join(x1,x2)

# A tibble: 2 × 1
      x
  <int>
1     1
2    NA
full_join(x1,x2)
Joining, by = "x"
# A tibble: 2 × 1
      x
  <int>
1     1
2     2
Vitalijs

1 Answer

There is no `on` argument in the dplyr `*_join` functions. The usage according to `?dplyr::full_join` is

full_join( x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), ..., keep = NULL )

`on` is a data.table join argument. Here we need `by`:

library(arrow)
library(dplyr)
full_join(xa1, xa2, by = "x") %>%
     collect() %>% 
     compute()

-output

# A tibble: 2 × 1
      x
  <int>
1     1
2     2

By looking at the methods and source code

> methods("full_join")
[1] full_join.arrow_dplyr_query* full_join.ArrowTabular*      full_join.data.frame*        full_join.Dataset*           full_join.RecordBatchReader*
> getAnywhere(full_join.ArrowTabular)
function (x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), 
    ..., keep = FALSE) 
{
    query <- do_join(x, y, by, copy, suffix, ..., keep = keep, 
        join_type = "FULL_OUTER")
    if (!keep) {
        query$selected_columns <- post_join_projection(names(x), 
            names(y), handle_join_by(by, x, y), suffix)
    }
    query
}

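`by` is passed on to the functions called inside the method (`do_join()` and `handle_join_by()`). `on` is not a formal argument of the method, so it is captured by `...` instead of being treated as the join key, which is why no error is raised.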
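To illustrate why a misspelled argument such as `on` does not raise an error, here is a minimal base R sketch. `toy_join()` is a hypothetical stand-in, not the arrow or dplyr implementation; it only mimics a signature with `by = NULL` followed by `...`.

```r
# Hypothetical stand-in for a join method: `by` is a formal argument, `on` is not.
toy_join <- function(x, y, by = NULL, ...) {
  extra <- list(...)  # anything that matches no formal argument ends up here
  list(by = by, ignored = names(extra))
}

res <- toy_join(1, 2, on = "x")

is.null(res$by)  # TRUE: `by` was never set, so it keeps its default
res$ignored      # "on": the argument was silently captured by `...`
```

With `by` left at `NULL`, a real join method falls back to its default behaviour of inferring the keys, so the call behaves as if `on` had never been supplied.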

akrun
  • It is strange. I had this issue in arrow 9; once I installed version 10, everything was fine. Any clue why it does not complain about `on`? – Vitalijs Nov 19 '22 at 21:43
  • This is the thing, that in the previous version I was getting that result. – Vitalijs Nov 19 '22 at 21:46
  • It seems that there is a bug in version 9.0.0.20220828: with `by` I get exactly the same result. – Vitalijs Nov 19 '22 at 21:49
  • @Vitalijs, note that `full_join(x1,x2,on = c("x"))` works, but its message of `Joining, by = "x"` suggests that it ignored `on=` and chose to infer its `by=` (which happened to be the same, only because the set of common column names set it to be that way). – r2evans Nov 19 '22 at 21:51
  • @akrun, @r2evans. In arrow version `9.0.0.20220828` I get that result even if I use `by`. – Vitalijs Nov 19 '22 at 21:52
  • My guess is that it is inferring based on the shared column names. That's the default behavior. If you try doing that in arrow-9 with two frames where there are more shared columns _and you do not want to use all of them_, you should start seeing more differences. – r2evans Nov 19 '22 at 21:53
  • @r2evans just to clarify, is this a bug or a feature? Is this solved in version 10? – Vitalijs Nov 19 '22 at 21:54
  • I think it was coincidence that it worked previously. `dplyr::full_join` will (I believe) always support a missing `by=`, though I think that's bad practice. – r2evans Nov 19 '22 at 22:23
  • There was a bug in full join in Arrow: https://issues.apache.org/jira/browse/ARROW-16897. It was fixed in 10.0.0, so this seems likely. – Pace Nov 20 '22 at 07:14
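The point made in the comments about inferred join keys can be sketched with plain tibbles. The data below is made up for illustration and the example uses only dplyr, not arrow: when two frames share more than one column, leaving out `by` joins on all shared columns, which can give a different result from an explicit `by`.

```r
library(dplyr)

# Made-up example data: `id` and `grp` are both shared column names.
a <- tibble(id = c(1L, 2L), grp = c("a", "b"), va = c(10, 20))
b <- tibble(id = c(1L, 2L), grp = c("a", "z"), vb = c(100, 200))

# No `by`: dplyr infers the keys from all shared columns (id and grp).
inferred <- suppressMessages(full_join(a, b))

# Explicit `by`: only `id` is used; the other shared column gets suffixes.
explicit <- full_join(a, b, by = "id")

nrow(inferred)   # 3: (2, "b") and (2, "z") do not match on grp
nrow(explicit)   # 2: both ids match
names(explicit)  # "id" "grp.x" "va" "grp.y" "vb"
```

Spelling out `by` makes the intended keys visible and avoids depending on whichever columns happen to be shared.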