
I tried the following to row-bind two Spark DataFrames, but it gave an error message:

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
iris_tbl1 <- copy_to(sc, iris, "iris1")

iris_tbl2 = bind_rows(iris_tbl, iris_tbl1)

What's the most efficient way to bind two Spark dataframes together?


1 Answer


`dplyr::bind_rows` works only on local data frames, which is why the call above fails on Spark tbls. You can use `dplyr::union_all` instead:

dplyr::union_all(iris_tbl, iris_tbl1)
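
As a quick sanity check (assuming the two 150-row copies of iris from the question), the result should contain the rows of both inputs:

iris_tbl2 <- dplyr::union_all(iris_tbl, iris_tbl1)
sdf_nrow(iris_tbl2)  # 300: 150 rows from each input

Like SQL UNION ALL, this matches columns by position, so both inputs must share the same schema.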

or sparklyr::sdf_bind_rows:

sdf_bind_rows(
  iris_tbl %>% select(-Sepal_Length),
  iris_tbl1 %>% select(-Petal_Length)
)
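
`sdf_bind_rows` matches columns by name, so in the sketch above the rows from each input get NA/NULL in the column the other input dropped (Sepal_Length for the rows from iris_tbl, Petal_Length for the rows from iris_tbl1).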

You could also use Spark's own `unionByName`, which works when the schemas are compatible but the column order doesn't match:

sdf_union_by_name <- function(x, y) {
  # x and y are tbl_spark references; spark_dataframe() extracts the
  # underlying Spark DataFrame (a Java object reference) so that invoke()
  # can call its unionByName method, and sdf_register() wraps the result
  # back into a tbl_spark
  invoke(spark_dataframe(x), "unionByName", spark_dataframe(y)) %>%
    sdf_register()
}

sdf_union_by_name(
  iris_tbl %>% select(Sepal_Length, Petal_Length),
  iris_tbl %>% select(Petal_Length, Sepal_Length)
)
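
`unionByName` (available since Spark 2.3) pairs columns by name rather than by position, so the swapped column order in the second input above is handled correctly; a positional union would have stacked Sepal_Length on top of Petal_Length.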
  • What is the reason that you need the `spark_dataframe(...)` within the `invoke`? `x` is already a spark dataframe reference right? – Siete Aug 10 '22 at 08:51