
I am using the library sparklyr to interact with Spark. There are two functions for putting a data frame into a Spark context: `dplyr::copy_to` and `sparklyr::sdf_copy_to`. What is the difference, and when is it recommended to use one instead of the other?
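For reference, here is a minimal sketch of the two calls I am comparing (the local connection `sc` and the use of `mtcars` are just placeholders):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# dplyr generic: dispatches to sparklyr's method for a Spark connection
tbl_a <- copy_to(sc, mtcars, name = "mtcars_a", overwrite = TRUE)

# sparklyr's Spark-specific function
tbl_b <- sdf_copy_to(sc, mtcars, name = "mtcars_b", overwrite = TRUE)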

  • The `sparklyr` one is implemented for Spark data frames (following the RDD concept in a distributed environment), whereas `dplyr` works for R data frames, tibbles, etc. Is this what you are asking? I am not really sure – Sotos May 15 '19 at 14:27
  • This answers the first part of my question. The second part is: do they perform the same? If so, in what situation is it better to use one instead of the other? – Sergio Marrero Marrero May 15 '19 at 14:39
  • You can't simply use one or the other interchangeably. You cannot use `dplyr::copy_to` inside the Spark environment, **UNLESS** you collect your data frames from RDDs into R data frames. Vice versa for `sparklyr` – Sotos May 15 '19 at 14:41
  • So if I have two data frames and I want to copy them to the Spark environment, there is absolutely no difference between them? I expected something like: the `sparklyr` version is more efficient, or something along those lines... – Sergio Marrero Marrero May 15 '19 at 14:45
  • If your data frame is small enough to be handled locally (i.e. not distributed), then `dplyr` will be more efficient. The thing about Spark is that it is more efficient IF your data set is big enough to need a distributed environment for the analysis. So if you try any type of analysis on a small data set, it will be more efficient to do it locally using `dplyr` or any other R tooling as usual – Sotos May 15 '19 at 14:48
  • So for big data frames, is the `sparklyr` version better? I actually ran into many problems trying to upload a data frame with 2 million observations and just 3 columns to Spark with the `dplyr` version. My solution was to split the data frame into 4 pieces, upload them separately, and later bind them into one data frame in Spark. Do you think I could avoid this problem using the `sparklyr` version? – Sergio Marrero Marrero May 15 '19 at 14:53
  • Of course. Just load the entire thing into Spark and do the aggregations there. For me, I do all my aggregations in Spark (but I use `pyspark` instead of R), and then I collect locally and continue in R (or Python). – Sotos May 15 '19 at 14:55
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/193406/discussion-between-sergio-marrero-marrero-and-sotos). – Sergio Marrero Marrero May 15 '19 at 15:12

1 Answer


They're the same. I would use copy_to rather than the specialist sdf_copy_to because it is more consistent with other data sources, but that's stylistic.

The function copy_to is a generic from dplyr and works with any data source which implements a dplyr backend.

You can use it with a spark connection because sparklyr implements copy_to.src_spark and copy_to.spark_connection. They are not exposed to the user since you're supposed to use copy_to and let it dispatch to the correct method.
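A quick sketch of that dispatch in practice (the local connection and `iris` are placeholders, not something from the question):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# sc has class "spark_connection", so the generic dplyr::copy_to()
# dispatches to sparklyr's copy_to.spark_connection method
inherits(sc, "spark_connection")

iris_tbl <- copy_to(sc, iris, name = "iris_spark", overwrite = TRUE)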

copy_to.src_spark just calls copy_to.spark_connection:

#> sparklyr:::copy_to.src_spark
function (dest, df, name, overwrite, ...) 
{
    copy_to(spark_connection(dest), df, name, ...)
}
<bytecode: 0x5646b227a9d0>
<environment: namespace:sparklyr>

copy_to.spark_connection just calls sdf_copy_to:

#> sparklyr:::copy_to.spark_connection
function (dest, df, name = spark_table_name(substitute(df)), 
    overwrite = FALSE, memory = TRUE, repartition = 0L, ...) 
{
    sdf_copy_to(dest, df, name, memory, repartition, overwrite, 
        ...)
}
<bytecode: 0x5646b21ef120>
<environment: namespace:sparklyr>

sdf_copy_to follows the package-wide convention of prefixing functions related to Spark DataFrames with "sdf_". On the other hand, copy_to comes from dplyr, and sparklyr provides compatible methods for the convenience of dplyr users.
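As a rough illustration of the equivalence (the synthetic data frame, table names and repartition value below are arbitrary choices of mine), the two calls should produce the same Spark DataFrame; sdf_copy_to merely exposes the Spark-specific arguments directly, while copy_to forwards them:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a synthetic data frame standing in for a larger one
df <- data.frame(id = seq_len(1e5), value = rnorm(1e5))

# dplyr-style: Spark-specific arguments are forwarded to sdf_copy_to()
df_a <- copy_to(sc, df, name = "df_a", overwrite = TRUE,
                memory = TRUE, repartition = 4L)

# sparklyr-style: the same arguments, passed directly
df_b <- sdf_copy_to(sc, df, name = "df_b",
                    memory = TRUE, repartition = 4L, overwrite = TRUE)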

asachet