
Situation

I used to work in RStudio with data.table instead of plyr or sqldf because it's really fast. Now I'm working with SparkR on an Azure cluster, and I'd like to know whether I can use data.table on my Spark DataFrames, and whether it's faster than SQL.

Orhan Yazar
  • There is a `sparklyr` package by RStudio which allows you to use Spark DataFrames with `dplyr` (see the sketch after these comments). – David Arenburg Nov 10 '17 at 12:18
  • Yes, @DavidArenburg, but can one use the data.table package and its idioms to analyze Spark DataFrames, or must one use dplyr? – Avraham Jan 17 '18 at 16:13
  • @Avraham data.table's author works at [h2o.ai](https://www.h2o.ai/). It is a distributed system (based on Spark IIRC) that understands R syntax and has a lot of data.table features built in (thanks to Matt), such as distributed binary search (see [this](https://www.youtube.com/watch?v=5X7h1rZGVs0)). Other than that, I'm not sure how you would work with data.table on a Spark data.frame unless you collect it to one node. – David Arenburg Jan 17 '18 at 18:35
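As a rough illustration of the `sparklyr` route mentioned in the comments, here is a minimal sketch; the `local` master and the built-in `mtcars` data are placeholders, and on an Azure cluster the connection details would differ:

```r
library(sparklyr)
library(dplyr)

# Placeholder connection; on an Azure cluster the master/config would differ
sc <- spark_connect(master = "local")

# Copy a local data frame to Spark; dplyr verbs on the resulting tbl
# are translated to Spark SQL, so computation stays on the cluster
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()   # bring only the small aggregated result back to R

spark_disconnect(sc)
```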

1 Answer


It is not possible. SparkDataFrames are Java objects with a thin R interface. While it is possible to use worker-side R in some limited cases (`dapply`, `gapply`), there is no use for data.table there.
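To make the worker-side caveat concrete, here is a minimal sketch (not from the original answer) of what `dapply` actually does, using a made-up toy SparkDataFrame and assuming data.table is installed on every worker node. Each partition arrives in the function as a plain local R data.frame, so data.table can run inside the worker, but Spark only ever sees the returned data.frame; keys, indexes, and by-reference semantics never apply to the distributed data:

```r
library(SparkR)
sparkR.session()

# Toy SparkDataFrame built from a local data.frame (illustrative only)
sdf <- createDataFrame(data.frame(x = 1:10, y = rnorm(10)))

result <- dapply(
  sdf,
  function(partition) {
    # 'partition' is an ordinary local R data.frame at this point;
    # data.table must be installed on each worker for this to run
    dt <- data.table::as.data.table(partition)
    dt[, z := x * 2]     # by-reference update, local to this worker only
    as.data.frame(dt)    # Spark only sees this plain data.frame
  },
  schema = structType(
    structField("x", "integer"),
    structField("y", "double"),
    structField("z", "double")
  )
)

head(collect(result))
```

So data.table can execute inside a worker, but it never operates on the SparkDataFrame as a whole; for distributed work you stay with the Spark APIs.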

  • Thank you, but is it faster to keep local data frames and work with data.table, or to use SparkDataFrames and work with sparklyr or Spark SQL? – Orhan Yazar Nov 15 '17 at 08:33