Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

sparklyr provides an alternative to the SparkR interface for Apache Spark, built on top of dplyr.

784 questions
8
votes
5 answers

spark: java.io.IOException: No space left on device [again!]

I am getting a java.io.IOException: No space left on device error after running a simple query in sparklyr. I am using the latest versions of both Spark (2.1.1) and sparklyr. df_new <- spark_read_parquet(sc, "/mypath/parquet_*", name = "df_new", memory =…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
8
votes
3 answers

Connect sparklyr to remote spark connection

I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you go to add a new connection in the sparklyr UI tab in RStudio and choose cluster, it says that you have to be running on the cluster, or have a…
Jim Crozier
  • 1,378
  • 2
  • 16
  • 28
7
votes
2 answers

How to set SPARK_LOCAL_DIRS parameter using spark-env.sh file

I am trying to change the location Spark writes temporary files to. Everything I've found online says to do this by setting the SPARK_LOCAL_DIRS parameter in the spark-env.sh file, but I am not having any luck with the changes actually taking…
jay
  • 517
  • 1
  • 7
  • 19
7
votes
1 answer

How to train a ML model in sparklyr and predict new values on another dataframe?

Consider the following example: dtrain <- data_frame(text = c("Chinese Beijing Chinese", "Chinese Chinese Shanghai", "Chinese Macao", "Tokyo Japan Chinese"), …
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
7
votes
1 answer

"GC overhead limit exceeded" on cache of large dataset into spark memory (via sparklyr & RStudio)

I am very new to the Big Data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into…
renegademonkey
  • 457
  • 1
  • 7
  • 18
7
votes
0 answers

Sparklyr "embedded nul in string" when collecting

In R I have a spark connection and a DataFrame as ddf. library(sparklyr) library(tidyverse) sc <- spark_connect(master = "foo", version = "2.0.2") ddf <- spark_read_parquet(sc, name='test', path="hdfs://localhost:9001/foo_parquet") Since it's not a…
Tim
  • 2,000
  • 4
  • 27
  • 45
7
votes
1 answer

Changing column data type to factor with sparklyr

I am pretty new to Spark and am currently using it through the R API via the sparklyr package. I created a Spark data frame from a Hive query. The data types are not specified correctly in the source table and I'm trying to reset the data type by…
b396958
  • 73
  • 1
  • 4
6
votes
1 answer

what is the difference between dplyr::copy_to and sparklyr::sdf_copy_to?

I am using the sparklyr library to interact with Spark. There are two functions for putting a data frame into a Spark context: 'dplyr::copy_to' and 'sparklyr::sdf_copy_to'. What is the difference, and when is it recommended to use one…
6
votes
1 answer

Writing a function to use with spark_apply() from sparklyr

test <- data.frame('prod_id'= c("shoe", "shoe", "shoe", "shoe", "shoe", "shoe", "boat", "boat","boat","boat","boat","boat"), 'seller_id'= c("a", "b", "c", "d", "e", "f", "a","g", "h", "r", "q", "b"), 'Dich'= c(1, 0,…
Kreitz Gigs
  • 369
  • 1
  • 9
6
votes
1 answer

Extract and Visualize Model Trees from Sparklyr

Does anyone have any advice about how to convert the tree information from sparklyr's ml_decision_tree_classifier, ml_gbt_classifier, or ml_random_forest_classifier models into a.) a format that can be understood by other R tree-related libraries…
RealViaCauchy
  • 237
  • 1
  • 10
6
votes
3 answers

Find out if 2 tables (`tbl_spark`) are equal without collecting them using sparklyr

Suppose there are 2 tables or table references in Spark that you want to compare, e.g. to ensure that your backup worked correctly. Is it possible to do that remotely in Spark? Because it's not useful to copy all the data to R using…
nachti
  • 1,086
  • 7
  • 20
6
votes
1 answer

Sparklyr ignoring line delimiter

I'm trying to read a .csv of ~2 GB (5 million lines) in sparklyr with: bigcsvspark <- spark_read_csv(sc, "bigtxt", "path", delimiter = "!", infer_schema = FALSE, …
Jader Martins
  • 759
  • 6
  • 26
6
votes
1 answer

How to use a predicate while reading from JDBC connection?

By default, spark_read_jdbc() reads an entire database table into Spark. I've used the following syntax to create these connections. library(sparklyr) library(dplyr) config <- spark_config() config$`sparklyr.shell.driver-class-path` <-…
Jake Russ
  • 683
  • 1
  • 9
  • 19
6
votes
3 answers

sparklyr write data to hdfs or hive

I tried using sparklyr to write data to HDFS or Hive, but was unable to find a way. Is it even possible to write an R data frame to HDFS or Hive using sparklyr? Please note, my R and Hadoop are running on two different servers, thus I need a way…
Rahul
  • 71
  • 1
  • 4
6
votes
3 answers

Access table in other than default scheme (database) from sparklyr

After managing to connect to our (new) cluster using sparklyr with the yarn-client method, I can only show the tables from the default scheme. How can I connect to scheme.table? Using DBI it works, e.g. with the following line: dbGetQuery(sc,…
nachti
  • 1,086
  • 7
  • 20