Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

sparklyr provides an alternative R interface for Apache Spark, built on top of dplyr.

784 questions
5
votes
3 answers

R - How to replicate rows in a spark dataframe using sparklyr

Is there a way to replicate the rows of a Spark dataframe using the functions of sparklyr/dplyr? sc <- spark_connect(master = "spark://####:7077") df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df") This is the desired…
Igor
  • 913
  • 1
  • 8
  • 18
5
votes
1 answer

Sparklyr: Use group_by and then concatenate strings from rows in a group

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group. Here is a simple example that I think should work but doesn't: library(sparklyr) d <- data.frame(id=c("1", "1", "2", "2", "1", "2"), …
Maggie
  • 357
  • 4
  • 11
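
One common way to approach the concatenation described in this question is to aggregate with Spark SQL's `collect_list()` and `concat_ws()`, which sparklyr passes through to Spark untranslated. The sketch below is only an illustration: the local connection, the table name `d`, and the column `x` are assumptions, since the original example is truncated.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available.
sc <- spark_connect(master = "local")

# Hypothetical data: `x` stands in for the truncated column in the question.
d_tbl <- copy_to(
  sc,
  data.frame(id = c("1", "1", "2"), x = c("a", "b", "c"),
             stringsAsFactors = FALSE),
  "d",
  overwrite = TRUE
)

# collect_list() gathers the values of each group into an array;
# concat_ws() joins that array into one string with the given separator.
result <- d_tbl %>%
  group_by(id) %>%
  summarise(x_all = concat_ws(" ", collect_list(x))) %>%
  collect()
```

Spark does not guarantee the order of elements returned by `collect_list()`, so the concatenated string may vary between runs unless the data is explicitly ordered.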
5
votes
0 answers

R - Unable to collect data from Spark using Sparklyr

I'm using Spark 2.0.2 in combination with sparklyr 0.5.4-9004 in RStudio on a Windows server. Every once in a while, when I try to collect, read or write data from the Spark server, I get the following error: Error in UseMethod("invoke") : …
Igor
  • 913
  • 1
  • 8
  • 18
5
votes
4 answers

Is it possible to do a full join in dplyr and keep all the columns used in the join?

I have two tables that I want to full join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience, it only keeps the join column for the left-hand side. This is a problem when you have a row…
Dave Kincaid
  • 3,970
  • 3
  • 24
  • 32
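
For local data frames, dplyr 1.0.0 and later accept `keep = TRUE` in the join functions, which retains the key columns from both inputs. A minimal sketch with hypothetical toy tables (`left_tbl`, `right_tbl`, and their columns are made up for illustration):

```r
library(dplyr)

# Hypothetical toy tables; `id` is the join key in both.
left_tbl  <- tibble(id = c(1, 2), x = c("a", "b"))
right_tbl <- tibble(id = c(2, 3), y = c("c", "d"))

# keep = TRUE preserves the key from each side, suffixed as id.x and id.y,
# so rows that exist on only one side show NA in the other side's key.
joined <- full_join(left_tbl, right_tbl, by = "id", keep = TRUE)
```

Whether `keep` is honoured for Spark tables depends on the installed dbplyr version, so it is worth checking against your own setup before relying on it there.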
5
votes
2 answers

Disable hive support in sparklyr

Is there any way to disable the hive support in sparklyr? Just like in SparkR: sparkR.session(master="local[*]", enableHiveSupport=FALSE)
Raphael Sampaio
  • 148
  • 2
  • 11
5
votes
1 answer

Running out of heap space in sparklyr, but have plenty of memory

I am getting heap space errors on even fairly small datasets. I can be sure that I'm not running out of system memory. For example, consider a dataset containing about 20M rows and 9 columns, and that takes up 1GB on disk. I am playing with it on a…
David Bruce Borenstein
  • 1,655
  • 2
  • 19
  • 34
5
votes
2 answers

Trying to Connect R to Spark using Sparklyr

I'm trying to connect R to Spark using sparklyr. I followed the tutorial from the RStudio blog. I tried installing sparklyr using install.packages("sparklyr"), which went fine, but in another post I saw that there was a bug in the sparklyr_0.4 version. So I…
Rakesh Kumar
  • 161
  • 2
  • 9
5
votes
1 answer

How may I connect Google Dataproc cluster from Sparklyr?

I'm new to Spark and GCP. I've tried to connect to it with sc <- spark_connect(master = "IP address") but it obviously couldn't work (e.g. there is no authentication). How should I do that? Is it possible to connect to it from outside Google Cloud?
5
votes
4 answers

Can sparklyr be used with spark deployed on yarn-managed hadoop cluster?

Is the sparklyr R package able to connect to YARN-managed hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark it is possible by doing: # set R environment…
Matt Pollock
  • 1,063
  • 10
  • 26
4
votes
1 answer

Spark regexp_extract() fails - Regex group count is 0, but the specified group index is 1

I would like to extract the last part of the string (after the last forward slash). When I use the following code it fails with the error: library(sparklyr) library(tidyverse) sc <- spark_connect(method = "databricks") tibble(my_string =…
Piotr K
  • 943
  • 9
  • 20
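
The error message in this question indicates that the pattern contains no capture group while group index 1 was requested. The question's own pattern is not visible in the truncated excerpt, so the sketch below only illustrates one pattern that does contain a group; the local connection, the table name `paths`, and the sample strings are assumptions.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation (the question uses Databricks).
sc <- spark_connect(master = "local")

# Hypothetical sample strings.
paths_tbl <- copy_to(
  sc,
  data.frame(my_string = c("a/b/c", "x/y"), stringsAsFactors = FALSE),
  "paths",
  overwrite = TRUE
)

# The parentheses define capture group 1: every character after the
# last forward slash, anchored at the end of the string.
extracted <- paths_tbl %>%
  mutate(last_part = regexp_extract(my_string, "([^/]+)$", 1)) %>%
  collect()
```

`regexp_extract()` is a Spark SQL function rather than an R function, so the regular expression follows Java regex rules and is evaluated on the Spark side.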
4
votes
1 answer

How to read all files in S3 folder/bucket using sparklyr in R?

I have tried the code below and variations of it to read all files in an S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each 6.5 GB. #Spark…
Yogesh Kumar
  • 609
  • 6
  • 22
4
votes
1 answer

How to explode the dataset in JSON file by using explode functionality in R?

Note: I have referred to an existing answer, but although the data is un-nested, I could not convert it into CSV format. I want to flatten data of different data types using the explode functionality. The dataset contains arrays and structures. I want…
Shree
  • 203
  • 3
  • 22
4
votes
2 answers

sparklyr can I pass format and path options into spark_write_table? or use saveAsTable with spark_write_orc?

Spark 2.0 with Hive. Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore. In Spark I would do that like this: irisDf.write.format("orc") .mode("overwrite") .option("path", "s3://my_bucket/iris/") …
blakiseskream
  • 338
  • 4
  • 9
4
votes
1 answer

How to row bind two Spark dataframes using sparklyr?

I tried the following to row bind two Spark dataframes, but I got an error message: library(sparklyr) library(dplyr) sc <- spark_connect(master = "local") iris_tbl <- copy_to(sc, iris) iris_tbl1 <- copy_to(sc, iris, "iris1") iris_tbl2 =…
xiaodai
  • 14,889
  • 18
  • 76
  • 140
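
sparklyr ships `sdf_bind_rows()` for stacking Spark dataframes by row. The sketch below reuses the `iris` tables from the question; the table name `iris2` and the variable `combined` are assumptions, because the excerpt is cut off before the second table is fully defined.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation, as in the question.
sc <- spark_connect(master = "local")

iris_tbl1 <- copy_to(sc, iris, "iris1", overwrite = TRUE)
iris_tbl2 <- copy_to(sc, iris, "iris2", overwrite = TRUE)  # hypothetical name

# sdf_bind_rows() appends the rows of the second table to the first,
# matching columns by name.
combined <- sdf_bind_rows(iris_tbl1, iris_tbl2)

# Count the rows on the Spark side without collecting the data into R.
n_rows <- sdf_nrow(combined)
```

`dplyr::union_all()` is another option when both tables already share the same columns.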
4
votes
1 answer

Use of first, last, nth in sparklyr

I have looked all over and I'm still unable to get those three dplyr functions to work within sparklyr. I have a reproducible example below. First, some session info: R version 3.4.3 (2017-11-30) Platform: x86_64-pc-linux-gnu (64-bit) Running under:…
Hutch3232
  • 408
  • 4
  • 11