Questions tagged [sparkr]

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.

SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.

SparkR exposes the RDD API of Spark as distributed lists in R.
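A minimal session sketch using the Spark 2.x+ entry point (the data set is R's built-in faithful):

```r
library(SparkR)
sparkR.session(master = "local[*]")   # start or reuse a local Spark session

df <- createDataFrame(faithful)       # distribute a local R data.frame
head(filter(df, df$waiting < 50))     # filter runs on Spark; head() collects a few rows

sparkR.session.stop()
```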

796 questions
7 votes · 1 answer

How to check for intersection of two DataFrame columns in Spark

Using either pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames: newHires <- data.frame(name = c("Thomas", "George", "George", "John"), …
Gaurav Bansal
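A sketch of one approach in SparkR (Spark 2.x; the employees table is invented for illustration):

```r
library(SparkR)
sparkR.session()

newHires  <- createDataFrame(data.frame(name = c("Thomas", "George", "George", "John")))
employees <- createDataFrame(data.frame(name = c("George", "John", "Alice")))

# intersect() of the single-column projections yields the common values
common <- intersect(select(newHires, "name"), select(employees, "name"))
collect(common)
```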
7 votes · 1 answer

Not able to retrieve data from SparkR created DataFrame

I have below simple SparkR program, which is to create a SparkR DataFrame and retrieve/collect data from it. Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf.cloudera.yarn") Sys.setenv(SPARK_HOME =…
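For reference, the minimal create-and-collect round trip with the Spark 2.x+ session API (the question itself uses the older sparkR.init() path):

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)   # faithful is a built-in R data set
head(collect(df))                 # bring the rows back into the local R session
```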
7 votes · 2 answers

Using apply functions in SparkR

I am currently trying to implement some functions using sparkR version 1.5.1. I have seen older (version 1.3) examples, where people used the apply function on DataFrames, but it looks like this is no longer directly available. Example: x =…
bmcMunich
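In SparkR 1.5.1 the RDD-level apply functions were private; since Spark 2.0 the documented route is dapply()/gapply(). A sketch with dapply():

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1:10))

# dapply() runs an R function over each partition; the output schema
# must be declared explicitly
doubled <- dapply(df,
                  function(part) { part$x <- part$x * 2; part },
                  structType(structField("x", "double")))
head(collect(doubled))
```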
7 votes · 4 answers

SparkR Error in sparkR.init(master="local") in RStudio

I have installed the SparkR package from Spark distribution into the R library. I can call the following command and it seems to work properly: library(SparkR) However, when I try to get the Spark context using the following code, sc <-…
Umesh K
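A common fix is pointing R at the Spark installation before loading the package (paths are illustrative; Spark 1.x API as in the question):

```r
Sys.setenv(SPARK_HOME = "/opt/spark")   # adjust to your installation
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

library(SparkR)
sc <- sparkR.init(master = "local")     # Spark 1.x; use sparkR.session() on 2.x+
```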
7 votes · 3 answers

How to read csv into sparkR ver 1.4?

As a new version of Spark (1.4) was released, there appeared a nice frontend interface to Spark from the R package named SparkR. On the documentation page of R for Spark there is a command that enables reading JSON files as RDD objects: people…
Marcin
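In Spark 1.4 CSV support came from the external spark-csv package, so a sketch looks like this (file name illustrative):

```r
# Start SparkR with the csv reader on the classpath, e.g.:
#   sparkR --packages com.databricks:spark-csv_2.10:1.0.3
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)

people <- read.df(sqlContext, "people.csv",
                  source = "com.databricks.spark.csv",
                  header = "true", inferSchema = "true")
head(people)
```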
6 votes · 0 answers

Losing column names when writing a SparkDataFrame with SparkR write.df

Context: I'm working on an Azure HDI R Server cluster with RStudio and the SparkR package. I'm reading a file, modifying it, and then I want to write it with write.df, but the problem is that when I write the file, my column names disappear. My code is the…
Orhan Yazar
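With the csv source the header is off by default, so the usual remedy is passing header = "true" to write.df(). A sketch (output path illustrative):

```r
library(SparkR)
sparkR.session()

sdf <- createDataFrame(mtcars)

# Without header = "true" the csv writer emits data rows only, and the
# column names are lost on the way out
write.df(sdf, path = "mtcars_csv", source = "csv",
         mode = "overwrite", header = "true")
```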
6 votes · 1 answer

SparkR DataFrame partitioning issue

In my R script, I have a SparkDataFrame of two columns (time, value) containing data for four different months. Because I need to apply my function to each month separately, I figured I would repartition it into four partitions…
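One common alternative to hand-repartitioning is letting gapply() do the per-month split. A sketch assuming columns time (timestamp) and value, with sdf standing in for the question's SparkDataFrame:

```r
library(SparkR)

sdf <- withColumn(sdf, "month", month(sdf$time))  # derive the grouping key

res <- gapply(sdf, "month",
              function(key, df) {
                data.frame(month = key[[1]], mean_value = mean(df$value))
              },
              structType(structField("month", "integer"),
                         structField("mean_value", "double")))
head(res)
```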
6 votes · 1 answer

Is it possible to use data.table on SparkR with Sparkdataframes?

Situation: I used to work in RStudio with data.table instead of plyr or sqldf because it's really fast. Now I'm working with SparkR on an Azure cluster and I'd like to know if I can use data.table on my Spark DataFrames, and whether it's faster than SQL.
Orhan Yazar
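data.table operates on local in-memory tables, not on distributed SparkDataFrames, so the frame has to be collected first. A sketch (sdf and its name/value columns are assumptions):

```r
library(SparkR)
library(data.table)

local_dt <- as.data.table(collect(sdf))       # pulls all rows to the driver
local_dt[, .(avg = mean(value)), by = name]   # regular data.table syntax from here
```

Collecting only pays off when the data fits in the driver's memory; otherwise staying with Spark SQL or gapply() is the safer route.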
6 votes · 2 answers

Getting last value of group in Spark

I have a SparkR DataFrame as shown below: #Create R data.frame custId <- c(rep(1001, 5), rep(1002, 3), 1003) date <- c('2013-08-01','2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-02-01','2014-03-01','2014-04-01','2014-04-01') desc <-…
Gaurav Bansal
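A window-function sketch: ordering each customer's rows by date descending and taking first() yields the chronologically last value (df and its custId/date/desc columns follow the question's excerpt):

```r
library(SparkR)

# Partition by customer, latest date first; first() over that window
# returns each group's most recent row
ws  <- orderBy(windowPartitionBy("custId"), desc(df$date))
df2 <- withColumn(df, "last_desc", over(first(df$desc), ws))
```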
6 votes · 3 answers

Get mode (most often) value in Spark column with groupBy

I have a SparkR DataFrame and I want to get the mode (most often) value for each unique name. How can I do this? There doesn't seem to be a built-in mode function. Either a SparkR or PySpark solution will do. # Create DF df <- data.frame(name =…
Gaurav Bansal
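One SparkR sketch: count occurrences per (name, value), then keep the top row per name with a row_number() window (the value column is an assumption, since the question's excerpt is truncated):

```r
library(SparkR)

counts <- count(groupBy(df, "name", "value"))          # frequency per pair
ws     <- orderBy(windowPartitionBy("name"), desc(counts$count))
ranked <- withColumn(counts, "rn", over(row_number(), ws))

modes <- select(filter(ranked, ranked$rn == 1), "name", "value")
```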
6 votes · 1 answer

How to list spark-packages added to the Spark context?

Is it possible to list what Spark packages have been added to the Spark session? The class org.apache.spark.deploy.SparkSubmitArguments has a variable for the packages: var packages: String = null. Assuming this is a list of the Spark packages, is…
Chris Snow
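From SparkR 2.x the runtime configuration is queryable from R, and --packages entries surface under spark.jars.packages. A sketch:

```r
library(SparkR)
sparkR.session()

conf <- sparkR.conf()            # full session configuration as a named list
conf[["spark.jars.packages"]]    # NULL unless packages were supplied
```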
6 votes · 1 answer

How to do map and reduce in SparkR

How do I do map and reduce operations using SparkR? All I can find is stuff about SQL queries. Is there a way to do map and reduce using SQL?
Matthew Jones
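The RDD API is no longer public in SparkR, but spark.lapply() covers a distributed map over a list, and dapply() plus aggregations cover the map/reduce pattern on SparkDataFrames. A sketch:

```r
library(SparkR)
sparkR.session()

# map: run an R function on each element of a list, in parallel
squares <- spark.lapply(1:10, function(x) x^2)

# reduce: aggregate a SparkDataFrame column
df <- createDataFrame(data.frame(x = 1:10))
collect(agg(df, total = sum(df$x)))
```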
5 votes · 0 answers

SparkR code fails if Apache Arrow is enabled

I am running the gapply function on a SparkDataFrame, which looks like below: df <- gapply(sp_Stack, function(key, e) { Sys.setlocale('LC_COLLATE', 'C') suppressPackageStartupMessages({ library(Rcpp) library(Matrix) …
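When the Arrow-optimized R path misbehaves, a common workaround is disabling it for the session (Spark 3.x config key):

```r
library(SparkR)

sparkR.session(sparkConfig = list(
  spark.sql.execution.arrow.sparkr.enabled = "false"  # fall back to the non-Arrow serializer
))
```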
5 votes · 2 answers

Efficient way to read and write data into files over a loop using R

I am trying to read and write data into files at each time step. To do this, I am using the package h5 to store large datasets but I find that my code using the functions of this package is running slowly. I am working with very large datasets. So,…
Nell
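With the h5 package, the usual speed-up is opening the file once outside the loop and writing one dataset per step (file and dataset names are illustrative):

```r
library(h5)   # CRAN package used in the question; hdf5r is its successor

f <- h5file("results.h5", mode = "a")   # open once, not inside the loop
for (t in 1:100) {
  f[paste0("step_", t)] <- matrix(rnorm(1e4), ncol = 100)
}
h5close(f)
```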
5 votes · 1 answer

Find variables making Primary Key using SparkR

Here is my toy data: df <- tibble::tribble( ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7, "A", "C", 1L, 5L, "AA", "AB", 1L, "A", "C", 2L, 5L, "BB", "AC", 2L, "A", "D", 1L, 7L, "AA", "BC", 2L, …
Geet
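A column set is a candidate key when its distinct count equals the total row count. A SparkR sketch (sdf stands in for the Spark version of the question's df):

```r
library(SparkR)

# TRUE when the given columns uniquely identify every row
is_key <- function(sdf, cols) {
  nrow(distinct(select(sdf, cols))) == nrow(sdf)
}

is_key(sdf, c("var1", "var3"))   # test one candidate combination
```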