Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

sparklyr provides an alternative to the SparkR interface for Apache Spark, built on top of dplyr.

784 questions
0
votes
2 answers

Count the number of unique elements in each column with dplyr in sparklyr

I'm trying to count the number of unique elements in each column of the Spark dataset s. However, it seems that Spark doesn't recognize tally(): k<-collect(s%>%group_by(grouping_type)%>%summarise_each(funs(tally(distinct(.))))) Error:…
StatsBoy
  • 35
  • 5
0
votes
0 answers

How to find pairs of data by timestamp-window & values from different rows in sparklyr?

My test-data looks like this: (it's graph-like) elemuid <- c(1, 2, 3, 4, 5, 6, 7) timestamp <- c("2018-02-10 23:00:00", "2018-02-10 23:01:00", "2018-02-10 22:59:00", "2018-02-10 22:40:00", "2018-02-10 22:39:00", "2018-02-10 22:37:00", "2018-02-10…
user60856839
  • 133
  • 11
0
votes
1 answer

Sparklyr Spark 2.1 generate top n recommendation

R version 3.3.0 (2016-05-03) Sparklyr version ‘0.7.0’ Spark version 2.1 on YARN client I am using Spark framework in R using Sparklyr for generating top-5 recommendations for products which are likely to be sold and their expected quantity using ALS…
0
votes
2 answers

sparklyr spark_read_parquet Reading String Fields as Lists

I have a number of Hive files in parquet format that contain both string and double columns. I can read most of them into a Spark Data Frame with sparklyr using the syntax below: spark_read_parquet(sc, name = "name", path = "path", memory =…
bshelt141
  • 1,183
  • 15
  • 31
0
votes
0 answers

Wrong data when reading with sparklyr

I am using R and sparklyr to process some data from Spark. I am reading two parquet files, in sequence, with v1 <- spark_read_parquet(sc, "events","s3n://project/sessions.parquet", memory="true") head(v1) v2 <- spark_read_parquet(sc,…
user2345448
  • 159
  • 2
  • 11
0
votes
0 answers

I want to process tens of thousands of columns using Spark via sparklyr, but I can't

I tried using sdf_pivot() to widen my column with duplicate values into multiple (a very big number) columns. I planned to use these columns as the feature space for training an ML model. Example: I have a language element sequence in one column…
Alexey Burnakov
  • 259
  • 2
  • 14
0
votes
2 answers

How do you access the model parameters in ml_decision_tree in the Sparklyr package?

I have some sample code that is only working on one machine. After some testing, I discovered that the machine that worked was running R 3.4.2 while everything else was running 3.4.3. After some work I discovered that the way you access the…
Bob Wakefield
  • 3,739
  • 4
  • 20
  • 30
0
votes
1 answer

Convert variable as Timestamp in sparklyr

I know similar questions have been asked multiple times before, but I have tried all those options and still do not get the desired result. I have an sdf named kl in the following format: CONSUMER_ID TimeStamp TimeStamp2
ROY
  • 268
  • 2
  • 11
0
votes
0 answers

Calling any Spark MLlib function from R?

I found this example of calling spark.mllib functions directly from the Scala library. I don't understand everything here, but is it possible to call any MLlib function (one that is not available via, let's say, sparklyr) this way? In particular I am…
Alexey Burnakov
  • 259
  • 2
  • 14
0
votes
1 answer

Error after trying to make a date column from a character column

Using library sparklyr, I try to create a date variable in the Spark dataframe this way (which works in R): # Researching SPARK…
Alexey Burnakov
  • 259
  • 2
  • 14
0
votes
1 answer

Connecting Spark with RStudio on macOS gives Hive error

I am trying to use Spark in RStudio using the sparklyr library on macOS. I have installed it using the following commands # Install the sparklyr package install.packages("sparklyr") # Now load the library library(sparklyr) # Install Spark to your…
Regressor
  • 1,843
  • 4
  • 27
  • 67
0
votes
2 answers

How to implement lapply function in R using package "sparklyr"

I am pretty new to Spark. I have tried to look for something on the web, but I haven't found anything satisfactory. I have always run parallel computations using the command mclapply and I like its structure (i.e., first parameter used as scrolling…
0
votes
1 answer

How to select the same column of a Spark data frame multiple times in Sparklyr?

I have a Spark data frame sdf. I would like to generate another table from columns of sdf, where those columns can be repeated. The following is the desired output. > sdf %>% select(DC1_Y1,DC2_Y1,DC2_Y1) # Source: lazy query [?? x 3] #…
axiom
  • 406
  • 1
  • 4
  • 16
0
votes
0 answers

Using ml_save with R/Spark

I am training some models (random forest) using the ml library in Spark, R, and sparklyr. Everything works, but now I need to save those models so they can be used to make predictions for new data. I call ml_save(rfW1,w$fileName) where rfW1 is the…
user2345448
  • 159
  • 2
  • 11
0
votes
1 answer

Sparklyr read database table to distributed DF

Hi, I am trying to figure out if there is a way to directly read a DB table into a sparkR dataframe. I have RStudio installed on an EMR cluster which has my Hive metastore on it. I know I can do the following: library(sparklyr) library(dplyr) sc <-…
user295944
  • 273
  • 4
  • 17