I want to save my Spark DataFrame into a directory using a spark_write_* function, like this:
spark_write_csv(df, "file:///home/me/dir/")
but if the directory already exists, I get an error:
ERROR: org.apache.spark.sql.AnalysisException: path…
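The usual fix is spark_write_csv's mode argument, which accepts the standard Spark save modes. A minimal sketch, assuming a local connection and reusing the path from the question:

library(sparklyr)
sc <- spark_connect(master = "local")
df <- sdf_copy_to(sc, iris, overwrite = TRUE)
# mode = "overwrite" replaces an existing directory instead of failing;
# "append" adds new part files to it instead.
spark_write_csv(df, "file:///home/me/dir/", mode = "overwrite")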
I am trying to perform linear regression using sparklyr on an EMR cluster, and I am receiving the error below. The connection to Spark seems fine, and I have tried several different datasets, but they all result in the same error. I am looking…
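For comparison, a minimal sketch of a sparklyr linear regression that works on a local connection (mtcars stands in for the questioner's datasets; the formula interface assumes a reasonably recent sparklyr):

library(sparklyr)
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
# Fit a Spark MLlib linear model through the formula interface
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)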
I am trying to use spark_apply on a Spark cluster to run kmeans on data grouped by two columns. The data is queried from Hive and looks like this:
> samplog1
# Source: lazy query [?? x 6]
# Database: spark_connection
…
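A sketch of the grouped spark_apply pattern, with grp1, grp2, x and y as hypothetical stand-ins for the real column names hidden in the truncated output above:

library(sparklyr)
library(dplyr)
# kmeans() runs once per (grp1, grp2) group on the workers; the
# closure must be self-contained because it is serialized to them.
result <- samplog1 %>%
  spark_apply(
    function(df) {
      km <- kmeans(df[, c("x", "y")], centers = 2)
      cbind(df, cluster = km$cluster)
    },
    group_by = c("grp1", "grp2")
  )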
I have recently created an R package that makes use of sparklyr's capabilities. I invoke the package's main function from OpenCPU and pass as an argument a URL with all my data as a stream. The data stream is successfully analysed in a distributed way via…
I'm pretty new to Shiny and Spark.
I want to deploy a Shiny app with a Spark connection. Everything works as it should when I just hit Run App, but whenever I try to publish it, I get the error: "Error in value[3L] :
SPARK_HOME directory…
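A common workaround sketch: set SPARK_HOME explicitly at the top of app.R, since spark_install() only affects the machine it runs on and the publishing server needs its own Spark installation (the path below is an assumption):

library(sparklyr)
# Hypothetical path; it must point at a real Spark installation on
# the server the app is published to.
Sys.setenv(SPARK_HOME = "/usr/lib/spark")
sc <- spark_connect(master = "local",
                    spark_home = Sys.getenv("SPARK_HOME"))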
I am relatively new as an active user of the forum, but first I have to thank you all for your contributions, because I have been looking for answers here for years...
Today, I have a question that nobody has solved, or whose answer I am not able to find...
I am trying to…
Similar to this code snippet that lists the memory usage of objects in the local R environment, is there a command to see the memory used by the DataFrames available in a Spark connection? E.g. something similar to src_tbls(sc), which currently only…
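I am not aware of a dedicated sparklyr helper; one sketch, assuming a Spark backend where the JVM method SparkContext.getRDDStorageInfo() is still exposed, queries the storage info through invoke():

library(sparklyr)
sc <- spark_connect(master = "local")
# One RDDInfo per cached RDD; memSize() is the in-memory footprint in
# bytes. Whether this JVM method exists depends on the Spark version.
storage <- invoke(spark_context(sc), "getRDDStorageInfo")
data.frame(
  name      = sapply(storage, invoke, "name"),
  mem_bytes = sapply(storage, invoke, "memSize")
)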
I'm having issues trying to connect using sparklyr.
install.packages('sparklyr')
require(sparklyr)
spark_install()
sc <- spark_connect(master = "local")
I've had a few errors I worked through, like my dplyr version not being up to date, and something…
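When debugging this kind of failure it can help to check what sparklyr actually has on disk before connecting; a sketch using only sparklyr's own helpers (the pinned version is an arbitrary example):

library(sparklyr)
spark_installed_versions()  # Spark builds spark_install() has downloaded
spark_available_versions()  # versions sparklyr knows how to fetch
# Connecting with an explicit version avoids picking up a stale install
sc <- spark_connect(master = "local", version = "2.1.0")
spark_version(sc)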
I'm trying to pass a custom R function inside spark_apply, but I keep running into issues and can't figure out what some of the errors mean.
library(sparklyr)
sc <- spark_connect(master = "local")
perf_df <- data.frame(predicted = c(5, 7, 20),
…
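A sketch of the basic pattern, completing the truncated data frame above with a hypothetical actual column; the function passed to spark_apply runs on the workers, so it must be self-contained:

library(sparklyr)
sc <- spark_connect(master = "local")
perf_df <- data.frame(predicted = c(5, 7, 20),
                      actual    = c(4, 6, 40))   # hypothetical values
perf_tbl <- sdf_copy_to(sc, perf_df, overwrite = TRUE)
# Everything the closure needs must be defined inside it (or shipped
# via the `context` argument); it cannot see the local environment.
rmse_tbl <- spark_apply(perf_tbl, function(df) {
  data.frame(rmse = sqrt(mean((df$predicted - df$actual)^2)))
})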
I want to calculate the correlation matrix of a Spark table in R. I tried using cor() as in R, but it does not work; here is the code:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc,…
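cor() only works on local data frames, not on a remote tbl_spark. A sketch using ml_corr(), which newer sparklyr releases provide for computing the correlation matrix inside Spark (the flights data is assumed from the usual sparklyr guide):

library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
# Pearson correlation matrix computed by Spark, not in local R
flights_tbl %>%
  select(dep_delay, arr_delay, distance) %>%
  na.omit() %>%
  ml_corr()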
Context: I have a large table with logon times. I want to calculate a rolling count of logons within a specified period (e.g. 3600 sec).
In SQL/HQL I would specify this as:
SELECT id, logon_time, COUNT(*) OVER(
PARTITION BY id ORDER BY logon_time…
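Spark SQL supports the same window clause, so one sketch is to send the query through DBI unchanged; RANGE frames need a numeric ORDER BY column, hence the unix_timestamp() cast (the table name logons is a placeholder):

library(sparklyr)
library(DBI)
sc <- spark_connect(master = "local")
dbGetQuery(sc, "
  SELECT id, logon_time,
         COUNT(*) OVER (
           PARTITION BY id
           ORDER BY unix_timestamp(logon_time)
           RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
         ) AS logon_count
  FROM logons
")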
I am using RStudio. I installed a local version of Spark, ran a few things, and was quite happy. Now I am trying to read my actual data from a cluster, using RStudio Server and a standalone version of Spark. The data is in Cassandra, and I do not know how to…
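A sketch of reading Cassandra through the DataStax spark-cassandra-connector, loaded as a Spark package at connect time; the connector version, host, master URL, keyspace and table are all assumptions to adapt:

library(sparklyr)
conf <- spark_config()
# Connector coordinates must match your Spark/Scala build
conf$sparklyr.defaultPackages <-
  "com.datastax.spark:spark-cassandra-connector_2.11:2.0.5"
conf$spark.cassandra.connection.host <- "cassandra-host.example.com"
sc <- spark_connect(master = "spark://master:7077", config = conf)
# Expose a Cassandra table as a Spark DataFrame
logs_tbl <- spark_read_source(
  sc, name = "logs",
  source  = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "my_keyspace", table = "my_table")
)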
I'm trying to follow the simple guide on sparklyr, but it throws errors right at the very beginning. I install sparklyr and a local version of Spark as described in the guide:
library("sparklyr")
spark_install(version = "1.6.2")
I then open a…
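Before going further with the guide, a sketch for verifying the install step in isolation (same version as above):

library(sparklyr)
spark_install(version = "1.6.2")
spark_installed_versions()  # confirm 1.6.2 actually landed on disk
sc <- spark_connect(master = "local", version = "1.6.2")
spark_version(sc)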
I have tried this combination without Cloudera, but failed.
With Cloudera, I tried following the tutorial "sparklyr: a test drive on YARN".
I wonder if anyone has had success without the need to…
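For reference, a sketch of the usual CDH/YARN connection, assuming the stock Cloudera parcel layout (both paths are assumptions for a given cluster):

library(sparklyr)
# Typical CDH parcel locations; adjust for the cluster at hand
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
sc <- spark_connect(master = "yarn-client",
                    spark_home = Sys.getenv("SPARK_HOME"))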
I'm trying to execute this RSparkling example on an offline CDH-5.10.2 cluster. My environment is:
Spark 1.6.0;
sparklyr 0.6.2;
h2o 3.10.5.2;
rsparkling 0.2.1.
I use a custom Sparkling Water JAR, which is basically 1.6.12 with this PR…
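For an offline cluster, rsparkling can be pointed at a local Sparkling Water assembly instead of downloading one; a sketch using rsparkling's documented options (the JAR path is a placeholder):

library(sparklyr)
# Set before loading rsparkling so it skips the download step
options(rsparkling.sparklingwater.version = "1.6.12")
options(rsparkling.sparklingwater.location =
          "/path/to/sparkling-water-assembly_2.10-1.6.12-all.jar")
library(rsparkling)
sc <- spark_connect(master = "yarn-client", version = "1.6.0")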