Is there a way to replicate the rows of a Spark dataframe using sparklyr/dplyr functions?
sc <- spark_connect(master = "spark://####:7077")
df_tbl <- copy_to(sc, data.frame(row1 = 1:3, row2 = LETTERS[1:3]), "df")
This is the desired…
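For reference, one approach that works on Spark is a dummy-key join against a helper table of replication indices; the names times, rep_tbl, dummy and rep_id below are illustrative, and this is only a sketch:

library(dplyr)

times <- 3
rep_tbl <- copy_to(sc, data.frame(rep_id = seq_len(times)), "rep_tbl", overwrite = TRUE)

df_replicated <- df_tbl %>%
  mutate(dummy = 1L) %>%
  inner_join(rep_tbl %>% mutate(dummy = 1L), by = "dummy") %>%  # cross join via constant key
  select(-dummy, -rep_id)                                       # drop the helper columns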
I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.
Here is a simple example that I think should work but doesn't:
library(sparklyr)
d <- data.frame(id=c("1", "1", "2", "2", "1", "2"),
…
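For reference, paste() inside mutate() does not translate to Spark SQL; the usual fix is to aggregate with summarise() using Spark SQL's collect_list() and concat_ws(), which dplyr passes through unchanged. A sketch, with an illustrative column name txt:

d_tbl <- copy_to(sc, d, "d", overwrite = TRUE)

d_tbl %>%
  group_by(id) %>%
  summarise(combined = concat_ws(" ", collect_list(txt)))  # one concatenated string per id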
I'm using Spark 2.0.2 with sparklyr 0.5.4-9004 in RStudio on a Windows server.
Every once in a while, when I try to collect, read, or write data from the Spark server, I get the following error:
Error in UseMethod("invoke") :
…
I have two tables that I want to full-join using dplyr, but I don't want it to drop any of the columns. Per the documentation and my own experience, it keeps only the join column from the left-hand side. This is a problem when you have a row…
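For reference, a common workaround is to duplicate the join key on the right-hand side before joining, so its values survive the join; left_tbl, right_tbl, key and key_copy below are illustrative names, and this is only a sketch:

library(dplyr)

right2 <- right_tbl %>% mutate(key_copy = key)   # keep a copy of the join key

joined <- full_join(left_tbl, right2, by = "key") %>%
  mutate(key = coalesce(key, key_copy))          # recover key values for right-only rows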
I am getting heap space errors even on fairly small datasets, and I can be sure that I'm not running out of system memory. For example, consider a dataset with about 20M rows and 9 columns that takes up 1GB on disk. I am playing with it on a…
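For reference, in local mode the default driver heap is small, so heap errors often go away once driver memory is raised through spark_config(); the "8G" values below are an assumption, so size them to your machine:

library(sparklyr)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "8G"    # assumption: adjust to available RAM
conf$`sparklyr.shell.executor-memory` <- "8G"  # assumption: adjust to available RAM
sc <- spark_connect(master = "local", config = conf)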
I'm trying to connect R to Spark using sparklyr.
I followed the tutorial from the RStudio blog.
I tried installing sparklyr using
install.packages("sparklyr"), which went fine, but in another post I saw that there was a bug in the sparklyr_0.4 version. So I…
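For reference, the development version can be installed from GitHub instead of the CRAN release; a sketch:

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")  # development version, ahead of CRAN's 0.4
library(sparklyr)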
I'm new to Spark and GCP. I've tried to connect to it with
sc <- spark_connect(master = "IP address")
but it obviously doesn't work (e.g. there is no authentication).
How should I do that? Is it possible to connect to it from outside Google Cloud?
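For reference, the Spark master URL is generally not meant to be reached over the open internet; one common pattern is to run R on the cluster's master node (e.g. over SSH) and connect through YARN. A sketch, assuming a Dataproc-style cluster where the SPARK_HOME path below is an assumption:

library(sparklyr)

# run on the cluster's master node, where the Spark/YARN configuration is local
Sys.setenv(SPARK_HOME = "/usr/lib/spark")  # assumption: default path on Dataproc images
sc <- spark_connect(master = "yarn-client")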
Is the sparklyr R package able to connect to YARN-managed Hadoop clusters? This doesn't seem to be documented in the cluster deployment documentation. Using the SparkR package that ships with Spark, it is possible by doing:
# set R environment…
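For reference, sparklyr does support YARN via master = "yarn-client"; a sketch, where both paths are assumptions to be pointed at your Spark install and the directory containing yarn-site.xml:

library(sparklyr)

Sys.setenv(SPARK_HOME = "/usr/lib/spark")         # assumption: your Spark installation
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")  # assumption: your YARN config directory
sc <- spark_connect(master = "yarn-client")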
I would like to extract the last part of the string (after the last forward slash).
When I use the following code it fails with the error:
library(sparklyr)
library(tidyverse)
sc <- spark_connect(method = "databricks")
tibble(my_string =…
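For reference (the failing code above is truncated), one approach that works is Spark SQL's regexp_extract(), which dplyr passes through unchanged; "[^/]+$" matches everything after the last forward slash, and index 0 returns the whole match. A sketch with illustrative data:

strings_tbl <- copy_to(sc, tibble(my_string = c("a/b/c", "x/y")), "strings", overwrite = TRUE)

strings_tbl %>%
  mutate(last_part = regexp_extract(my_string, "[^/]+$", 0))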
I have tried the code below, and combinations of it, to read all the files in an S3 folder, but nothing seems to work. Sensitive information/code has been removed from the script below. There are 6 files, each 6.5 GB.
#Spark…
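For reference, a sketch of one way to read every file in an S3 folder with spark_read_csv(); the bucket, folder, and credential values are placeholders, and this assumes the s3a connector is available on the cluster:

library(sparklyr)

conf <- spark_config()
conf$spark.hadoop.fs.s3a.access.key <- "YOUR_ACCESS_KEY"  # placeholder
conf$spark.hadoop.fs.s3a.secret.key <- "YOUR_SECRET_KEY"  # placeholder
sc <- spark_connect(master = "local", config = conf)

df <- spark_read_csv(sc, name = "s3_data",
                     path = "s3a://my-bucket/my-folder/",  # placeholder; reads all files in the folder
                     memory = FALSE)                       # avoid caching ~39 GB of data in memory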
Note: I have referred to an existing answer, but although the data gets un-nested, I could not convert it into CSV format.
I want to flatten data of different data types by using the explode functionality. The dataset contains arrays and structs. I want…
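For reference, Spark SQL's explode() can be used inside mutate(), since dplyr passes it through unchanged, producing one row per array element; struct columns still need their fields selected out to atomic columns before a CSV export will succeed. A sketch with illustrative names (nested_tbl, arr_col):

library(sparklyr)
library(dplyr)

flat_tbl <- nested_tbl %>%
  mutate(element = explode(arr_col)) %>%  # one row per array element
  select(-arr_col)                        # drop the original array column

spark_write_csv(flat_tbl, path = "path/to/flat_csv")  # placeholder output path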
Spark 2.0 with Hive
Let's say I am trying to write a Spark dataframe, irisDf, to ORC and save it to the Hive metastore.
In Spark (Scala) I would do that like this:
irisDf.write.format("orc")
.mode("overwrite")
.option("path", "s3://my_bucket/iris/")
…
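For reference, a sketch of the sparklyr side: spark_write_orc() writes the ORC files to a path without a metastore entry, while spark_write_table() registers a metastore table and passes options through to the writer (whether the path/format options round-trip exactly may depend on the sparklyr version):

library(sparklyr)

iris_tbl <- copy_to(sc, iris, "iris_tmp")

# writes ORC files to the path, no metastore entry
spark_write_orc(iris_tbl, path = "s3://my_bucket/iris/", mode = "overwrite")

# registers a table in the metastore
spark_write_table(iris_tbl, name = "iris", mode = "overwrite",
                  options = list(path = "s3://my_bucket/iris/"))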
I tried the following to row-bind two Spark dataframes, but it gave an error message:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)
iris_tbl1 <- copy_to(sc, iris, "iris1")
iris_tbl2 =…
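For reference, sdf_bind_rows() is sparklyr's analogue of dplyr::bind_rows() for Spark tables (available from sparklyr 0.6 onward); dplyr::union_all() also works when the two tables have identical columns. A sketch:

iris_all <- sdf_bind_rows(iris_tbl1, iris_tbl2)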
I have looked all over and I'm still unable to get those three dplyr functions to work within sparklyr. I have a reproducible example below. First, some session info:
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under:…