I'm trying to read data into R from HDFS. One thing I'm struggling with when using sparklyr is deciphering the error messages, because I am not a Java programmer.
Consider this example:
# Do this in R
# create abalone dataframe - abalone is a…
I'm facing a problem trying to write two datasets using sparklyr::spark_write_csv(). This is my configuration:
# Configure cluster
config <- spark_config()
config$spark.yarn.keytab <- "mykeytab.keytab"
config$spark.yarn.principal <-…
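For illustration, a minimal sketch of how the two writes might look once the connection is up; the tbl names (tbl_a, tbl_b) and HDFS output paths are hypothetical placeholders:

library(sparklyr)

sc <- spark_connect(master = "yarn-client", config = config)

# tbl_a / tbl_b stand in for the two Spark tbls being written out
spark_write_csv(tbl_a, path = "hdfs:///tmp/output_a", header = TRUE, mode = "overwrite")
spark_write_csv(tbl_b, path = "hdfs:///tmp/output_b", header = TRUE, mode = "overwrite")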
I am trying to load a dataset with a million rows and 1000 columns with sparklyr.
I am running Spark on a very big cluster at work, but the data still seems to be too large to handle. I have tried two different approaches:
This is the dataset:…
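For context, here are two approaches commonly tried for data of this shape; these are generic sketches, not necessarily the two from the question, and the file path and table name are hypothetical:

library(sparklyr)
sc <- spark_connect(master = "yarn-client")

# Approach 1: let Spark read the file directly from HDFS,
# without caching it in memory up front
wide_tbl <- spark_read_csv(sc, name = "wide_data",
                           path = "hdfs:///data/wide.csv",
                           memory = FALSE, infer_schema = TRUE)

# Approach 2: read into R first and copy the data frame to Spark
# (usually the slower option for data this size)
local_df <- read.csv("wide.csv")
wide_tbl <- copy_to(sc, local_df, name = "wide_data", overwrite = TRUE)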
The following code calculates a set of regression coefficients for each of three dependent variables regressed on the set of six independent variables, for each of two groups, and it works fine.
library(tidyverse)
library(broom)
n <- 20
df4 <-…
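A minimal sketch of that kind of per-group, per-outcome regression with broom; the column names (grp, y1:y3, x1:x6) are hypothetical stand-ins for the real columns in df4:

library(tidyverse)
library(broom)

coefs <- df4 %>%
  pivot_longer(c(y1, y2, y3), names_to = "dv", values_to = "y") %>%   # one row per outcome value
  nest_by(grp, dv) %>%                                                # one data set per group x outcome
  mutate(fit = list(lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = data))) %>%
  summarise(tidy(fit), .groups = "drop")                              # coefficient table per fit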
I'm working with some tables that I want to join; because of their size I use sparklyr with dplyr's left_join.
Here is the code sample:
query.1 <- left_join(pa11, pa12, by = c("CODIGO_HAB_D","ID_EST","ID_ME","ID_PARTE_D","ID_PAR",…
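As a general pattern, a hedged sketch of the same kind of join between two Spark tbls, with compute() added to force execution and cache the result; pa11 and pa12 are assumed to already be tbl_spark references, and only the join keys visible in the snippet are used (the rest are truncated above):

library(sparklyr)
library(dplyr)

query.1 <- left_join(pa11, pa12,
                     by = c("CODIGO_HAB_D", "ID_EST", "ID_ME", "ID_PARTE_D", "ID_PAR")) %>%
  compute("query_1")   # materialise the joined table in Spark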
I am currently working in RStudio on a RHEL cluster.
I use Spark 2.0.2 with a YARN client and have installed the following versions of sparklyr and dplyr:
sparklyr_0.5.4;
dplyr_0.5.0
A simple test on the following lines results in an error:
data =…
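The exact test lines are truncated; a typical minimal test under that setup would be a connect plus a copy_to round trip, roughly like this sketch:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client", version = "2.0.2")

# copy a small built-in data set to Spark and read a summary back
data_tbl <- copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
data_tbl %>% count() %>% collect()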
I am attempting to use the sparklyr package to connect to an existing MS SQL database to query data faster than is possible with the RODBC package. Currently, I am able to successfully query the database using RODBC::odbcConnect() and…
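One hedged route for this is Spark's JDBC data source via spark_read_jdbc(); the jar path, connection URL, table name, and credentials below are all assumptions to adapt:

library(sparklyr)

config <- spark_config()
# path to the Microsoft SQL Server JDBC driver jar (placeholder path)
config$sparklyr.jars.default <- "/path/to/mssql-jdbc.jar"

sc <- spark_connect(master = "local", config = config)

tbl_sql <- spark_read_jdbc(
  sc, name = "my_table",
  options = list(
    url      = "jdbc:sqlserver://myserver:1433;databaseName=mydb",
    dbtable  = "dbo.my_table",
    user     = "me",
    password = "secret",
    driver   = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  )
)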
I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
and a handle named xy_df that is connected to this table.
I want to invoke the selectExpr function…
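A hedged sketch of calling selectExpr through sparklyr's invoke() interface on the underlying Java DataFrame; the expressions and the registered table name are illustrative only:

library(sparklyr)
library(dplyr)

result <- xy_df %>%
  spark_dataframe() %>%                                      # underlying Spark DataFrame (jobj)
  invoke("selectExpr", list("y", "x0 + x1 AS x_sum")) %>%    # arbitrary SQL expressions
  sdf_register("xy_selected")                                # back to a dplyr-compatible tbl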
I am using Spark from R, via the sparklyr package, to run a regression on a huge dataset (>500 million obs), but I want a weighted regression and I can't seem to find the correct syntax / function to do that.
Currently I am doing…
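One hedged option is the weight column argument on sparklyr's ML regression functions (available in recent sparklyr versions); the table, formula, and weight column names below are assumptions:

library(sparklyr)

# big_tbl is assumed to be a tbl_spark with outcome y, predictors x1..x3,
# and a column w holding the observation weights
fit <- ml_linear_regression(big_tbl, y ~ x1 + x2 + x3, weight_col = "w")
summary(fit)

# the generalized linear interface also accepts a weight column
fit_glm <- ml_generalized_linear_regression(big_tbl, y ~ x1 + x2 + x3,
                                            family = "gaussian", weight_col = "w")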
I have this Spark table:
xydata
y: num 11.00 22.00 33.00 ...
x0: num 1.00 2.00 3.00 ...
x1: num 2.00 3.00 4.00 ...
...
x788: num 2.00 3.00 4.00 ...
And this data frame in the R environment:
penalty
p: num 1.23 2.34 3.45 ...
with the number of rows in…
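As a first step, a minimal hedged sketch for making the local penalty values available on the Spark side next to xydata, assuming an existing connection sc:

library(sparklyr)
library(dplyr)

# copy the local penalty data frame into the Spark session as its own table
penalty_tbl <- copy_to(sc, penalty, name = "penalty", overwrite = TRUE)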
I'm trying to read a CSV file into RStudio with the sparklyr package on a Google Compute Engine cluster. This is the configuration:
# Test Spark framework
install.packages("sparklyr")
install.packages("dplyr")
library(sparklyr)
spark_install(version =…
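Once the install and connection succeed, the read itself would look roughly like this sketch; the master, path, and table name are placeholders (on Dataproc-style clusters a gs:// path also works through the GCS connector):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

csv_tbl <- spark_read_csv(sc, name = "my_data",
                          path = "hdfs:///user/me/data.csv",
                          header = TRUE, infer_schema = TRUE)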
I have a Spark dataframe TABLE1 with one column and 100000 rows, each containing a string of identical length:
AA105LONDEN 03162017045262017 16953563ABCDEF
and I would like to separate each row into multiple columns based on the lines…
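A hedged sketch of one way to do the split on the Spark side, using substr() inside mutate() (which dbplyr translates to Spark SQL's substring); the source column name (value) and the character positions are illustrative and should be adjusted to the real fixed-width layout:

library(sparklyr)
library(dplyr)

TABLE1 %>%
  mutate(
    code   = substr(value, 1, 11),    # e.g. "AA105LONDEN"
    dates  = substr(value, 13, 29),   # middle block of digits
    suffix = substr(value, 31, 44)    # trailing block, e.g. "16953563ABCDEF"
  )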