I need to write a data frame to a single CSV file, and found out that I can use sdf_coalesce() to collapse the data frame into a single partition. I want to find out if there's any way I can change the name of the csv file generated by…
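A common workaround, sketched below: Spark always writes a directory of part-* files, so after coalescing to one partition you can locate the single part file from R and rename it. Paths and the target file name here are assumptions.

```r
library(sparklyr)
library(dplyr)

# Sketch, assuming `sc` is an open Spark connection and `df` is a
# Spark data frame. Coalesce to one partition so the output directory
# contains exactly one part file.
df %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv(path = "out_dir", mode = "overwrite")

# Find the lone part file and rename it (hypothetical target name).
part_file <- dir("out_dir", pattern = "^part-", full.names = TRUE)
file.rename(part_file, "my_result.csv")
```

This only works cleanly for local or mounted file systems; on HDFS or S3 the rename would have to go through the corresponding file-system API instead of file.rename().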
Assuming sc is an existing spark(lyr) connection, the names given in dplyr::mutate() are ignored:
iris_tbl <- sdf_copy_to(sc, iris)
iris_tbl %>%
  spark_apply(function(e) {
    library(dplyr)
    e %>% mutate(slm = median(Sepal_Length))
  })
##…
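A sketch of one way around this: spark_apply() derives the output schema itself, which is why the mutate() name is dropped; its `columns` argument lets you state the output column names explicitly. The column list below assumes the default underscore renaming of the iris columns.

```r
# Sketch: name the output columns of spark_apply() explicitly via
# `columns`, so the added column keeps the name `slm`.
iris_tbl %>%
  spark_apply(
    function(e) {
      e$slm <- stats::median(e$Sepal_Length)
      e
    },
    columns = c("Sepal_Length", "Sepal_Width",
                "Petal_Length", "Petal_Width",
                "Species", "slm")
  )
```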
I need to calculate the distance between two strings in R using sparklyr. Is there a way of using stringdist or any other package? I wanted to use cosine distance, which is one of the methods of the stringdist function.
Thanks in advance.
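One possible approach, as a sketch: spark_apply() runs ordinary R code on each partition, so stringdist can be used directly, provided the package is installed on every worker node. The table and column names here are assumptions.

```r
# Sketch, assuming `strings_tbl` is a Spark data frame with two string
# columns s1 and s2, and stringdist is installed on all workers.
result <- strings_tbl %>%
  spark_apply(function(e) {
    library(stringdist)
    # Row-wise cosine string distance between the two columns.
    e$dist <- stringdist(e$s1, e$s2, method = "cosine")
    e
  })
```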
I am trying to split column values separated by commas into new rows based on ids. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr.
id <- c(1,1,1,1,1,2,2,2,3,3,3)
name <-…
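A sketch of one sparklyr-side approach: Hive/Spark SQL functions that dplyr does not recognize are passed through to Spark unchanged, so split() plus explode() can unnest the comma-separated values without collecting the data. The sample data here is hypothetical.

```r
# Sketch, assuming `sc` is an open Spark connection. split() and
# explode() are Spark SQL functions passed through by dplyr.
df_tbl <- sdf_copy_to(sc, data.frame(id   = c(1, 2),
                                     name = c("a,b,c", "d,e")))
df_tbl %>%
  mutate(name = explode(split(name, ",")))
```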
I have 100 million rows stored in many .csv files in a distributed file system. I'm using spark_read_csv() to load the data without issue. Many of my columns are stored as character logical values: "true", "false", "". I do not have control…
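One option, sketched below under the assumption that the conversion can happen after loading: since spark_read_csv() keeps the column as a string, a mutate() comparison turns "true"/"false" into booleans, with empty strings mapped to NA. The table and column names are hypothetical.

```r
# Sketch, assuming `flags_tbl` is a Spark data frame with a character
# column `flag` holding "true", "false", or "".
flags_tbl %>%
  mutate(flag = ifelse(flag == "", NA, flag == "true"))
```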
Sample data:
Si K Ca Ba Fe Type
71.78 0.06 8.75 0 0 1
72.73 0.48 7.83 0 0 1
72.99 0.39 7.78 0 0 1
72.61 0.57 na 0 0 na
73.08 0.55 8.07 0 0 1
72.97 0.64 8.07 0…
In Spark 2.0 I can combine several file paths into a single load (see e.g. How to import multiple csv files in a single load?).
How can I achieve this with sparklyr's spark_read_csv()?
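A sketch of one way: the `path` argument of spark_read_csv() is handed to Spark, which accepts a directory or a glob pattern, so many files can be loaded in one call. The path and table name below are assumptions.

```r
# Sketch, assuming `sc` is an open Spark connection: a glob pattern in
# `path` loads every matching CSV file into one Spark data frame.
all_csv <- spark_read_csv(sc,
                          name = "all_data",
                          path = "hdfs:///data/2017/*.csv")
```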
I've brought a table into Hue which has a column of dates, and I'm trying to play with it using sparklyr in RStudio.
I'd like to convert a character column into a date column like so:
Weather_data = mutate(Weather_data, date2 = as.Date(date,…
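A possible sketch: as.Date() may not translate to Spark SQL, but Hive's to_date() passes straight through dplyr::mutate(), assuming the strings are in a format Hive recognizes (e.g. yyyy-MM-dd).

```r
# Sketch: to_date() is a Hive/Spark SQL function passed through
# unchanged by dplyr, so the conversion happens inside Spark.
Weather_data <- Weather_data %>%
  mutate(date2 = to_date(date))
```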
After installing the sparklyr package I followed the instructions here ( http://spark.rstudio.com/ ) to connect to Spark, but ran into this error. Am I doing something wrong? Please help me.
sc <- spark_connect(master = "local")
Error in file(con,…
Using the sparklyr package on a Hadoop cluster (not a VM), I'm working with several types of tables that need to be joined, filtered, etc... and I'm trying to determine what would be the most efficient way to use the dplyr commands along with the…
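One general pattern worth sketching here: when the same intermediate result of a join/filter pipeline is reused several times, materializing it with compute() caches it in Spark memory instead of re-running the whole dplyr pipeline on each downstream query. Table and column names below are hypothetical.

```r
# Sketch: compute() registers and caches the intermediate result so
# later queries reuse it rather than recomputing the join.
joined <- big_tbl %>%
  left_join(lookup_tbl, by = "id") %>%
  filter(!is.na(value)) %>%
  compute("joined_cached")
```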
Consider the following example
dataframe_test<- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z'))
# A tibble: 2 x 1
mydate
1 2011-03-01T00:00:04.226Z
2…
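A sketch of one way to handle such ISO-8601 strings inside Spark rather than in R: from Spark 2.2 on, to_timestamp() is available as a Spark SQL function and passes through mutate(); this is an assumption about the Spark version in use.

```r
# Sketch, assuming `sc` is an open Spark connection and Spark >= 2.2,
# where to_timestamp() exists as a SQL function.
spark_df <- sdf_copy_to(sc, dataframe_test)
spark_df %>%
  mutate(ts = to_timestamp(mydate))
```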
I installed Apache Hive locally and was trying to read tables via RStudio/sparklyr.
I created a database using Hive:
hive> CREATE DATABASE test;
and I was trying to read that database using the following R…
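Assuming the sparklyr connection is backed by the same Hive metastore, one sketch for selecting the database and listing its tables:

```r
# Sketch: switch the active database on the connection, then query it.
library(DBI)
tbl_change_db(sc, "test")          # switch to the `test` database
dbGetQuery(sc, "SHOW TABLES")
my_tbl <- tbl(sc, "some_table")    # hypothetical table name
```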
I was looking for a way to make spark_write_csv() upload only a single file to S3, because I want to save the regression result on S3. I was wondering if `options` has a parameter which defines the number of partitions. I could not find it anywhere in…
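As far as I know spark_write_csv() itself takes no partition-count option, but a sketch of the usual workaround is to repartition to a single partition before writing, which yields one part file in the output directory. The bucket path and data frame name are assumptions.

```r
# Sketch, assuming `fit_results` is a Spark data frame holding the
# regression output: one partition in means one part file out.
fit_results %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv("s3a://my-bucket/regression-results")
```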
I am reading a csv into Spark using sparklyr
schema <- structType(structField("TransTime", "array", TRUE),
                     structField("TransDay", "Date", TRUE))
spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema =…
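Worth noting, with a sketch: structType()/structField() come from SparkR, not sparklyr. In sparklyr the schema is supplied as a named character vector through the `columns` argument of spark_read_csv(); the types below are assumptions about the data.

```r
# Sketch, assuming `sc` is an open Spark connection and `filename`
# points at the CSV. Types given per column via `columns`.
df <- spark_read_csv(
  sc,
  name         = "trans",
  path         = filename,
  infer_schema = FALSE,
  columns      = c(TransTime = "timestamp", TransDay = "date")
)
```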