Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

sparklyr provides an alternative to the SparkR interface for Apache Spark, built on top of dplyr.


784 questions
4
votes
1 answer

Is there a way to set a name to a csv file in sparklyr using spark_write_csv?

I need to write a data frame to a single csv file, and found out that I can use sdf_coalesce() to coalesce the data into a single partition. I want to find out if there's any way I can change the name of the csv file generated by…
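
A minimal sketch of one common workaround, assuming a local output path and mtcars as stand-in data: Spark always writes a directory of part files, so the single part file is renamed afterwards from R (the target file name below is hypothetical).

library(sparklyr)
library(dplyr)

sc  <- spark_connect(master = "local")
tbl <- copy_to(sc, mtcars, overwrite = TRUE)

out_dir <- tempfile("single_csv_")            # Spark writes a directory, not a file
tbl %>%
  sdf_coalesce(partitions = 1) %>%            # single partition -> single part file
  spark_write_csv(out_dir, header = TRUE)

part <- list.files(out_dir, pattern = "^part-.*csv$", full.names = TRUE)
file.rename(part, file.path(dirname(out_dir), "my_result.csv"))   # hypothetical name
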
4
votes
1 answer

colnames in `sparklyr::spark_apply()` using `dplyr::mutate()`

Assuming sc is an existing spark(lyr) connection, the names given in dplyr::mutate() are ignored: iris_tbl <- sdf_copy_to(sc, iris) iris_tbl %>% spark_apply(function(e){ library(dplyr) e %>% mutate(slm = median(Sepal_Length)) }) ##…
nachti
  • 1,086
  • 7
  • 20
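
A minimal sketch of one workaround, assuming a reasonably recent sparklyr: supply the output column names explicitly through spark_apply()'s columns argument instead of relying on the names assigned inside mutate().

library(sparklyr)
library(dplyr)

iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

iris_tbl %>%
  spark_apply(
    function(e) {
      e$slm <- stats::median(e$Sepal_Length)   # same computation, base-R style
      e
    },
    columns = c(colnames(iris_tbl), "slm")     # column names supplied explicitly
  )
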
4
votes
1 answer

How to calculate distance between strings using sparklyr?

I need to calculate the distance between two strings in R using sparklyr. Is there a way of using stringdist or any other package? I wanted to use cosine distance, which is available as a method of the stringdist function. Thanks in advance.
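
Two hedged options, assuming a Spark table strings_tbl with columns s1 and s2 (hypothetical names): Spark SQL's built-in levenshtein() passes straight through dplyr, while other stringdist metrics such as cosine can run per-partition via spark_apply(), provided the stringdist package is installed on the workers.

# Levenshtein distance via the Spark SQL built-in (passed through untranslated):
strings_tbl %>% mutate(lev = levenshtein(s1, s2))

# Cosine distance via stringdist inside spark_apply():
strings_tbl %>%
  spark_apply(
    function(e) {
      e$dist <- stringdist::stringdist(e$s1, e$s2, method = "cosine")
      e
    },
    columns = c(colnames(strings_tbl), "dist")
  )
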
4
votes
1 answer

Unnest (separate) multiple column values into new rows using Sparklyr

I am trying to split column values separated by a comma (,) into new rows based on IDs. I know how to do this in R using dplyr and tidyr, but I am looking to solve the same problem in sparklyr. id <- c(1,1,1,1,1,2,2,2,3,3,3) name <-…
Rushabh Patel
  • 2,672
  • 13
  • 34
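
A minimal sketch, assuming the comma-separated values live in a column called name: Spark SQL's split() and explode() are not translated by dplyr, so they pass through and generate one output row per value.

library(sparklyr)
library(dplyr)

df  <- data.frame(id = c(1, 2), name = c("a,b,c", "d,e"))
sdf <- copy_to(sc, df, overwrite = TRUE)

sdf %>%
  mutate(name = explode(split(name, ","))) %>%   # one row per split element
  arrange(id)
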
4
votes
1 answer

Convert a string to logical in R with sparklyr

I have 100 million rows stored in many .csv files in a distributed file system. I'm using spark_read_csv() to load the data without issue. Many of my columns are stored as character logical values: "true", "false", "". I do not have control…
kputschko
  • 766
  • 1
  • 7
  • 21
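
A minimal sketch, assuming tbl is the Spark table and flag the character column: as.logical() is translated to CAST(... AS BOOLEAN), which should map "true"/"false" to TRUE/FALSE and the empty string to NULL (worth verifying on a sample of the real data).

tbl %>%
  mutate(flag = as.logical(flag))   # CAST(flag AS BOOLEAN) on the Spark side

# A more explicit equivalent that does not rely on the cast rules:
tbl %>%
  mutate(flag = ifelse(flag == "true", TRUE,
                ifelse(flag == "false", FALSE, NA)))
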
4
votes
1 answer

How to find columns having missing data in sparklyr

Example sample data:
Si     K    Ca   Ba Fe Type
71.78  0.06 8.75 0  0  1
72.73  0.48 7.83 0  0  1
72.99  0.39 7.78 0  0  1
72.61  0.57 na   0  0  na
73.08  0.55 8.07 0  0  1
72.97  0.64 8.07 0…
vijaynadal
  • 55
  • 5
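
A minimal sketch, assuming glass_tbl is the Spark table (hypothetical name) and that the missing entries are actual NULLs/NAs rather than the literal string "na": count NULLs per column and collect the one-row summary.

library(dplyr)

na_counts <- glass_tbl %>%
  summarise_all(~ sum(as.integer(is.na(.)), na.rm = TRUE)) %>%   # NULL count per column
  collect()

# Columns with at least one missing value:
names(na_counts)[unlist(na_counts) > 0]
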
4
votes
2 answers

Reading files from multiple sub folders in sparklyr

In Spark 2.0 I can combine several file paths into a single load (see e.g. How to import multiple csv files in a single load?). How can I achieve this with sparklyr's spark_read_csv?
Deepdelusion
  • 121
  • 1
  • 6
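
A minimal sketch, assuming the files share a schema and sit under sub-folders of a data/ directory (hypothetical layout): the path handed to spark_read_csv() goes to Spark's reader, which understands glob patterns.

library(sparklyr)

all_csv <- spark_read_csv(
  sc,
  name   = "all_csv",
  path   = "data/*/*.csv",   # hypothetical layout: data/<sub-folder>/<file>.csv
  header = TRUE
)
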
4
votes
1 answer

How to limit the number of lines read from a parquet file in sparklyr

I have a huge parquet file that doesn't fit in memory or on disk when read. Is there a way to use spark_read_parquet to read only the first n lines?
Jader Martins
  • 759
  • 6
  • 26
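
A minimal sketch: spark_read_parquet() has no row-limit argument, but mapping the file lazily with memory = FALSE and taking head(n) keeps more than the first n rows from ever being collected (the path and n below are placeholders).

library(sparklyr)
library(dplyr)

big_tbl <- spark_read_parquet(
  sc,
  name   = "big",
  path   = "path/to/huge.parquet",   # placeholder path
  memory = FALSE                     # do not cache the whole table
)

first_rows <- big_tbl %>%
  head(1000) %>%    # translated to LIMIT 1000 on the Spark side
  collect()
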
4
votes
1 answer

Converting string/chr to date using sparklyr

I've brought a table into Hue which has a column of dates and I'm trying to play with it using sparklyr in RStudio. I'd like to convert a character column into a date column like so: Weather_data = mutate(Weather_data, date2 = as.Date(date,…
Keith
  • 103
  • 1
  • 9
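
A minimal sketch, assuming the strings are ISO-formatted (yyyy-MM-dd): Spark SQL's to_date() passes through dplyr untranslated, so it can take the place of as.Date() inside mutate().

library(dplyr)

Weather_data <- Weather_data %>%
  mutate(date2 = to_date(date))   # Spark SQL to_date(), not R's as.Date()
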
4
votes
2 answers

Connecting to Spark with Sparklyr gives Permission Denied Error

After installing the sparklyr package I followed the instructions here (http://spark.rstudio.com/) to connect to Spark, but I am faced with this error. Am I doing something wrong? Please help me. sc = spark_connect(master = 'local') Error in file(con,…
boral
  • 131
  • 9
4
votes
1 answer

What is the most efficient way to create new Spark Tables or Data Frames in Sparklyr?

Using the sparklyr package on a Hadoop cluster (not a VM), I'm working with several types of tables that need to be joined, filtered, etc... and I'm trying to determine what would be the most efficient way to use the dplyr commands along with the…
quickreaction
  • 675
  • 5
  • 17
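
One pattern worth sketching (a judgment call, not the single most efficient answer): keep the joins and filters lazy in dplyr, then materialize and cache the intermediate result on the cluster with compute() instead of collecting it into R. tbl_a and tbl_b below are hypothetical Spark tables.

library(dplyr)

joined <- tbl_a %>%
  inner_join(tbl_b, by = "id") %>%
  filter(value > 0)

# Materialize as a cached Spark table so later queries reuse it:
joined_tbl <- compute(joined, name = "joined_cached")
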
4
votes
2 answers

Sparklyr/Hive: how to use regex (regexp_replace) correctly?

Consider the following example: dataframe_test <- data_frame(mydate = c('2011-03-01T00:00:04.226Z', '2011-03-01T00:00:04.226Z')) # A tibble: 2 x 1 mydate 1 2011-03-01T00:00:04.226Z 2…
ℕʘʘḆḽḘ
  • 18,566
  • 34
  • 128
  • 235
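
A minimal sketch for the timestamp strings in the question: regexp_replace() is a Hive/Spark SQL function, so it passes through mutate() as-is; keep in mind that regex backslashes have to be doubled in R strings (e.g. "\\d").

library(dplyr)

dataframe_test %>%
  mutate(mydate_clean = regexp_replace(mydate, "[TZ]", " "))   # strip the T and Z markers
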
4
votes
1 answer

sparklyr can't see databases created in Hive and vice versa

I installed Apache Hive locally and I was trying to read tables via RStudio/sparklyr. I created a database using Hive: hive> CREATE DATABASE test; and I was trying to read that database using the following R…
stochazesthai
  • 617
  • 1
  • 7
  • 20
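
A hedged sketch of the usual direction of a fix: Hive and sparklyr need to agree on the metastore, for example by copying hive-site.xml into Spark's conf/ directory or by pointing the session at the Hive warehouse. The paths below are assumptions, not values from the question.

library(sparklyr)

conf <- spark_config()
conf$spark.sql.warehouse.dir <- "/user/hive/warehouse"   # assumed Hive warehouse location

sc <- spark_connect(master = "local", config = conf)

DBI::dbGetQuery(sc, "SHOW DATABASES")   # should now list the databases created in Hive
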
4
votes
1 answer

What is the options parameter of the sparklyr function spark_write_csv?

I was looking for a way to make spark_write_csv upload only a single file to S3, because I want to save the regression result on S3. I was wondering if options has some parameter which defines the number of partitions. I could not find it anywhere in…
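
For what it's worth, the number of output files is controlled by the partitioning of the table rather than by options; a minimal sketch, with results_tbl and the S3 path as placeholders:

library(sparklyr)
library(dplyr)

results_tbl %>%
  sdf_repartition(partitions = 1) %>%               # one partition -> one part file
  spark_write_csv("s3a://my-bucket/regression/")    # placeholder bucket/path
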
4
votes
2 answers

Specifying column types in sparklyr (spark_read_csv)

I am reading a csv into Spark using sparklyr: schema <- structType(structField("TransTime", "array", TRUE), structField("TransDay", "Date", TRUE)) spark_read_csv(sc, filename, "path", infer_schema = FALSE, schema =…
Levi Brackman
  • 325
  • 2
  • 17
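
structType()/structField() come from SparkR, not sparklyr. A minimal sketch of the sparklyr way, using the columns argument with a named vector of types; the column names and "path" are taken from the question, while the type names are assumptions.

library(sparklyr)

trans_tbl <- spark_read_csv(
  sc,
  name         = "trans",
  path         = "path",                      # path from the question
  infer_schema = FALSE,
  columns      = c(TransTime = "character",   # assumed type mapping
                   TransDay  = "date")
)
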