Questions tagged [sparklyr]

sparklyr is an alternative R interface for Apache Spark

sparklyr provides an alternative to the SparkR interface for Apache Spark, built on top of dplyr.

784 questions
4
votes
1 answer

Transfer data from database to Spark using sparklyr

I have some data in a database, and I want to work with it in Spark, using sparklyr. I can use a DBI-based package to import the data from the database into R: dbconn <- dbConnect(); data_in_r <- dbReadTable(dbconn, "a table")…
Richie Cotton
  • 118,240
  • 47
  • 247
  • 360
4
votes
1 answer

Install Spark on Windows for sparklyr

I have tried several tutorials on setting up Spark and Hadoop in a Windows environment, especially alongside R. This one resulted in this error by the time I hit figure 9: This tutorial from RStudio is giving me issues as well. When I get to…
d8aninja
  • 3,233
  • 4
  • 36
  • 60
4
votes
1 answer

sparklyr: skip first lines of text files

I would like to skip (drop) the first two lines of a text file; to the best of my knowledge, this is not possible with the sparklyr method spark_read_csv. Is there a workaround for this simple problem? I know of the existence of sparklyr…
enneppi
  • 1,029
  • 2
  • 15
  • 33
4
votes
1 answer

Can't read csv into Spark using spark_read_csv()

I'm trying to use sparklyr to read a csv file into R. I can read the .csv into R just fine using read.csv(), but when I try to use spark_read_csv() it breaks down. accidents <- spark_read_csv(sc, name = 'accidents', path =…
Raphael K
  • 2,265
  • 1
  • 16
  • 23
3
votes
0 answers

How to display Sparklyr table in a clean readable format similar to the output of display() in Databricks?

Databricks has a built-in display() function (see documentation here) which allows users to display an R or SparkR dataframe in a clean, human-readable manner, where the user can scroll to see all the columns and sort on them.…
SG_
  • 69
  • 1
  • 6
3
votes
0 answers

How do you use sdf_checkpoint to break spark table lineage in sparklyr?

I'm attempting to manipulate a Spark RDD via sparklyr with a dplyr mutate command to construct a large number of variables, and each time this seems to fail with an error message regarding Java memory exceeding 64 bits. The mutate command is coded…
3
votes
2 answers

Databricks Delta Table Merge statement using R

I have recently started working on Databricks and I have been trying to find a way to perform a merge statement on a Delta table using an R API (preferably sparklyr). The ultimate purpose is to somehow impose a 'duplicate' constraint as…
takmers
  • 71
  • 1
  • 5
3
votes
0 answers

sparklyr connecting to kafka streams/topics

I'm having difficulty connecting to and retrieving data from a Kafka instance. Using Python's kafka-python module, I can connect (using the same connection parameters), see the topic, and retrieve data, so the network is viable, there is no…
r2evans
  • 141,215
  • 6
  • 77
  • 149
3
votes
0 answers

Apply function after groupby in Sparklyr

I have a sparklyr dataframe roughly like this:

full_df %>% head()
absolute_time        file_number  battery_soc  ...
2020-09-01 04:57:45  7            99.0         ...
2020-09-01 04:57:47  7            98.0         ...
2020-09-01 04:58:31  7            95.0         ...
2020-09-01…
Laurent
  • 1,914
  • 2
  • 11
  • 25
3
votes
4 answers

select all columns after a designated column using R

How can I select all columns after a designated column using R (ideally dplyr only, but non-dplyr solutions are welcome)? For example, in the dataframe mtcars, I want to grab all columns after vs, which would be am, gear, and carb. But I want a function…
Cyrus Mohammadian
  • 4,982
  • 6
  • 33
  • 62
3
votes
1 answer

Sparklyr connection error: Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond

I am having the following issue while connecting to sparklyr: sc <- spark_connect(master = "local") * Using Spark: 2.4.3 Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond. Try…
Aashiq Reza
  • 153
  • 2
  • 13
3
votes
1 answer

How do I get the word-embedding matrix from ft_word2vec (sparklyr-package)?

I have another question in the word2vec universe. I am using the 'sparklyr'-package. Within this package I call the ft_word2vec() function. I have some trouble understanding the output: For each number of sentences/paragraphs I am providing to the…
3
votes
1 answer

How to extract the first n rows per group from a Spark data frame using recent versions of dplyr (1.0), sparklyr (1.4) and SPARK (3.0) / Hadoop (2.7)?

My attempts with top_n() and slice_head() both failed with errors. An issue with top_n() was reported in https://github.com/tidyverse/dplyr/issues/4467 and closed by Hadley with the comment: This will be resolved by #4687 + tidyverse/dbplyr#394…
jimbod119
  • 2,871
  • 1
  • 6
  • 11
3
votes
2 answers

Complete dataframe in sparklyr

I am trying to replicate the tidyr::complete function in sparklyr. I have a dataframe with some missing values and I have to fill in those rows. In dplyr/tidyr I can do: data <- tibble( "id" = c(1,1,2,2), "dates" = c("2020-01-01", "2020-01-03",…
Marco De Virgilis
  • 982
  • 1
  • 9
  • 29
3
votes
0 answers

Sparklyr - fail to connect to "local"

When trying to connect to spark using sparklyr, I get the following error message: 'Error in spark_connect_gateway(gatewayAddress, gatewayPort, sessionId, : Gateway in localhost:8880 did not respond.' There is no other info displayed in…
Treaver
  • 31
  • 1