Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive analysis as well as iterative algorithms in machine learning or graph computing.
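
For example, an RDD loaded once and cached can be queried repeatedly from memory; a minimal sketch (the file path and app name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    // Load once, keep the partitions in cluster memory after the first action
    val logs = sc.textFile("hdfs:///data/app.log").cache()

    val total  = logs.count()                               // materialises the cache
    val errors = logs.filter(_.contains("ERROR")).count()   // answered from memory
    val warns  = logs.filter(_.contains("WARN")).count()    // answered from memory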

Spark can also tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since 2.3), SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
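
As an illustration, a micro-batch Structured Streaming job uses the same DataFrame API as a batch job; a minimal sketch based on the word-count example from the Spark documentation (host and port are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[2]").appName("streaming-demo").getOrCreate()
    import spark.implicits._

    // Read lines from a socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and count them, updated every micro-batch
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Emit the running counts to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()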

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (a.k.a. MVCE) and, when applicable, specify the Spark version you're using, since behavior often differs between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
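
A reproducible example typically builds its input inline rather than referring to private data, and states the expected output; a hypothetical sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("mvce").getOrCreate()
    import spark.implicits._

    // Small inline input that anyone can run
    val df = Seq((1, "a", 10.0), (2, "b", 20.0), (3, "a", 30.0)).toDF("id", "key", "value")

    // The transformation the question is about
    val result = df.groupBy("key").sum("value")

    result.show()
    // Expected output (row order may vary):
    // +---+----------+
    // |key|sum(value)|
    // +---+----------+
    // |  a|      40.0|
    // |  b|      20.0|
    // +---+----------+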

Recommended reference sources:

• Latest version
• Release Notes for Stable Releases
• Apache Spark GitHub Repository

81095 questions
17
votes
3 answers

How to set preferences for ALS implicit feedback in Collaborative Filtering?

I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields userId and productId. I have no product ratings, just info on what products users have bought, that's all. So to train ALS I…
zork
  • 2,085
  • 6
  • 32
  • 48
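
A minimal sketch of treating each purchase as a unit-confidence preference for MLlib's implicit-feedback ALS (the input data and parameter values are hypothetical):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical purchase log: (userId, productId) pairs, no explicit ratings
    val purchases = sc.parallelize(Seq((1, 101), (1, 102), (2, 101), (3, 103)))

    // Treat every purchase as a preference with confidence 1.0
    val ratings = purchases.map { case (user, product) => Rating(user, product, 1.0) }

    // trainImplicit interprets the rating field as confidence, not as a score
    val rank = 10
    val iterations = 10
    val lambda = 0.01
    val alpha = 1.0
    val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
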
17
votes
1 answer

apache spark MLLib: how to build labeled points for string features?

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents. I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems…
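
One common approach here is to hash the string features into a fixed-size vector with MLlib's HashingTF and wrap it in a LabeledPoint; a minimal sketch with hypothetical documents:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical documents: (label, string features such as author, tags, keywords)
    val docs = sc.parallelize(Seq(
      (0.0, Seq("alice", "spark", "ml")),
      (1.0, Seq("bob", "hadoop", "hdfs"))
    ))

    // Hash each bag of strings into a numeric feature vector
    val hashingTF = new HashingTF(10000)
    val labeled = docs.map { case (label, features) =>
      LabeledPoint(label, hashingTF.transform(features))
    }

    val model = NaiveBayes.train(labeled)
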
17
votes
3 answers

dynamically bind variable/parameter in Spark SQL?

How to bind variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729
  • 171
  • 1
  • 1
  • 4
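
One common workaround is to substitute the value with ordinary Scala string interpolation before the query is submitted; a minimal sketch (the table and value are illustrative):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val val1 = "some_value"

    // Interpolate the variable into the SQL string on the driver
    sqlContext.sql(s"SELECT * FROM src WHERE col1 = '$val1'").collect().foreach(println)
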
17
votes
1 answer

In Spark, what is the right way to have a static object on all workers?

I've been looking at the documentation for Spark and it mentions this: Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this: Anonymous function syntax, which can be…
Daniel Langdon
  • 5,899
  • 4
  • 28
  • 48
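
Two patterns commonly come up for this: a broadcast variable for read-only data, and a lazily initialised singleton object for per-executor resources. A minimal sketch of both, with hypothetical data (the singleton pattern assumes a compiled application rather than the shell):

    // 1) Read-only lookup data, shipped once per executor as a broadcast variable
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val keys = sc.parallelize(Seq("a", "b", "c"))
    val mapped = keys.map(k => lookup.value.getOrElse(k, 0))

    // 2) A per-JVM singleton, initialised lazily on each executor
    object Shared {
      lazy val fmt = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd")
    }
    val dates = sc.parallelize(Seq("2014-01-01", "2014-01-02"))
    val parsed = dates.map(s => java.time.LocalDate.parse(s, Shared.fmt))
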
17
votes
1 answer

error: not found: type SparkConf

I installed Spark, both pre-compiled and standalone builds, but neither is able to run val conf = new SparkConf(). The error is error: not found: type SparkConf: scala> val conf = new SparkConf() :10: error: not found: type SparkConf The pre-compiled…
del
  • 199
  • 1
  • 1
  • 7
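
The usual cause is simply a missing import; SparkConf has to be imported before it can be referenced. A minimal sketch for a standalone application (inside spark-shell, reuse the existing sc instead of creating a new context):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("my-app").setMaster("local[*]")
    val sc = new SparkContext(conf)
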
17
votes
5 answers

Is caching the only advantage of spark over map-reduce?

I have started to learn about Apache Spark and am very impressed by the framework. One thing which keeps bothering me, though, is that in all Spark presentations they talk about how Spark caches the RDDs and therefore multiple operations which need…
Knows Not Much
  • 30,395
  • 60
  • 197
  • 373
17
votes
3 answers

Is it possible to start an embedded instance of an Apache Spark node?

I want to start an instance of a standalone Apache Spark cluster embedded into my Java app. I tried to find some documentation on their website but have had no luck so far. Is this possible?
Rodrigo
  • 195
  • 1
  • 10
17
votes
2 answers

Does groupByKey in Spark preserve the original order?

In Spark, the groupByKey function transforms a (K, V) pair RDD into a (K, Iterable<V>) pair RDD. Yet, is this function stable? i.e. is the order in the iterable preserved from the original order? For example, if I originally read a file of the…
Jean Logeart
  • 52,687
  • 11
  • 83
  • 118
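
groupByKey gives no guarantee about the order of values inside each Iterable, so if an order matters it has to be imposed explicitly, for example by carrying the original position along and sorting within each group. A minimal sketch with hypothetical data:

    // (key, (originalIndex, value)) pairs, e.g. produced with zipWithIndex before grouping
    val pairs = sc.parallelize(Seq(("a", (2L, "x3")), ("a", (0L, "x1")), ("a", (1L, "x2"))))

    // Sort by the carried index inside each group to restore the original order
    val ordered = pairs.groupByKey().mapValues(_.toSeq.sortBy(_._1).map(_._2))

    ordered.collect().foreach(println)   // values for key "a" come back as x1, x2, x3
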
17
votes
1 answer

How jobs are assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application. I've set a batch time of 10 seconds, so a job is started every 10 seconds reading input from my HDFS. If every job lasts for more than 10 seconds, the new job that is…
gprivitera
  • 933
  • 1
  • 8
  • 22
17
votes
1 answer

Modify collection inside a Spark RDD foreach

I'm trying to add elements to a map while iterating the elements of an RDD. I'm not getting any errors, but the modifications are not happening. It all works fine adding directly or iterating other collections: scala> val myMap = new…
palako
  • 3,342
  • 2
  • 23
  • 33
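
The reason nothing changes is that foreach runs on the executors, so each task mutates its own deserialized copy of the driver-side map. A minimal sketch of two alternatives, with hypothetical data:

    // Build the result on the cluster, then bring it back to the driver
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val myMap: scala.collection.Map[String, Int] = rdd.collectAsMap()

    // For simple counters, an accumulator is the supported way to aggregate from foreach
    val counter = sc.longAccumulator("processed")
    rdd.foreach(_ => counter.add(1))
    println(counter.value)   // 3
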
17
votes
3 answers

Spark: Writing to Avro file

I am in Spark and I have an RDD from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem =>…
user1013725
  • 571
  • 1
  • 4
  • 17
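
With the spark-avro data source on the classpath, the DataFrame API can read, transform, and write Avro without the Hadoop AvroJob plumbing; a minimal sketch (paths and the transformation are illustrative, and an existing SparkSession named spark is assumed):

    // Requires the spark-avro package, e.g. --packages org.apache.spark:spark-avro_2.12:<version>
    import org.apache.spark.sql.functions.col

    val df = spark.read.format("avro").load("hdfs:///in/events.avro")

    val transformed = df.filter(col("value") > 0)   // illustrative transformation

    transformed.write.format("avro").save("hdfs:///out/events_transformed")
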
16
votes
2 answers

pyspark read multiple csv files at once

I'm using Spark to read files from HDFS. There is a scenario where we are getting files as chunks from a legacy system in csv…
Raja
  • 507
  • 1
  • 6
  • 24
16
votes
2 answers

Spark: What is the difference between repartition and repartitionByRange?

I went through the documentation here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html It says: for repartition: resulting DataFrame is hash partitioned. for repartitionByRange: resulting DataFrame is range partitioned. And a…
pallupz
  • 793
  • 3
  • 9
  • 25
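
The practical difference is in how the keys end up distributed across partitions; a minimal sketch, assuming an existing SparkSession named spark:

    import spark.implicits._

    val df = (1 to 1000).toDF("id")

    // repartition: rows are placed by hash(id), so every partition holds an arbitrary mix of ids
    val byHash = df.repartition(4, $"id")

    // repartitionByRange: ids are sampled and split into contiguous sorted ranges,
    // so each partition holds one consecutive slice of ids
    val byRange = df.repartitionByRange(4, $"id")
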
16
votes
3 answers

"'JavaPackage' object is not callable" error executing explain() in Pyspark 3.0.1 via Zeppelin

I am running Pyspark 3.0.1 for Hadoop 2.7 in a Zeppelin notebook. In general all is well, however when I execute df.explain() on a DataFrame I get this error: Fail to execute line 3: df.explain() Traceback (most recent call last): File…
Phil
  • 598
  • 1
  • 9
  • 21
16
votes
2 answers

spark windowing function VS group by performance issue

I have a dataframe that looks like
| id | date       | KPI_1 | ... | KPI_n |
| 1  | 2012-12-12 | 0.1   | ... | 0.5   |
| 2  | 2012-12-12 | 0.2   | ... | 0.4   |
| 3  | 2012-12-12 | 0.66  | ... | 0.66  |
| 1  | 2012-12-13 | 0.2   | ... | 0.46  |
| 4  | 2012-12-14 | …
JayZee
  • 851
  • 1
  • 10
  • 17
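
Both approaches can express the same per-date aggregation, but groupBy collapses each date to a single row while a window keeps every input row and attaches the aggregate to it; a minimal sketch over a simplified version of the dataframe above (assuming an existing SparkSession named spark):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.max
    import spark.implicits._

    // Simplified version of the dataframe in the question
    val df = Seq((1, "2012-12-12", 0.1), (2, "2012-12-12", 0.2), (1, "2012-12-13", 0.2))
      .toDF("id", "date", "KPI_1")

    // groupBy: one output row per date
    val aggregated = df.groupBy("date").agg(max("KPI_1").as("max_kpi_1"))

    // window: the per-date maximum is attached to every original row
    val w = Window.partitionBy("date")
    val windowed = df.withColumn("max_kpi_1", max("KPI_1").over(w))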