Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive analysis as well as iterative algorithms in machine learning or graph computing.
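
For example, an RDD loaded once and cached can be queried repeatedly from memory; a minimal sketch (the file path and app name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[*]"))

    // Load once, keep the partitions in cluster memory after the first action
    val logs = sc.textFile("hdfs:///data/app.log").cache()

    val total  = logs.count()                               // materialises the cache
    val errors = logs.filter(_.contains("ERROR")).count()   // answered from memory
    val warns  = logs.filter(_.contains("WARN")).count()    // answered from memory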

Spark can also tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since 2.3), SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
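
As an illustration, a micro-batch Structured Streaming job uses the same DataFrame API as a batch job; a minimal sketch based on the word-count example from the Spark documentation (host and port are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[2]").appName("streaming-demo").getOrCreate()
    import spark.implicits._

    // Read lines from a socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and count them, updated every micro-batch
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Emit the running counts to the console
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()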

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (a.k.a. MVCE) and, when applicable, specify the Spark version you're using, since behavior often differs between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
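
A reproducible example typically builds its input inline rather than referring to private data, and states the expected output; a hypothetical sketch:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.master("local[*]").appName("mvce").getOrCreate()
    import spark.implicits._

    // Small inline input that anyone can run
    val df = Seq((1, "a", 10.0), (2, "b", 20.0), (3, "a", 30.0)).toDF("id", "key", "value")

    // The transformation the question is about
    val result = df.groupBy("key").sum("value")

    result.show()
    // Expected output (row order may vary):
    // +---+----------+
    // |key|sum(value)|
    // +---+----------+
    // |  a|      40.0|
    // |  b|      20.0|
    // +---+----------+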

Recommended reference sources:

• Latest version
• Release Notes for Stable Releases
• Apache Spark GitHub Repository

81095 questions
17
votes
3 answers

How to set preferences for ALS implicit feedback in Collaborative Filtering?

I am trying to use Spark MLib ALS with implicit feedback for collaborative filtering. Input data has only two fields userId and productId. I have no product ratings, just info on what products users have bought, that's all. So to train ALS I…
zork
  • 2,085
  • 6
  • 32
  • 48
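
A minimal sketch of treating each purchase as a unit-confidence preference for MLlib's implicit-feedback ALS (the input data and parameter values are hypothetical):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical purchase log: (userId, productId) pairs, no explicit ratings
    val purchases = sc.parallelize(Seq((1, 101), (1, 102), (2, 101), (3, 103)))

    // Treat every purchase as a preference with confidence 1.0
    val ratings = purchases.map { case (user, product) => Rating(user, product, 1.0) }

    // trainImplicit interprets the rating field as confidence, not as a score
    val rank = 10
    val iterations = 10
    val lambda = 0.01
    val alpha = 1.0
    val model = ALS.trainImplicit(ratings, rank, iterations, lambda, alpha)
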
17
votes
1 answer

apache spark MLLib: how to build labeled points for string features?

I am trying to build a NaiveBayes classifier with Spark's MLLib which takes as input a set of documents. I'd like to put some things as features (i.e. authors, explicit tags, implicit keywords, category), but looking at the documentation it seems…
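
One common approach here is to hash the string features into a fixed-size vector with MLlib's HashingTF and wrap it in a LabeledPoint; a minimal sketch with hypothetical documents:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical documents: (label, string features such as author, tags, keywords)
    val docs = sc.parallelize(Seq(
      (0.0, Seq("alice", "spark", "ml")),
      (1.0, Seq("bob", "hadoop", "hdfs"))
    ))

    // Hash each bag of strings into a numeric feature vector
    val hashingTF = new HashingTF(10000)
    val labeled = docs.map { case (label, features) =>
      LabeledPoint(label, hashingTF.transform(features))
    }

    val model = NaiveBayes.train(labeled)
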
17
votes
3 answers

dynamically bind variable/parameter in Spark SQL?

How to bind variable in Apache Spark SQL? For example: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.sql("SELECT * FROM src WHERE col1 = ${VAL1}").collect().foreach(println)
user3769729
  • 171
  • 1
  • 1
  • 4
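
One common workaround is to substitute the value with ordinary Scala string interpolation before the query is submitted; a minimal sketch (the table and value are illustrative):

    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val val1 = "some_value"

    // Interpolate the variable into the SQL string on the driver
    sqlContext.sql(s"SELECT * FROM src WHERE col1 = '$val1'").collect().foreach(println)
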
17
votes
1 answer

In Spark, what is the right way to have a static object on all workers?

I've been looking at the documentation for Spark and it mentions this: Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this: Anonymous function syntax, which can be…
Daniel Langdon
  • 5,899
  • 4
  • 28
  • 48
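
Two patterns commonly come up for this: a broadcast variable for read-only data, and a lazily initialised singleton object for per-executor resources. A minimal sketch of both, with hypothetical data (the singleton pattern assumes a compiled application rather than the shell):

    // 1) Read-only lookup data, shipped once per executor as a broadcast variable
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val keys = sc.parallelize(Seq("a", "b", "c"))
    val mapped = keys.map(k => lookup.value.getOrElse(k, 0))

    // 2) A per-JVM singleton, initialised lazily on each executor
    object Shared {
      lazy val fmt = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd")
    }
    val dates = sc.parallelize(Seq("2014-01-01", "2014-01-02"))
    val parsed = dates.map(s => java.time.LocalDate.parse(s, Shared.fmt))
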
17
votes
1 answer

error: not found: type SparkConf

I installed Spark, both pre-compiled and standalone builds, but neither is able to run val conf = new SparkConf(). The error is error: not found: type SparkConf: scala> val conf = new SparkConf() :10: error: not found: type SparkConf The pre-compiled…
del
  • 199
  • 1
  • 1
  • 7
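
The usual cause is simply a missing import; SparkConf has to be imported before it can be referenced. A minimal sketch for a standalone application (inside spark-shell, reuse the existing sc instead of creating a new context):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("my-app").setMaster("local[*]")
    val sc = new SparkContext(conf)
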
17
votes
5 answers

Is caching the only advantage of spark over map-reduce?

I have started to learn about Apache Spark and am very impressed by the framework. One thing which keeps bothering me, though, is that in all Spark presentations they talk about how Spark caches the RDDs and therefore multiple operations which need…
Knows Not Much
  • 30,395
  • 60
  • 197
  • 373
17
votes
3 answers

Is it possible to start an embedded instance of an Apache Spark node?

I want to start an instance of a standalone Apache Spark cluster embedded into my Java app. I tried to find some documentation on their website but have had no luck so far. Is this possible?
Rodrigo
  • 195
  • 1
  • 10
17
votes
2 answers

Does groupByKey in Spark preserve the original order?

In Spark, the groupByKey function transforms a (K, V) pair RDD into a (K, Iterable<V>) pair RDD. Yet, is this function stable? i.e. is the order in the iterable preserved from the original order? For example, if I originally read a file of the…
Jean Logeart
  • 52,687
  • 11
  • 83
  • 118
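
groupByKey gives no guarantee about the order of values inside each Iterable, so if an order matters it has to be imposed explicitly, for example by carrying the original position along and sorting within each group. A minimal sketch with hypothetical data:

    // (key, (originalIndex, value)) pairs, e.g. produced with zipWithIndex before grouping
    val pairs = sc.parallelize(Seq(("a", (2L, "x3")), ("a", (0L, "x1")), ("a", (1L, "x2"))))

    // Sort by the carried index inside each group to restore the original order
    val ordered = pairs.groupByKey().mapValues(_.toSeq.sortBy(_._1).map(_._2))

    ordered.collect().foreach(println)   // values for key "a" come back as x1, x2, x3
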
17
votes
1 answer

How jobs are assigned to executors in Spark Streaming?

Let's say I've got 2 or more executors in a Spark Streaming application. I've set a batch time of 10 seconds, so a job is started every 10 seconds reading input from my HDFS. If every job lasts for more than 10 seconds, the new job that is…
gprivitera
  • 933
  • 1
  • 8
  • 22
17
votes
1 answer

Modify collection inside a Spark RDD foreach

I'm trying to add elements to a map while iterating the elements of an RDD. I'm not getting any errors, but the modifications are not happening. It all works fine adding directly or iterating other collections: scala> val myMap = new…
palako
  • 3,342
  • 2
  • 23
  • 33
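
The reason nothing changes is that foreach runs on the executors, so each task mutates its own deserialized copy of the driver-side map. A minimal sketch of two alternatives, with hypothetical data:

    // Build the result on the cluster, then bring it back to the driver
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
    val myMap: scala.collection.Map[String, Int] = rdd.collectAsMap()

    // For simple counters, an accumulator is the supported way to aggregate from foreach
    val counter = sc.longAccumulator("processed")
    rdd.foreach(_ => counter.add(1))
    println(counter.value)   // 3
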
17
votes
3 answers

Spark: Writing to Avro file

I am in Spark and I have an RDD from an Avro file. I now want to do some transformations on that RDD and save it back as an Avro file: val job = new Job(new Configuration()) AvroJob.setOutputKeySchema(job, getOutputSchema(inputSchema)) rdd.map(elem =>…
user1013725
  • 571
  • 1
  • 4
  • 17
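
With the spark-avro data source on the classpath, the DataFrame API can read, transform, and write Avro without the Hadoop AvroJob plumbing; a minimal sketch (paths and the transformation are illustrative, and an existing SparkSession named spark is assumed):

    // Requires the spark-avro package, e.g. --packages org.apache.spark:spark-avro_2.12:<version>
    import org.apache.spark.sql.functions.col

    val df = spark.read.format("avro").load("hdfs:///in/events.avro")

    val transformed = df.filter(col("value") > 0)   // illustrative transformation

    transformed.write.format("avro").save("hdfs:///out/events_transformed")
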
16
votes
2 answers

pyspark read multiple csv files at once

I'm using Spark to read files from HDFS. There is a scenario where we are getting files as chunks from a legacy system in csv…
Raja
  • 507
  • 1
  • 6
  • 24
16
votes
2 answers

Spark: What is the difference between repartition and repartitionByRange?

I went through the documentation here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html It says: for repartition: resulting DataFrame is hash partitioned. for repartitionByRange: resulting DataFrame is range partitioned. And a…
pallupz
  • 793
  • 3
  • 9
  • 25
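
The practical difference is in how the keys end up distributed across partitions; a minimal sketch, assuming an existing SparkSession named spark:

    import spark.implicits._

    val df = (1 to 1000).toDF("id")

    // repartition: rows are placed by hash(id), so every partition holds an arbitrary mix of ids
    val byHash = df.repartition(4, $"id")

    // repartitionByRange: ids are sampled and split into contiguous sorted ranges,
    // so each partition holds one consecutive slice of ids
    val byRange = df.repartitionByRange(4, $"id")
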
16
votes
3 answers

"'JavaPackage' object is not callable" error executing explain() in Pyspark 3.0.1 via Zeppelin

I am running Pyspark 3.0.1 for Hadoop 2.7 in a Zeppelin notebook. In general all is well, however when I execute df.explain() on a DataFrame I get this error: Fail to execute line 3: df.explain() Traceback (most recent call last): File…
Phil
  • 598
  • 1
  • 9
  • 21
16
votes
2 answers

spark windowing function VS group by performance issue

I have a dataframe that looks like
| id | date       | KPI_1 | ... | KPI_n |
| 1  | 2012-12-12 | 0.1   | ... | 0.5   |
| 2  | 2012-12-12 | 0.2   | ... | 0.4   |
| 3  | 2012-12-12 | 0.66  | ... | 0.66  |
| 1  | 2012-12-13 | 0.2   | ... | 0.46  |
| 4  | 2012-12-14 | …
JayZee
  • 851
  • 1
  • 10
  • 17
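
Both approaches can express the same per-date aggregation, but groupBy collapses each date to a single row while a window keeps every input row and attaches the aggregate to it; a minimal sketch over a simplified version of the dataframe above (assuming an existing SparkSession named spark):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.max
    import spark.implicits._

    // Simplified version of the dataframe in the question
    val df = Seq((1, "2012-12-12", 0.1), (2, "2012-12-12", 0.2), (1, "2012-12-13", 0.2))
      .toDF("id", "date", "KPI_1")

    // groupBy: one output row per date
    val aggregated = df.groupBy("date").agg(max("KPI_1").as("max_kpi_1"))

    // window: the per-date maximum is attached to every original row
    val w = Window.partitionBy("date")
    val windowed = df.withColumn("max_kpi_1", max("KPI_1").over(w))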