Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets to users for both batch and streaming processing. Use cases for Apache Spark often relate to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as iterative algorithms in machine learning or graph computation.
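
The load-once, query-repeatedly pattern described above can be sketched minimally in PySpark (the dataset and session settings below are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

    df = spark.range(0, 1000000)  # illustrative dataset
    df.cache()                    # keep the data in cluster memory

    # Repeated queries now hit the in-memory copy instead of recomputing from source.
    print(df.filter("id % 2 = 0").count())
    print(df.selectExpr("avg(id)").first())

    spark.stop()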

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
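
As a rough illustration of the micro-batch approach, here is a minimal Structured Streaming sketch in PySpark (the socket source and word count are illustrative only):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.master("local[*]").appName("stream-demo").getOrCreate()

    # Read a stream of lines from a socket, split into words, count per word.
    lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each micro-batch updates the running counts and prints them to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()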

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behaviour often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
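
A reproducible example along those lines is usually just a few lines: inline sample data, the code that misbehaves, and the version in use. A minimal, hypothetical sketch in PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()
    print(spark.version)  # always state the Spark version you are running

    # Small inline data instead of a private file nobody else can read.
    df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "score"])
    df.show()  # include the actual output and the output you expected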

Recommended reference sources:
  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
98 votes, 10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) Join works fine but you can't call the id column because it is ambiguous and you would get the following exception: pyspark.sql.utils.AnalysisException:…
asked by thecheech (2,041)
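
One common way around the ambiguity described in this question, sketched in PySpark with illustrative sample data, is to join on the column name rather than an equality expression, or to drop one side's column explicitly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "left_val"])
    df2 = spark.createDataFrame([(1, "b")], ["id", "right_val"])

    # Joining on the name keeps a single 'id' column in the result.
    joined = df1.join(df2, on="id")
    joined.show()

    # Or keep the expression join and drop one side's 'id' afterwards.
    joined2 = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])
    joined2.show()
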
98 votes, 6 answers

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

Can anyone explain the difference between reducebykey, groupbykey, aggregatebykey and combinebykey? I have read the documents regarding this, but couldn't understand the exact differences. An explanation with examples would be great.
asked by Arun S (1,363)
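
A quick side-by-side sketch in PySpark, assuming a small pair RDD; all four calls compute per-key sums here, but they differ in how much combining happens before the shuffle and how much control you get over it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # reduceByKey: one associative function, combined map-side before shuffling.
    print(rdd.reduceByKey(lambda x, y: x + y).collect())

    # groupByKey: shuffles every value, grouping first and aggregating afterwards.
    print(rdd.groupByKey().mapValues(sum).collect())

    # aggregateByKey: explicit zero value, per-partition op, and merge op.
    print(rdd.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b).collect())

    # combineByKey: the most general form; you also control how a combiner is created.
    print(rdd.combineByKey(lambda v: v, lambda acc, v: acc + v, lambda a, b: a + b).collect())
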
97 votes, 7 answers

Cannot find col function in pyspark

In pyspark 1.6.2, I can import col function by from pyspark.sql.functions import col but when I try to look it up in the Github source code I find no col function in functions.py file, how can python import a function that doesn't exist?
asked by Bamqf (3,382)
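
The import does work at runtime: many entries in pyspark.sql.functions (col among them) are generated dynamically from a list of names when the module is loaded, so there is no literal "def col" to find in that version's source. A minimal usage sketch in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col  # resolves even though it is generated, not hand-written

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    df.select(col("id") + 1).show()
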
97 votes, 19 answers

How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run spark interactively with python 3. However, if I want to run a standalone program in local mode, I get an…
asked by Kevin (3,391)
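
One commonly suggested approach for local/standalone mode, sketched here against the newer SparkSession API for brevity (the interpreter name is illustrative), is to set the relevant environment variables before the context is created:

    import os

    # Point both the executors and the driver at the interpreter you want.
    os.environ["PYSPARK_PYTHON"] = "python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    print(spark.sparkContext.pythonVer)  # e.g. '3.x'
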
97 votes, 8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …
asked by Jason (2,834)
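
With the DataFrame API, a minimal sketch of one common approach (sample data mirrors the question's shape but is illustrative): dropDuplicates with a column subset keeps one row per combination of those columns.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame(
        [("Foo", 41, "US", 3), ("Foo", 39, "UK", 1), ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 2)],
        ["name", "age", "country", "score"],
    )

    # One surviving row per (name, country) pair; which row survives is not guaranteed.
    df.dropDuplicates(["name", "country"]).show()
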
96 votes, 5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
asked by kecso (2,387)
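
A minimal sketch in PySpark: the DataFrame itself exposes no such method in that version, but its underlying RDD does.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(0, 100)

    # Number of partitions of the DataFrame's underlying RDD.
    print(df.rdd.getNumPartitions())
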
96 votes, 4 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
asked by lilloraffa (1,367)
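
A minimal sketch in PySpark (column names are illustrative): build the aggregate expressions in a list and unpack them into agg(), instead of spelling each one out.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

    # One sum(...) expression per column, built programmatically.
    exprs = [F.sum(c).alias(c) for c in ["col2", "col3"]]
    df.groupBy("col1").agg(*exprs).show()
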
95 votes, 5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…
asked by architectonic (2,871)
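
A minimal sketch of one common approach in PySpark: add a typed null literal with withColumn (the column name and type are illustrative).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])

    # An "empty" string column: every row holds null, but the schema has a concrete type.
    df.withColumn("new_col", F.lit(None).cast("string")).show()
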
95 votes, 5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…
asked by Luke (6,699)
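
DataFrames are immutable, so "updating" a column means deriving a new DataFrame with the changed values. A minimal sketch in PySpark, with an illustrative replacement rule:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1, "old"), (2, "old")], ["x", "y"])

    # Replace y where x == 1 and keep the existing value everywhere else.
    df = df.withColumn("y", F.when(F.col("x") == 1, "new_value").otherwise(F.col("y")))
    df.show()
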
94 votes, 10 answers

How to set up Spark on Windows?

I am trying to setup Apache Spark on Windows. After searching a bit, I understand that the standalone mode is what I want. Which binaries do I download in order to run Apache spark in windows? I see distributions with hadoop and cdh at the spark…
asked by Siva (1,839)
93 votes, 13 answers

Mac spark-shell Error initializing SparkContext

I tried to start spark 1.6.0 (spark-1.6.0-bin-hadoop2.4) on Mac OS Yosemite 10.10.5 using "./bin/spark-shell". It has the error below. I also tried to install different versions of Spark but all have the same error. This is the second time I'm…
asked by Jia (1,301)
93 votes, 4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…
asked by Breach (1,288)
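
Schema inference expects row-like records (tuples, Rows, dicts), not bare floats, so one minimal fix, sketched in PySpark, is to wrap each value in a one-element tuple before calling toDF:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    myFloatRdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0])

    # Each element becomes a one-column row, so the schema can be inferred.
    df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])
    df.show()
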
92 votes, 6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
asked by bbarker (11,636)
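
The question targets JUnit, but the underlying pattern is the same in any test framework: create one SparkSession per test run and stop it afterwards. A minimal, hypothetical sketch using pytest and PySpark:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # One shared local session for the whole test run.
        session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
        yield session
        session.stop()

    def test_word_lengths(spark):
        df = spark.createDataFrame([("spark",), ("test",)], ["word"])
        lengths = [row["len"] for row in df.selectExpr("length(word) AS len").collect()]
        assert lengths == [5, 4]
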
91 votes, 4 answers

What is the difference between spark checkpoint and persist to a disk

What is the difference between Spark checkpoint and persist to a disk? Are both of these stored on the local disk?
asked by nagendra (1,885)
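
A minimal sketch contrasting the two in PySpark (the checkpoint directory is illustrative): persist(DISK_ONLY) keeps the RDD's lineage and stores blocks on the executors' local disks, while checkpoint() materializes the data to the configured checkpoint directory and truncates the lineage.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10)).map(lambda x: x * x)

    # Persist to local disk: lineage is retained, blocks live with the executors.
    rdd.persist(StorageLevel.DISK_ONLY)
    print(rdd.count())

    # Checkpoint: written to the checkpoint directory, lineage is cut.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    rdd.checkpoint()
    print(rdd.count())
    print(rdd.isCheckpointed())
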
91 votes, 4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using python interface (SparkSQL) The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…
asked by user3803714 (5,269)
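
The question uses Spark 1.3 temp tables, but with the DataFrame API a minimal sketch (table and column names are illustrative) is to pass either a list of shared column names or a combined boolean condition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    numeric = spark.createDataFrame([("x", "y", 1.0)], ["a", "b", "value"])
    Ref = spark.createDataFrame([("x", "y", "label")], ["a", "b", "name"])

    # Join on a list of shared column names...
    test = numeric.join(Ref, on=["a", "b"], how="inner")
    test.show()

    # ...or on an explicit condition combining several columns.
    test2 = numeric.join(Ref, (numeric["a"] == Ref["a"]) & (numeric["b"] == Ref["b"]))
    test2.show()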