Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets to users for both batch and streaming processing. Use cases for Apache Spark often relate to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as iterative algorithms in machine learning or graph computation.
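
The load-once, query-repeatedly pattern described above can be sketched minimally in PySpark (the dataset and session settings below are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()

    df = spark.range(0, 1000000)  # illustrative dataset
    df.cache()                    # keep the data in cluster memory

    # Repeated queries now hit the in-memory copy instead of recomputing from source.
    print(df.filter("id % 2 = 0").count())
    print(df.selectExpr("avg(id)").first())

    spark.stop()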

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
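
As a rough illustration of the micro-batch approach, here is a minimal Structured Streaming sketch in PySpark (the socket source and word count are illustrative only):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.master("local[*]").appName("stream-demo").getOrCreate()

    # Read a stream of lines from a socket, split into words, count per word.
    lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each micro-batch updates the running counts and prints them to the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()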

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behaviour often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
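
A reproducible example along those lines is usually just a few lines: inline sample data, the code that misbehaves, and the version in use. A minimal, hypothetical sketch in PySpark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()
    print(spark.version)  # always state the Spark version you are running

    # Small inline data instead of a private file nobody else can read.
    df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "score"])
    df.show()  # include the actual output and the output you expected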

Recommended reference sources:
  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
98 votes, 10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) Join works fine but you can't call the id column because it is ambiguous and you would get the following exception: pyspark.sql.utils.AnalysisException:…
asked by thecheech (2,041)
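
One common way around the ambiguity described in this question, sketched in PySpark with illustrative sample data, is to join on the column name rather than an equality expression, or to drop one side's column explicitly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df1 = spark.createDataFrame([(1, "a")], ["id", "left_val"])
    df2 = spark.createDataFrame([(1, "b")], ["id", "right_val"])

    # Joining on the name keeps a single 'id' column in the result.
    joined = df1.join(df2, on="id")
    joined.show()

    # Or keep the expression join and drop one side's 'id' afterwards.
    joined2 = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])
    joined2.show()
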
98 votes, 6 answers

Spark difference between reduceByKey vs. groupByKey vs. aggregateByKey vs. combineByKey

Can anyone explain the difference between reducebykey, groupbykey, aggregatebykey and combinebykey? I have read the documents regarding this, but couldn't understand the exact differences. An explanation with examples would be great.
asked by Arun S (1,363)
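
A quick side-by-side sketch in PySpark, assuming a small pair RDD; all four calls compute per-key sums here, but they differ in how much combining happens before the shuffle and how much control you get over it:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # reduceByKey: one associative function, combined map-side before shuffling.
    print(rdd.reduceByKey(lambda x, y: x + y).collect())

    # groupByKey: shuffles every value, grouping first and aggregating afterwards.
    print(rdd.groupByKey().mapValues(sum).collect())

    # aggregateByKey: explicit zero value, per-partition op, and merge op.
    print(rdd.aggregateByKey(0, lambda acc, v: acc + v, lambda a, b: a + b).collect())

    # combineByKey: the most general form; you also control how a combiner is created.
    print(rdd.combineByKey(lambda v: v, lambda acc, v: acc + v, lambda a, b: a + b).collect())
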
97 votes, 7 answers

Cannot find col function in pyspark

In pyspark 1.6.2, I can import col function by from pyspark.sql.functions import col but when I try to look it up in the Github source code I find no col function in functions.py file, how can python import a function that doesn't exist?
asked by Bamqf (3,382)
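
The import does work at runtime: many entries in pyspark.sql.functions (col among them) are generated dynamically from a list of names when the module is loaded, so there is no literal "def col" to find in that version's source. A minimal usage sketch in PySpark:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col  # resolves even though it is generated, not hand-written

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "val"])
    df.select(col("id") + 1).show()
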
97 votes, 19 answers

How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run spark interactively with python 3. However, if I want to run a standalone program in local mode, I get an…
asked by Kevin (3,391)
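
One commonly suggested approach for local/standalone mode, sketched here against the newer SparkSession API for brevity (the interpreter name is illustrative), is to set the relevant environment variables before the context is created:

    import os

    # Point both the executors and the driver at the interpreter you want.
    os.environ["PYSPARK_PYTHON"] = "python3"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    print(spark.sparkContext.pythonVer)  # e.g. '3.x'
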
97 votes, 8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …
asked by Jason (2,834)
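
With the DataFrame API, a minimal sketch of one common approach (sample data mirrors the question's shape but is illustrative): dropDuplicates with a column subset keeps one row per combination of those columns.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame(
        [("Foo", 41, "US", 3), ("Foo", 39, "UK", 1), ("Bar", 57, "CA", 2), ("Bar", 72, "CA", 2)],
        ["name", "age", "country", "score"],
    )

    # One surviving row per (name, country) pair; which row survives is not guaranteed.
    df.dropDuplicates(["name", "country"]).show()
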
96 votes, 5 answers

Get current number of partitions of a DataFrame

Is there any way to get the current number of partitions of a DataFrame? I checked the DataFrame javadoc (spark 1.6) and didn't find a method for that, or did I just miss it? (In the case of JavaRDD there's a getNumPartitions() method.)
asked by kecso (2,387)
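
A minimal sketch in PySpark: the DataFrame itself exposes no such method in that version, but its underlying RDD does.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.range(0, 100)

    # Number of partitions of the DataFrame's underlying RDD.
    print(df.rdd.getNumPartitions())
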
96 votes, 4 answers

Spark SQL: apply aggregate functions to a list of columns

Is there a way to apply an aggregate function to all (or a list of) columns of a dataframe, when doing a groupBy? In other words, is there a way to avoid doing this for every column: df.groupBy("col1") .agg(sum("col2").alias("col2"),…
asked by lilloraffa (1,367)
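
A minimal sketch in PySpark (column names are illustrative): build the aggregate expressions in a list and unpack them into agg(), instead of spelling each one out.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([("a", 1, 2), ("a", 3, 4)], ["col1", "col2", "col3"])

    # One sum(...) expression per column, built programmatically.
    exprs = [F.sum(c).alias(c) for c in ["col2", "col3"]]
    df.groupBy("col1").agg(*exprs).show()
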
95 votes, 5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…
asked by architectonic (2,871)
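
A minimal sketch of one common approach in PySpark: add a typed null literal with withColumn (the column name and type are illustrative).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["id"])

    # An "empty" string column: every row holds null, but the schema has a concrete type.
    df.withColumn("new_col", F.lit(None).cast("string")).show()
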
95 votes, 5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…
asked by Luke (6,699)
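
DataFrames are immutable, so "updating" a column means deriving a new DataFrame with the changed values. A minimal sketch in PySpark, with an illustrative replacement rule:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.createDataFrame([(1, "old"), (2, "old")], ["x", "y"])

    # Replace y where x == 1 and keep the existing value everywhere else.
    df = df.withColumn("y", F.when(F.col("x") == 1, "new_value").otherwise(F.col("y")))
    df.show()
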
94 votes, 10 answers

How to set up Spark on Windows?

I am trying to setup Apache Spark on Windows. After searching a bit, I understand that the standalone mode is what I want. Which binaries do I download in order to run Apache spark in windows? I see distributions with hadoop and cdh at the spark…
asked by Siva (1,839)
93 votes, 13 answers

Mac spark-shell Error initializing SparkContext

I tried to start spark 1.6.0 (spark-1.6.0-bin-hadoop2.4) on Mac OS Yosemite 10.10.5 using "./bin/spark-shell". It has the error below. I also tried to install different versions of Spark but all have the same error. This is the second time I'm…
asked by Jia (1,301)
93 votes, 4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…
asked by Breach (1,288)
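
Schema inference expects row-like records (tuples, Rows, dicts), not bare floats, so one minimal fix, sketched in PySpark, is to wrap each value in a one-element tuple before calling toDF:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    myFloatRdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0])

    # Each element becomes a one-column row, so the schema can be inferred.
    df = myFloatRdd.map(lambda x: (x,)).toDF(["value"])
    df.show()
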
92 votes, 6 answers

How to write unit tests in Spark 2.0+?

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession, even though it…
asked by bbarker (11,636)
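
The question targets JUnit, but the underlying pattern is the same in any test framework: create one SparkSession per test run and stop it afterwards. A minimal, hypothetical sketch using pytest and PySpark:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # One shared local session for the whole test run.
        session = SparkSession.builder.master("local[2]").appName("unit-tests").getOrCreate()
        yield session
        session.stop()

    def test_word_lengths(spark):
        df = spark.createDataFrame([("spark",), ("test",)], ["word"])
        lengths = [row["len"] for row in df.selectExpr("length(word) AS len").collect()]
        assert lengths == [5, 4]
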
91 votes, 4 answers

What is the difference between spark checkpoint and persist to a disk

What is the difference between Spark checkpoint and persist to a disk? Are both of these stored on the local disk?
asked by nagendra (1,885)
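
A minimal sketch contrasting the two in PySpark (the checkpoint directory is illustrative): persist(DISK_ONLY) keeps the RDD's lineage and stores blocks on the executors' local disks, while checkpoint() materializes the data to the configured checkpoint directory and truncates the lineage.

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(10)).map(lambda x: x * x)

    # Persist to local disk: lineage is retained, blocks live with the executors.
    rdd.persist(StorageLevel.DISK_ONLY)
    print(rdd.count())

    # Checkpoint: written to the checkpoint directory, lineage is cut.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    rdd.checkpoint()
    print(rdd.count())
    print(rdd.isCheckpointed())
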
91 votes, 4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using python interface (SparkSQL) The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…
asked by user3803714 (5,269)
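
The question uses Spark 1.3 temp tables, but with the DataFrame API a minimal sketch (table and column names are illustrative) is to pass either a list of shared column names or a combined boolean condition:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    numeric = spark.createDataFrame([("x", "y", 1.0)], ["a", "b", "value"])
    Ref = spark.createDataFrame([("x", "y", "label")], ["a", "b", "name"])

    # Join on a list of shared column names...
    test = numeric.join(Ref, on=["a", "b"], how="inner")
    test.show()

    # ...or on an explicit condition combining several columns.
    test2 = numeric.join(Ref, (numeric["a"] == Ref["a"]) & (numeric["b"] == Ref["b"]))
    test2.show()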