Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.
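
For illustration, a minimal PySpark sketch of that load-once, query-repeatedly pattern (the input path and column name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and pin the data in cluster memory.
df = spark.read.json("events.json")  # hypothetical input path
df.cache()

# The first action materializes the cache; later queries are served from memory.
df.filter(df["status"] == "error").count()
df.groupBy("status").count().show()
```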

Spark can be used to tackle stream processing problems with several approaches (micro-batch processing, continuous processing since 2.3, running SQL queries on streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
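
As one example of the micro-batch approach, a minimal Structured Streaming sketch (host and port are placeholders) that applies the batch DataFrame API to a stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Treat a socket as an unbounded table of lines (host/port are placeholders).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame API as batch: count occurrences of each line.
counts = lines.groupBy("value").count()

# Emit updated counts to the console as each micro-batch completes.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```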

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Apache Cassandra, Apache HBase, and Apache Hive.

When asking Spark-related questions, please don't forget to provide a reproducible example (also known as an MVCE) and, when applicable, specify the Spark version you're using (since behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
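
A minimal reproducible example usually looks something like this sketch: state the version, inline a tiny data set, and show the exact call that misbehaves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()
print(spark.version)  # always state the Spark version you are on

# Inline the smallest data set that reproduces the behaviour,
# instead of referencing files nobody else can read.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```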

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81,095 questions
127 votes • 16 answers

Spark - Error "A master URL must be set in your configuration" when submitting an app

I have a Spark app which runs with no problem in local mode, but has some problems when submitted to the Spark cluster. The error message is as follows: 16/06/24 15:42:06 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2,…
Shuai Zhang • 2,011 • 3 • 22 • 23
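
For reference, one common resolution: hard-code a master only for local runs, and otherwise supply it at submit time. A hedged PySpark sketch (the cluster URL is a placeholder):

```python
from pyspark.sql import SparkSession

# For local experiments it is fine to set the master in code:
spark = SparkSession.builder.appName("my-app").master("local[*]").getOrCreate()

# For cluster runs, prefer leaving .master() out of the code and passing it
# at submit time, e.g.:  spark-submit --master spark://host:7077 my_app.py
# (spark://host:7077 is a placeholder for your cluster's master URL)
```
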
126 votes • 42 answers

Pyspark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run pyspark on my MacBook Air. When I try starting it up I get the error Exception: Java gateway process exited before sending the driver its port number when sc = SparkContext() is called at startup. I have tried running the…
mt88 • 2,855 • 8 • 24 • 42
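
A frequent cause is that the JVM never starts because no usable JDK is visible to PySpark; a hedged sketch of that particular fix (the JDK path is hypothetical):

```python
import os

# Point PySpark at a compatible JDK before the gateway is launched.
# The path below is hypothetical; use the location of your own JDK.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"

from pyspark import SparkContext

sc = SparkContext("local[*]", "gateway-check")
print(sc.version)
```
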
126 votes • 13 answers

Load CSV file with PySpark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael • 3,270 • 4 • 22 • 42
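
On modern Spark (2.x and later) the built-in CSV reader replaces the manual split; a minimal sketch (the header and schema options are assumptions about the file's layout):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-demo").getOrCreate()

# Built-in CSV source: handles quoting and delimiters that a naive
# line.split(',') would get wrong.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
```
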
123 votes • 11 answers

Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop? If not, are there any features I'll miss when I run Spark without Hadoop?
tourist • 4,165 • 6 • 25 • 47
122 votes • 4 answers

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.
Edison • 1,225 • 2 • 10 • 8
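
In short, spark.default.parallelism applies to RDD operations, while spark.sql.shuffle.partitions (default 200, hence the constant 200 tasks) controls DataFrame/SQL shuffle stages. A sketch setting both (the value 64 is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitions-demo")
         # Controls DataFrame/SQL shuffle stages (the default is 200).
         .config("spark.sql.shuffle.partitions", "64")
         # Controls RDD operations such as reduceByKey and join.
         .config("spark.default.parallelism", "64")
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```
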
121 votes • 15 answers

How to load local file in sc.textFile, instead of HDFS

I'm following the great Spark tutorial, and at 46m:00s I'm trying to load the README.md, but it fails. What I'm doing is this: $ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash bash-4.1# cd…
Jas • 14,493 • 27 • 97 • 148
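
The usual answer is an explicit file:// URI; a hedged sketch (the path is hypothetical, and on a real cluster the file must exist on every worker node):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "local-file-demo")

# file:// forces the local filesystem instead of the cluster's default
# filesystem (often HDFS). The path is hypothetical.
rdd = sc.textFile("file:///home/user/README.md")
print(rdd.count())
```
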
120 votes • 15 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a Spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command: sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 1,213 • 2 • 9 • 6
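
The DataFrame equivalent of that SQL is a join followed by a select with df1["*"]; a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "data"])
df2 = spark.createDataFrame([(1, "extra")], ["id", "other"])

# Equivalent of: SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id
result = df1.join(df2, on="id").select(df1["*"], df2["other"])
result.show()
```
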
120 votes • 2 answers

What do the numbers on the progress bar mean in spark-shell?

In my spark-shell, what do entries like the one below mean when I execute a function? [Stage7:===========> (14174 + 5) / 62500]
rmckeown • 1,201 • 2 • 8 • 5
119 votes • 8 answers

How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions for testing the install say to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some…
Chris • 1,195 • 2 • 7 • 7
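
The root cause here is that Spark 2.4.x predates Python 3.8, so the fix is environmental: run PySpark under Python 3.7 or older, or move to Spark 3.x. A hedged guard sketch:

```python
import sys

# Spark 2.4.x does not support Python 3.8; fail fast with a clear message
# instead of the opaque TypeError. (Upgrading to Spark 3.x also resolves it.)
assert sys.version_info < (3, 8), "Run Spark 2.4.x under Python 3.7 or older"

from pyspark import SparkContext

sc = SparkContext("local[*]", "version-check")
print(sc.version)
```
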
119 votes • 10 answers

Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 files of 1 GB gz compressed. In each job, for 1-2 tasks, I keep getting the attached error, where the task reruns afterward dozens of times (preventing the job to…
dotan • 1,484 • 3 • 11 • 13
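
This error typically means the executor that held the shuffle output died, often under memory pressure; a hedged starting point for tuning (values are illustrative, not a guaranteed fix):

```python
from pyspark.sql import SparkSession

# Illustrative settings only: give executors more headroom, relax the
# network timeout, and turn speculation off while diagnosing.
spark = (SparkSession.builder
         .appName("shuffle-tuning")
         .config("spark.executor.memory", "8g")
         .config("spark.network.timeout", "600s")
         .config("spark.speculation", "false")
         .getOrCreate())
```
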
118 votes • 9 answers

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns.…
PyRsquared • 6,970 • 11 • 50 • 86
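
On Spark 2.x and later (the question's 1.3.1 predates the built-in CSV writer), the idiomatic export looks like this sketch; output paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-export").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Spark writes a directory of part files, one per partition;
# coalesce(1) forces a single file at the cost of parallelism.
df.coalesce(1).write.csv("/tmp/table_out", header=True, mode="overwrite")

# For small results, going through pandas yields one plain local file
# (requires pandas on the driver).
df.toPandas().to_csv("/tmp/table.csv", index=False)
```
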
117 votes • 10 answers

How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (I mean reading an empty file) but I don't think that's the best practice.
user1735076 • 3,225 • 7 • 19 • 16
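
The question asks for Scala, but the PySpark analogue shows the idea: pass an empty collection together with an explicit schema, so no file is read at all.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("empty-df").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# No input file needed: an empty list plus the schema is enough.
empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()
```
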
116 votes • 13 answers

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. spark Eclipse on windows 7

I'm not able to run a simple Spark job in Scala IDE (a Maven Spark project) installed on Windows 7. The Spark core dependency has been added. val conf = new SparkConf().setAppName("DemoDF").setMaster("local") val sc = new SparkContext(conf) val logData =…
Elvish_Blade • 1,220 • 3 • 12 • 13
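
The "null\bin\winutils.exe" in the message means HADOOP_HOME was never set; the fix, in any language, is to place winutils.exe under %HADOOP_HOME%\bin. A hedged PySpark sketch (the directory is hypothetical):

```python
import os

# Must be set before the JVM starts; C:\hadoop is hypothetical and must
# contain bin\winutils.exe matching your Hadoop version.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

from pyspark import SparkContext

sc = SparkContext("local[*]", "winutils-demo")
```
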
115 votes • 3 answers

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to…
bluenote10 • 23,414 • 14 • 122 • 178
115 votes • 2 answers

What does "Stage Skipped" mean in Apache Spark web UI?

From my Spark UI: what does it mean when a stage is marked "skipped"?
Aravind Yarram • 78,777 • 46 • 231 • 327