Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.
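
For illustration, a minimal PySpark sketch of that load-once, query-repeatedly pattern (the input path and column name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and pin the data in cluster memory.
df = spark.read.json("events.json")  # hypothetical input path
df.cache()

# The first action materializes the cache; later queries are served from memory.
df.filter(df["status"] == "error").count()
df.groupBy("status").count().show()
```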

Spark can be used to tackle stream processing problems with several approaches (micro-batch processing, continuous processing since 2.3, running SQL queries on streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
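
As one example of the micro-batch approach, a minimal Structured Streaming sketch (host and port are placeholders) that applies the batch DataFrame API to a stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Treat a socket as an unbounded table of lines (host/port are placeholders).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# The same DataFrame API as batch: count occurrences of each line.
counts = lines.groupBy("value").count()

# Emit updated counts to the console as each micro-batch completes.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```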

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Apache Cassandra, Apache HBase, and Apache Hive.

When asking Spark-related questions, please don't forget to provide a reproducible example (also known as an MVCE) and, when applicable, specify the Spark version you're using (since behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
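
A minimal reproducible example usually looks something like this sketch: state the version, inline a tiny data set, and show the exact call that misbehaves.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()
print(spark.version)  # always state the Spark version you are on

# Inline the smallest data set that reproduces the behaviour,
# instead of referencing files nobody else can read.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```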

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81,095 questions
127 votes • 16 answers

Spark - Error "A master URL must be set in your configuration" when submitting an app

I have a Spark app which runs with no problem in local mode, but has some problems when submitted to the Spark cluster. The error message is as follows: 16/06/24 15:42:06 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2,…
Shuai Zhang • 2,011 • 3 • 22 • 23
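
For reference, one common resolution: hard-code a master only for local runs, and otherwise supply it at submit time. A hedged PySpark sketch (the cluster URL is a placeholder):

```python
from pyspark.sql import SparkSession

# For local experiments it is fine to set the master in code:
spark = SparkSession.builder.appName("my-app").master("local[*]").getOrCreate()

# For cluster runs, prefer leaving .master() out of the code and passing it
# at submit time, e.g.:  spark-submit --master spark://host:7077 my_app.py
# (spark://host:7077 is a placeholder for your cluster's master URL)
```
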
126 votes • 42 answers

Pyspark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run pyspark on my MacBook Air. When I try starting it up I get the error Exception: Java gateway process exited before sending the driver its port number when sc = SparkContext() is called at startup. I have tried running the…
mt88 • 2,855 • 8 • 24 • 42
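
A frequent cause is that the JVM never starts because no usable JDK is visible to PySpark; a hedged sketch of that particular fix (the JDK path is hypothetical):

```python
import os

# Point PySpark at a compatible JDK before the gateway is launched.
# The path below is hypothetical; use the location of your own JDK.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk"

from pyspark import SparkContext

sc = SparkContext("local[*]", "gateway-check")
print(sc.version)
```
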
126 votes • 13 answers

Load CSV file with PySpark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael • 3,270 • 4 • 22 • 42
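
On modern Spark (2.x and later) the built-in CSV reader replaces the manual split; a minimal sketch (the header and schema options are assumptions about the file's layout):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-demo").getOrCreate()

# Built-in CSV source: handles quoting and delimiters that a naive
# line.split(',') would get wrong.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.show()
```
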
123 votes • 11 answers

Can Apache Spark run without Hadoop?

Are there any dependencies between Spark and Hadoop? If not, are there any features I'll miss when I run Spark without Hadoop?
tourist • 4,165 • 6 • 25 • 47
122 votes • 4 answers

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

What's the difference between spark.sql.shuffle.partitions and spark.default.parallelism? I have tried to set both of them in SparkSQL, but the task number of the second stage is always 200.
Edison • 1,225 • 2 • 10 • 8
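
In short, spark.default.parallelism applies to RDD operations, while spark.sql.shuffle.partitions (default 200, hence the constant 200 tasks) controls DataFrame/SQL shuffle stages. A sketch setting both (the value 64 is arbitrary):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partitions-demo")
         # Controls DataFrame/SQL shuffle stages (the default is 200).
         .config("spark.sql.shuffle.partitions", "64")
         # Controls RDD operations such as reduceByKey and join.
         .config("spark.default.parallelism", "64")
         .getOrCreate())

print(spark.conf.get("spark.sql.shuffle.partitions"))
```
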
121 votes • 15 answers

How to load local file in sc.textFile, instead of HDFS

I'm following the great Spark tutorial, and at 46m:00s I'm trying to load the README.md, but it fails. What I'm doing is this: $ sudo docker run -i -t -h sandbox sequenceiq/spark:1.1.0 /etc/bootstrap.sh -bash bash-4.1# cd…
Jas • 14,493 • 27 • 97 • 148
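
The usual answer is an explicit file:// URI; a hedged sketch (the path is hypothetical, and on a real cluster the file must exist on every worker node):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "local-file-demo")

# file:// forces the local filesystem instead of the cluster's default
# filesystem (often HDFS). The path is hypothetical.
rdd = sc.textFile("file:///home/user/README.md")
print(rdd.count())
```
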
120 votes • 15 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a Spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command: sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 1,213 • 2 • 9 • 6
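
The DataFrame equivalent of that SQL is a join followed by a select with df1["*"]; a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-demo").getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "data"])
df2 = spark.createDataFrame([(1, "extra")], ["id", "other"])

# Equivalent of: SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id
result = df1.join(df2, on="id").select(df1["*"], df2["other"])
result.show()
```
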
120 votes • 2 answers

What do the numbers on the progress bar mean in spark-shell?

In my spark-shell, what do entries like the one below mean when I execute a function? [Stage7:===========> (14174 + 5) / 62500]
rmckeown • 1,201 • 2 • 8 • 5
119 votes • 8 answers

How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions for testing the install say to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some…
Chris • 1,195 • 2 • 7 • 7
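
The root cause here is that Spark 2.4.x predates Python 3.8, so the fix is environmental: run PySpark under Python 3.7 or older, or move to Spark 3.x. A hedged guard sketch:

```python
import sys

# Spark 2.4.x does not support Python 3.8; fail fast with a clear message
# instead of the opaque TypeError. (Upgrading to Spark 3.x also resolves it.)
assert sys.version_info < (3, 8), "Run Spark 2.4.x under Python 3.7 or older"

from pyspark import SparkContext

sc = SparkContext("local[*]", "version-check")
print(sc.version)
```
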
119 votes • 10 answers

Why do Spark jobs fail with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 in speculation mode?

I'm running a Spark job in speculation mode. I have around 500 tasks and around 500 files of 1 GB gz compressed. In each job, for 1-2 tasks, I keep getting the attached error, where the task reruns afterward dozens of times (preventing the job to…
dotan • 1,484 • 3 • 11 • 13
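
This error typically means the executor that held the shuffle output died, often under memory pressure; a hedged starting point for tuning (values are illustrative, not a guaranteed fix):

```python
from pyspark.sql import SparkSession

# Illustrative settings only: give executors more headroom, relax the
# network timeout, and turn speculation off while diagnosing.
spark = (SparkSession.builder
         .appName("shuffle-tuning")
         .config("spark.executor.memory", "8g")
         .config("spark.network.timeout", "600s")
         .config("spark.speculation", "false")
         .getOrCreate())
```
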
118 votes • 9 answers

How to export a table dataframe in PySpark to csv?

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it "table") to a csv file so I can manipulate it and plot the columns.…
PyRsquared • 6,970 • 11 • 50 • 86
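
On Spark 2.x and later (the question's 1.3.1 predates the built-in CSV writer), the idiomatic export looks like this sketch; output paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-export").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Spark writes a directory of part files, one per partition;
# coalesce(1) forces a single file at the cost of parallelism.
df.coalesce(1).write.csv("/tmp/table_out", header=True, mode="overwrite")

# For small results, going through pandas yields one plain local file
# (requires pandas on the driver).
df.toPandas().to_csv("/tmp/table.csv", index=False)
```
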
117 votes • 10 answers

How to create an empty DataFrame with a specified schema?

I want to create a DataFrame with a specified schema in Scala. I have tried using a JSON read (I mean reading an empty file) but I don't think that's the best practice.
user1735076 • 3,225 • 7 • 19 • 16
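
The question asks for Scala, but the PySpark analogue shows the idea: pass an empty collection together with an explicit schema, so no file is read at all.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("empty-df").getOrCreate()

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# No input file needed: an empty list plus the schema is enough.
empty_df = spark.createDataFrame([], schema)
empty_df.printSchema()
```
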
116 votes • 13 answers

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. spark Eclipse on windows 7

I'm not able to run a simple Spark job in Scala IDE (a Maven Spark project) installed on Windows 7. The Spark core dependency has been added. val conf = new SparkConf().setAppName("DemoDF").setMaster("local") val sc = new SparkContext(conf) val logData =…
Elvish_Blade • 1,220 • 3 • 12 • 13
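
The "null\bin\winutils.exe" in the message means HADOOP_HOME was never set; the fix, in any language, is to place winutils.exe under %HADOOP_HOME%\bin. A hedged PySpark sketch (the directory is hypothetical):

```python
import os

# Must be set before the JVM starts; C:\hadoop is hypothetical and must
# contain bin\winutils.exe matching your Hadoop version.
os.environ["HADOOP_HOME"] = r"C:\hadoop"

from pyspark import SparkContext

sc = SparkContext("local[*]", "winutils-demo")
```
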
115 votes • 3 answers

What are the benefits of Apache Beam over Spark/Flink for batch processing?

Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to…
bluenote10 • 23,414 • 14 • 122 • 178
115 votes • 2 answers

What does "Stage Skipped" mean in Apache Spark web UI?

From my Spark UI: what does it mean when a stage is marked "skipped"?
Aravind Yarram • 78,777 • 46 • 231 • 327