Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive workloads as well as iterative algorithms in machine learning and graph computing.
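For illustration, a minimal PySpark sketch of the "load once, query repeatedly" pattern described above (the input path and column name are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    events = spark.read.parquet("/data/events")      # hypothetical input path
    events.cache()                                    # ask Spark to keep the data in cluster memory
    print(events.count())                             # first action materializes the cache
    print(events.filter(events["status"] == "error").count())  # later queries are served from memory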

Spark can be used to tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries and windowing over data and over streams, applying ML libraries to learn from streamed data, and so on.
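As a hedged sketch of the micro-batch approach, the snippet below uses the built-in rate source (which generates timestamp/value rows for testing) together with watermarking and a windowed aggregation; it assumes an active SparkSession named spark:

    from pyspark.sql import functions as F

    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    windowed = (stream
                .withWatermark("timestamp", "1 minute")
                .groupBy(F.window("timestamp", "30 seconds"))
                .count())

    query = (windowed.writeStream
             .outputMode("update")
             .format("console")
             .start())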

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (also known as an MVCE) and, when applicable, to specify the Spark version you're using (behaviour can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
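A minimal reproducible example usually looks something like the sketch below: it states the version, builds a local session, and inlines a tiny dataset instead of referring to private files (all names here are illustrative):

    import pyspark
    from pyspark.sql import SparkSession

    print(pyspark.__version__)                       # state the exact Spark version

    spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "value"])
    df.groupBy("id").count().show()                  # the smallest code that reproduces the issue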

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
15 votes · 2 answers

Spark Driver memory and Application Master memory

Am I understanding the documentation for client mode correctly? Client mode is the opposite of cluster mode, where the driver runs within the application master? In client mode the driver and the application master are separate processes and therefore…
user782220
15 votes · 1 answer

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

I always thought that the Dataset/DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs:…
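Whether a predicate was pushed down can be verified from the physical plan. A hedged PySpark sketch (source path and column name borrowed from the question) that checks an untyped Column filter:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    players = spark.read.parquet("/data/players")          # hypothetical Parquet source
    filtered = players.filter(F.col("birthYear") > 1990)   # untyped Column expression

    # The physical plan lists PushedFilters when the predicate reaches the data source.
    filtered.explain(True)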
15 votes · 4 answers

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table: df=spark.sql('select * from ') | Name | URL visited | …
user8946942
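A commonly used approach for this conversion is Row.asDict(); a small sketch, assuming df is the DataFrame from the question:

    # Collect to the driver (fine for small results) and convert each Row to a dict.
    dict_rows = [row.asDict() for row in df.collect()]

    # Distributed variant that converts on the executors instead:
    dict_rdd = df.rdd.map(lambda row: row.asDict())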
15 votes · 9 answers

Invalid Spark URL in local spark session

Since updating to Spark 2.3.0, tests which are run in my CI (Semaphore) fail due to an allegedly invalid Spark URL when creating the (local) Spark context: 18/03/07 03:07:11 ERROR SparkContext: Error initializing…
Lorenz Bernauer
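A workaround often suggested for this error is to pin the driver host name so the RPC layer does not build a URL it later rejects; a hedged sketch (the config value is an assumption, adjust to your CI environment):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.driver.host", "localhost")   # avoid host names Spark's URL parser rejects
             .appName("ci-tests")
             .getOrCreate())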
15 votes · 2 answers

Spark + Parquet + Snappy: overall compression ratio drops after Spark shuffles data

Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, let's call it product, on HDFS which was imported using Sqoop ImportTool as-parquet-file using codec snappy. As result of…
Mikhail Dubkov
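One frequently tried remedy is to restore data locality after the shuffle by sorting within partitions before writing, which tends to help Parquet encoding and Snappy compression; a sketch with hypothetical column and path names:

    (df.repartition("product_id")                 # hypothetical key column
       .sortWithinPartitions("product_id")        # group similar values together again
       .write.mode("overwrite")
       .option("compression", "snappy")
       .parquet("hdfs:///warehouse/product_sorted"))   # hypothetical output path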
15 votes · 2 answers

Saving contents of df.show() as a string in spark-scala app

I need to save the output of df.show() as a string so that I can email it directly. For example, for the snippet below taken from the official Spark docs: val df = spark.read.json("examples/src/main/resources/people.json") // Displays the content of the…
Omkar
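The question targets a Scala app, but for reference, in PySpark show() prints from the Python side, so its output can be captured by redirecting stdout; a minimal sketch assuming df is the DataFrame in question:

    import io
    import contextlib

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.show(20, truncate=False)        # prints the usual ASCII table
    table_as_string = buf.getvalue()       # now usable as an e-mail body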
15 votes · 3 answers

Error while exploding a struct column in Spark

I have a dataframe whose schema looks like this: event: struct (nullable = true) | | event_category: string (nullable = true) | | event_name: string (nullable = true) | | properties: struct (nullable = true) | | | ErrorCode: string…
shiva.n404
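explode() applies to array and map columns, not to structs; a struct is flattened by selecting its fields instead. A sketch using the field names visible in the schema above:

    from pyspark.sql import functions as F

    flat = df.select(
        F.col("event.event_category"),
        F.col("event.event_name"),
        F.col("event.properties.ErrorCode"),
    )

    # Or flatten every field of the struct at once:
    flat_all = df.select("event.*")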
15 votes · 2 answers

Spark Dataset unpersist behaviour

Recently I saw some strange behaviour of Spark. I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode: val data = spark.read (...) data.join(df1, "key") //etc, more transformations data.cache(); // used to not…
T. Gawęda
15 votes · 1 answer

Spark: monotonically increasing id not working as expected in dataframe?

I have a dataframe df in Spark which looks something like this: scala> df.show() +--------+--------+ |columna1|columna2| +--------+--------+ | 0.1| 0.4| | 0.2| 0.5| | 0.1| 0.3| | 0.3| 0.6| | 0.2| 0.7| | …
antonioACR1
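monotonically_increasing_id() guarantees unique, increasing values, but not consecutive ones, because the partition id is encoded in the upper bits. A hedged sketch that derives consecutive indices on top of it (note the single-partition window):

    from pyspark.sql import functions as F, Window

    with_id = df.withColumn("mono_id", F.monotonically_increasing_id())

    w = Window.orderBy("mono_id")
    indexed = with_id.withColumn("row_idx", F.row_number().over(w) - 1)   # 0..n-1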
15 votes · 3 answers

What is the relationship between tasks and partitions?

Can I say that: the number of Spark tasks is equal to the number of Spark partitions? One run of an executor (one batch inside an executor) is equal to one task? Every task produces only one partition? (duplicate of 1.)
cdhit
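As a rough illustration, one task is launched per partition of the data a stage processes, so the partition count is also the task count for that stage; a sketch assuming an active SparkSession named spark:

    rdd = spark.sparkContext.parallelize(range(1000), 8)
    print(rdd.getNumPartitions())      # 8 partitions -> 8 tasks when an action runs

    df = spark.range(0, 1000).repartition(4)
    print(df.rdd.getNumPartitions())   # 4 partitions -> 4 tasks per stage over this data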
15 votes · 2 answers

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found. My question: Does anyone have a working example of…
m01
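For reference, a minimal console-sink sketch (stream_df stands in for the streaming DataFrame from the question); note that the console sink writes to the driver's stdout, which under Zeppelin typically ends up in the interpreter log rather than the notebook cell:

    query = (stream_df.writeStream
             .format("console")
             .outputMode("append")
             .option("truncate", False)
             .start())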
15 votes · 1 answer

--files option in pyspark not working

I tried sc.addFile option (working without any issues) and --files option from the command line (failed). Run 1 : spark_distro.py from pyspark import SparkContext, SparkConf from pyspark import SparkFiles def import_my_special_package(x): from…
goks
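For context, files shipped with --files are resolved on the driver and executors through SparkFiles.get(); a hedged sketch with a hypothetical file name:

    # Submitted with something like:
    #   spark-submit --files /local/path/config.json spark_distro.py
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="files-demo")
    path_on_node = SparkFiles.get("config.json")   # refer to the file by name only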
15 votes · 2 answers

Dataframe transpose with pyspark in Apache Spark

I have a dataframe df that has the following structure: +-----+-----+-----+-------+ | s |col_1|col_2|col_...| +-----+-----+-----+-------+ | f1 | 0.0| 0.6| ... | | f2 | 0.6| 0.7| ... | | f3 | 0.5| 0.9| ... | | ...| ...| ...| ... …
Mehdi Ben Hamida
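One common way to transpose is to unpivot the metric columns with stack() and then pivot on the original key column; a sketch using the column names from the excerpt (extend the stack() call for more columns):

    from pyspark.sql import functions as F

    long_df = df.selectExpr(
        "s",
        "stack(2, 'col_1', col_1, 'col_2', col_2) as (metric, value)"
    )
    transposed = long_df.groupBy("metric").pivot("s").agg(F.first("value"))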
15 votes · 1 answer

How to do count(*) within a spark dataframe groupBy

My intention is to do the equivalent of the basic SQL: select shipgrp, shipstatus, count(*) cnt from shipstatus group by shipgrp, shipstatus. The examples that I have seen for Spark dataframes include rollups by other columns:…
WestCoastProjects
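A sketch of the DataFrame equivalent of that SQL, assuming df holds the shipstatus data:

    from pyspark.sql import functions as F

    counts = (df.groupBy("shipgrp", "shipstatus")
                .agg(F.count("*").alias("cnt")))

    # Shorthand that produces a column literally named "count":
    counts_short = df.groupBy("shipgrp", "shipstatus").count()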
15 votes · 4 answers

How to read streaming dataset once and output to multiple sinks?

I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it to one S3 sink and one Elasticsearch sink. Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so it seems…
s11230
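One approach often suggested for writing a single stream to several sinks is foreachBatch (Spark 2.4+); a hedged sketch where the paths, the index name, and the Elasticsearch format string (elasticsearch-hadoop connector) are assumptions:

    def write_both_sinks(batch_df, batch_id):
        batch_df.persist()                                                 # reuse the batch for both writes
        batch_df.write.mode("append").parquet("s3a://my-bucket/output/")   # hypothetical path
        (batch_df.write.format("org.elasticsearch.spark.sql")
                 .mode("append")
                 .save("my-index"))                                        # hypothetical index
        batch_df.unpersist()

    query = stream_df.writeStream.foreachBatch(write_both_sinks).start()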