Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as iterative algorithms in machine learning or graph computing.
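
As an illustration of that pattern, here is a minimal PySpark sketch (the Parquet path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Load once and keep the dataset in cluster memory (hypothetical path and columns).
events = spark.read.parquet("hdfs:///data/events")
events.cache()

# Repeated queries now hit the cached in-memory copy instead of re-reading from disk.
events.filter(events.status == "error").count()
events.groupBy("status").count().show()
```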

Spark can be used to tackle stream processing problems with several approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
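
For instance, a minimal Structured Streaming sketch in the default micro-batch mode, counting rows per one-minute window over the built-in rate source (the source, trigger, and sink choices here are purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; handy for demos.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Count rows per 1-minute event-time window.
windowed = stream.groupBy(window("timestamp", "1 minute")).count()

query = (windowed.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")   # micro-batch trigger
         .start())
query.awaitTermination(60)  # run for a minute, then stop
query.stop()
```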

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since different versions can often behave differently). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
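
In practice, a reproducible example usually boils down to a tiny hand-built DataFrame, the exact transformation in question, and the observed versus expected output, for example (data and column names are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()
print(spark.version)  # always state the Spark version you are running

# A small, self-contained input that anyone can paste and run.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])

# The exact transformation you are asking about, plus its output.
df.groupBy("letter").count().show()
```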

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
16 votes, 1 answer

Spill to disk and shuffle write in Spark

I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an appendOnlyMap for aggregating and combining partition records, right? Then when execution memory fills up, we start sorting the map, spilling it…
Giorgio • 1,073 • 3 • 15 • 33
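
A quick way to observe both metrics for yourself: run a wide aggregation and watch the stage page in the Spark UI, where "Shuffle Write" is always shown and the "Spill (Memory)" / "Spill (Disk)" columns appear if execution memory fills up and the in-memory map is sorted and spilled. A sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, floor

spark = SparkSession.builder.master("local[4]").appName("spill-demo").getOrCreate()

# A wide aggregation forces a shuffle; its metrics show up in the Spark UI
# (http://localhost:4040) for the corresponding stage.
df = spark.range(0, 10_000_000).withColumn("key", floor(rand() * 100_000))
df.groupBy("key").count().count()
```
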
16 votes, 2 answers

Why is Spark performing worse when using Kryo serialization?

I enabled Kryo serialization for my Spark job, enabled the setting to require registration, and ensured all my types were registered. val conf = new SparkConf() conf.set("spark.serializer",…
Leif Wickland • 3,693 • 26 • 43
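
For reference, a sketch of how those settings are typically supplied when building the session (the class name to register is a placeholder). Note that Kryo only affects JVM-side serialization such as shuffles and cached JVM objects, so whether it helps depends on what is actually being serialized:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kryo-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.registrationRequired", "true")
         # Placeholder class name; list the classes your job actually ships.
         .config("spark.kryo.classesToRegister", "org.example.MyRecord")
         .getOrCreate())
```
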
16 votes, 3 answers

Unable to create array literal in Spark/PySpark

I'm having trouble trying to remove rows from a dataframe based on a two-column list of items to filter. For example, for this dataframe: df = spark.createDataFrame([(100, 'A', 304), (200, 'B', 305), (300, 'C', 306)], ['number', 'letter',…
Mariusz • 13,481 • 3 • 60 • 64
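
One common workaround for the question above is to skip the array literal entirely, express the (number, letter) pairs to drop as a small DataFrame, and remove them with a left anti join; a sketch using the DataFrame from the excerpt (the third column name is a guess, since it is cut off):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(100, 'A', 304), (200, 'B', 305), (300, 'C', 306)],
    ['number', 'letter', 'value'])

# Pairs to filter out, expressed as a DataFrame instead of an array literal.
to_drop = spark.createDataFrame([(100, 'A'), (200, 'B')], ['number', 'letter'])

# Keep only rows whose (number, letter) pair does not appear in to_drop.
df.join(to_drop, on=['number', 'letter'], how='left_anti').show()
```
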
16 votes, 1 answer

PySpark DataFrame filter using logical AND over list of conditions -- Numpy All Equivalent

I'm trying to filter rows of a PySpark dataframe if the values of all columns are zero. I was hoping to use something like this (using the numpy function np.all()): from pyspark.sql.functions import col df.filter(all([(col(c) != 0) for c in…
MarkNS • 3,811 • 2 • 43 • 60
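
The usual Spark counterpart of np.all over a list of column conditions is to fold the Column expressions together with functools.reduce and the & operator; a sketch with made-up data:

```python
from functools import reduce
from operator import and_

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, 0, 0), (1, 0, 2), (0, 3, 0)], ['a', 'b', 'c'])

# Build one Column expression per column, then AND them all together.
all_nonzero = reduce(and_, [(col(c) != 0) for c in df.columns])
df.filter(all_nonzero).show()

# For "keep rows where every column is zero", flip the comparison:
df.filter(reduce(and_, [(col(c) == 0) for c in df.columns])).show()
```
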
16 votes, 5 answers

Writing files to local system with Spark in Cluster mode

I know this is a weird way of using Spark but I'm trying to save a dataframe to the local file system (not hdfs) using Spark even though I'm in cluster mode. I know I can use client mode but I do want to run in cluster mode and don't care which…
tkrhgch • 343 • 1 • 4 • 14
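
If the result is small enough, one pragmatic workaround is to bring it back to the driver and write it there with plain Python or pandas, since in cluster mode a file:// output path ends up on the executors' (and driver's) local disks rather than on the machine you submitted from. A hedged sketch (paths and data are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100).toDF("id")   # stand-in for the real DataFrame

# Collect to the driver and write with plain Python (small results only;
# the file lands on the driver node's local filesystem).
with open("/tmp/output.csv", "w") as f:
    f.write("id\n")
    for row in df.collect():
        f.write(f"{row['id']}\n")

# Or let pandas do the writing, again on the driver node.
df.toPandas().to_csv("/tmp/output_pandas.csv", index=False)
```
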
16 votes, 1 answer

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark: The resulting object will be in descending order so that the first element is the most…
Georg Heiler • 16,916 • 36 • 162 • 292
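
A rough PySpark equivalent of pandas' Series.value_counts() is a groupBy followed by count and a descending sort; a sketch with toy data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), ("a",), ("c",)], ["letter"])

# groupBy + count + sort descending, roughly like pandas value_counts().
df.groupBy("letter").count().orderBy(desc("count")).show()
```
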
16 votes, 2 answers

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151). I tried this in spark-shell: sqlContext.read.load("x.parquet").count And Spark ran two stages, showing various…
Daniel Darabos • 26,991 • 10 • 102 • 114
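
If only the row count of a single Parquet file is needed, the footer metadata can also be read without running a Spark job at all, for example with pyarrow (the file name is taken from the excerpt):

```python
import pyarrow.parquet as pq

# The Parquet footer stores per-row-group row counts, so this never scans data pages.
meta = pq.ParquetFile("x.parquet").metadata
print(meta.num_rows)

# Within Spark, a plain count() on the Parquet source is the usual route and is
# typically cheap, since the reader can use the same per-block row counts:
# spark.read.parquet("x.parquet").count()
```
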
16 votes, 10 answers

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there any way to merge two dataframes or copy a column of a dataframe to another in PySpark? For example, I have two Dataframes: DF1 C1 C2 23397414 …
MrGildarts • 833 • 1 • 10 • 25
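
One common approach for the merge-by-position question is to give both DataFrames an explicit row index via zipWithIndex on their underlying RDDs and then join on that index; a sketch with toy data (column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(23397414,), (5213970,)], ["C1"])
df2 = spark.createDataFrame([("a",), ("b",)], ["C2"])

def with_index(df):
    # zipWithIndex assigns consecutive 0-based indices in partition order.
    return (df.rdd.zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(df.columns + ["_row_id"]))

merged = (with_index(df1)
          .join(with_index(df2), on="_row_id", how="inner")
          .drop("_row_id"))
merged.show()
```
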
16 votes, 4 answers

Spark and Java: Exception thrown in awaitResult

I am trying to connect to a Spark cluster running within a virtual machine with IP 10.20.30.50 and port 7077 from within a Java application and run the word count example: SparkConf conf = new…
Michael Lihs • 7,460 • 17 • 52 • 85
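
That exception is usually a connectivity or version-mismatch problem rather than a coding one: the driver's Spark (and, for JVM applications, Scala) version has to match the cluster's, and the master plus the driver-facing ports must be reachable across the VM boundary. For illustration only, the minimal shape of a driver that connects to that standalone master (address taken from the excerpt):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://10.20.30.50:7077")   # standalone master from the question
         .appName("word-count")
         .getOrCreate())

words = spark.sparkContext.parallelize(["a", "b", "a"])
print(words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y).collect())
```
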
16 votes, 3 answers

Apache Spark: java.lang.NoSuchMethodError .rddToPairRDDFunctions

sbt package runs just fine, but after spark-submit I get the error: Exception in thread "main" java.lang.NoSuchMethodError: …
Daniel Kats • 5,141 • 15 • 65 • 102
16 votes, 3 answers

Loading compressed gzipped csv file in Spark 2.0

How can I load a gzip-compressed csv file in PySpark on Spark 2.0? I know that an uncompressed csv file can be loaded as follows: spark.read.format("csv").option("header", "true").load("myfile.csv") or…
femibyte • 3,317 • 7 • 34 • 59
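
Spark's text-based sources pick the compression codec from the file extension, so a gzipped CSV loads with the same reader call as an uncompressed one; a sketch using the option style from the excerpt (the file name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The .gz extension is detected automatically; no codec option is needed.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("myfile.csv.gz"))
df.show()

# Note that gzip is not splittable, so each .gz file is read by a single task.
```
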
16 votes, 2 answers

In Spark, is it possible to share data between two executors?

I have some really big read-only data that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Does it, under the hood, share data between executors…
pythonic • 20,589 • 43 • 136 • 219
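
For completeness, this is what the broadcast mechanism mentioned in the question looks like in PySpark; the value is shipped to each executor once rather than with every task, but it is not a shared-memory segment between separate executor processes on the same node:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A read-only lookup structure (tiny here, purely for illustration).
lookup = {i: i * i for i in range(1000)}
b_lookup = sc.broadcast(lookup)

rdd = sc.parallelize(range(10))
# Tasks read the broadcast value instead of capturing the dict in the closure.
print(rdd.map(lambda x: b_lookup.value.get(x, -1)).collect())
```
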
16 votes, 2 answers

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
pythonic • 20,589 • 43 • 136 • 219
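
One way to round-trip an RDD in PySpark is saveAsPickleFile plus pickleFile (the HDFS path below is a placeholder); the analogous pair on the Scala side would be saveAsObjectFile and objectFile, or saveAsTextFile with explicit parsing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, "one"), (2, "two")])   # stand-in for the (Long, String) RDD

# Write the RDD to HDFS as a directory of serialized partition files.
rdd.saveAsPickleFile("hdfs:///user/demo/my_rdd")   # placeholder path

# Later, possibly in another Spark application, read it back.
restored = sc.pickleFile("hdfs:///user/demo/my_rdd")
print(restored.collect())
```
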
16 votes, 3 answers

How to connect to remote hive server from spark

I'm running Spark locally and want to access Hive tables that are located in a remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2…
April • 819 • 2 • 12 • 23
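
A common way to do this from a Spark application rather than through beeline is to enable Hive support and point the session at the remote metastore; the thrift URI, database, and table below are placeholders, and the warehouse storage (e.g. HDFS) must also be reachable from where Spark runs:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("remote-hive")
         # Placeholder URI; use the remote cluster's metastore host and port.
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()
         .getOrCreate())

# Tables registered in the remote metastore are now visible to Spark SQL.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()
```
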
16 votes, 2 answers

Add one more StructField to schema

My PySpark data frame has the following schema: schema = spark_df.printSchema() root |-- field_1: double (nullable = true) |-- field_2: double (nullable = true) |-- field_3 (nullable = true) |-- field_4: double (nullable = true) |-- field_5:…
Edamame • 23,718 • 73 • 186 • 320
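
A DataFrame schema is just a StructType, so one more field can be appended either with its add method or by rebuilding it from the existing field list; a sketch with hypothetical field names standing in for the ones in the question:

```python
from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical existing schema (note: printSchema() only prints; df.schema holds the StructType).
schema = StructType([
    StructField("field_1", DoubleType(), True),
    StructField("field_2", DoubleType(), True),
])

# Option 1: build a fresh StructType from the old fields plus the new one.
extended = StructType(schema.fields + [StructField("field_6", DoubleType(), True)])
print(extended.simpleString())

# Option 2: add() appends in place and returns the schema, so calls can be chained.
schema.add(StructField("field_6", DoubleType(), True))
print(schema.simpleString())
```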