Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as iterative algorithms in machine learning or graph computing.
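
As an illustration of that pattern, here is a minimal PySpark sketch (the Parquet path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Load once and keep the dataset in cluster memory (hypothetical path and columns).
events = spark.read.parquet("hdfs:///data/events")
events.cache()

# Repeated queries now hit the cached in-memory copy instead of re-reading from disk.
events.filter(events.status == "error").count()
events.groupBy("status").count().show()
```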

Spark can be used to tackle stream processing problems with several approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
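
For instance, a minimal Structured Streaming sketch in the default micro-batch mode, counting rows per one-minute window over the built-in rate source (the source, trigger, and sink choices here are purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows; handy for demos.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Count rows per 1-minute event-time window.
windowed = stream.groupBy(window("timestamp", "1 minute")).count()

query = (windowed.writeStream
         .outputMode("complete")
         .format("console")
         .trigger(processingTime="10 seconds")   # micro-batch trigger
         .start())
query.awaitTermination(60)  # run for a minute, then stop
query.stop()
```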

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since different versions can often behave differently). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
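
In practice, a reproducible example usually boils down to a tiny hand-built DataFrame, the exact transformation in question, and the observed versus expected output, for example (data and column names are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()
print(spark.version)  # always state the Spark version you are running

# A small, self-contained input that anyone can paste and run.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])

# The exact transformation you are asking about, plus its output.
df.groupBy("letter").count().show()
```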

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
16 votes, 1 answer

Spill to disk and shuffle write in Spark

I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an appendOnlyMap for aggregating and combining partition records, right? Then when execution memory fills up, we start sorting the map, spilling it…
Giorgio • 1,073 • 3 • 15 • 33
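
A quick way to observe both metrics for yourself: run a wide aggregation and watch the stage page in the Spark UI, where "Shuffle Write" is always shown and the "Spill (Memory)" / "Spill (Disk)" columns appear if execution memory fills up and the in-memory map is sorted and spilled. A sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand, floor

spark = SparkSession.builder.master("local[4]").appName("spill-demo").getOrCreate()

# A wide aggregation forces a shuffle; its metrics show up in the Spark UI
# (http://localhost:4040) for the corresponding stage.
df = spark.range(0, 10_000_000).withColumn("key", floor(rand() * 100_000))
df.groupBy("key").count().count()
```
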
16 votes, 2 answers

Why is Spark performing worse when using Kryo serialization?

I enabled Kryo serialization for my Spark job, enabled the setting to require registration, and ensured all my types were registered. val conf = new SparkConf() conf.set("spark.serializer",…
Leif Wickland • 3,693 • 26 • 43
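
For reference, a sketch of how those settings are typically supplied when building the session (the class name to register is a placeholder). Note that Kryo only affects JVM-side serialization such as shuffles and cached JVM objects, so whether it helps depends on what is actually being serialized:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kryo-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .config("spark.kryo.registrationRequired", "true")
         # Placeholder class name; list the classes your job actually ships.
         .config("spark.kryo.classesToRegister", "org.example.MyRecord")
         .getOrCreate())
```
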
16 votes, 3 answers

Unable to create array literal in Spark/PySpark

I'm having trouble trying to remove rows from a dataframe based on a two-column list of items to filter. For example, for this dataframe: df = spark.createDataFrame([(100, 'A', 304), (200, 'B', 305), (300, 'C', 306)], ['number', 'letter',…
Mariusz • 13,481 • 3 • 60 • 64
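
One common workaround for the question above is to skip the array literal entirely, express the (number, letter) pairs to drop as a small DataFrame, and remove them with a left anti join; a sketch using the DataFrame from the excerpt (the third column name is a guess, since it is cut off):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(100, 'A', 304), (200, 'B', 305), (300, 'C', 306)],
    ['number', 'letter', 'value'])

# Pairs to filter out, expressed as a DataFrame instead of an array literal.
to_drop = spark.createDataFrame([(100, 'A'), (200, 'B')], ['number', 'letter'])

# Keep only rows whose (number, letter) pair does not appear in to_drop.
df.join(to_drop, on=['number', 'letter'], how='left_anti').show()
```
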
16 votes, 1 answer

PySpark DataFrame filter using logical AND over list of conditions -- Numpy All Equivalent

I'm trying to filter rows of a PySpark dataframe if the values of all columns are zero. I was hoping to use something like this (using the numpy function np.all()): from pyspark.sql.functions import col df.filter(all([(col(c) != 0) for c in…
MarkNS • 3,811 • 2 • 43 • 60
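
The usual Spark counterpart of np.all over a list of column conditions is to fold the Column expressions together with functools.reduce and the & operator; a sketch with made-up data:

```python
from functools import reduce
from operator import and_

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0, 0, 0), (1, 0, 2), (0, 3, 0)], ['a', 'b', 'c'])

# Build one Column expression per column, then AND them all together.
all_nonzero = reduce(and_, [(col(c) != 0) for c in df.columns])
df.filter(all_nonzero).show()

# For "keep rows where every column is zero", flip the comparison:
df.filter(reduce(and_, [(col(c) == 0) for c in df.columns])).show()
```
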
16 votes, 5 answers

Writing files to local system with Spark in Cluster mode

I know this is a weird way of using Spark but I'm trying to save a dataframe to the local file system (not hdfs) using Spark even though I'm in cluster mode. I know I can use client mode but I do want to run in cluster mode and don't care which…
tkrhgch • 343 • 1 • 4 • 14
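
If the result is small enough, one pragmatic workaround is to bring it back to the driver and write it there with plain Python or pandas, since in cluster mode a file:// output path ends up on the executors' (and driver's) local disks rather than on the machine you submitted from. A hedged sketch (paths and data are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100).toDF("id")   # stand-in for the real DataFrame

# Collect to the driver and write with plain Python (small results only;
# the file lands on the driver node's local filesystem).
with open("/tmp/output.csv", "w") as f:
    f.write("id\n")
    for row in df.collect():
        f.write(f"{row['id']}\n")

# Or let pandas do the writing, again on the driver node.
df.toPandas().to_csv("/tmp/output_pandas.csv", index=False)
```
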
16 votes, 1 answer

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate the functionality of Pandas' df.series.value_counts() in Spark: The resulting object will be in descending order so that the first element is the most…
Georg Heiler • 16,916 • 36 • 162 • 292
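
A rough PySpark equivalent of pandas' Series.value_counts() is a groupBy followed by count and a descending sort; a sketch with toy data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",), ("a",), ("c",)], ["letter"])

# groupBy + count + sort descending, roughly like pandas value_counts().
df.groupBy("letter").count().orderBy(desc("count")).show()
```
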
16 votes, 2 answers

Fast Parquet row count in Spark

The Parquet files contain a per-block row count field. Spark seems to read it at some point (SpecificParquetRecordReaderBase.java#L151). I tried this in spark-shell: sqlContext.read.load("x.parquet").count And Spark ran two stages, showing various…
Daniel Darabos • 26,991 • 10 • 102 • 114
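
If only the row count of a single Parquet file is needed, the footer metadata can also be read without running a Spark job at all, for example with pyarrow (the file name is taken from the excerpt):

```python
import pyarrow.parquet as pq

# The Parquet footer stores per-row-group row counts, so this never scans data pages.
meta = pq.ParquetFile("x.parquet").metadata
print(meta.num_rows)

# Within Spark, a plain count() on the Parquet source is the usual route and is
# typically cheap, since the reader can use the same per-block row counts:
# spark.read.parquet("x.parquet").count()
```
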
16 votes, 10 answers

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there any way to merge two dataframes or copy a column of a dataframe to another in PySpark? For example, I have two Dataframes: DF1 C1 C2 23397414 …
MrGildarts • 833 • 1 • 10 • 25
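
One common approach for the merge-by-position question is to give both DataFrames an explicit row index via zipWithIndex on their underlying RDDs and then join on that index; a sketch with toy data (column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(23397414,), (5213970,)], ["C1"])
df2 = spark.createDataFrame([("a",), ("b",)], ["C2"])

def with_index(df):
    # zipWithIndex assigns consecutive 0-based indices in partition order.
    return (df.rdd.zipWithIndex()
              .map(lambda pair: tuple(pair[0]) + (pair[1],))
              .toDF(df.columns + ["_row_id"]))

merged = (with_index(df1)
          .join(with_index(df2), on="_row_id", how="inner")
          .drop("_row_id"))
merged.show()
```
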
16 votes, 4 answers

Spark and Java: Exception thrown in awaitResult

I am trying to connect to a Spark cluster running within a virtual machine with IP 10.20.30.50 and port 7077 from within a Java application and run the word count example: SparkConf conf = new…
Michael Lihs • 7,460 • 17 • 52 • 85
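
That exception is usually a connectivity or version-mismatch problem rather than a coding one: the driver's Spark (and, for JVM applications, Scala) version has to match the cluster's, and the master plus the driver-facing ports must be reachable across the VM boundary. For illustration only, the minimal shape of a driver that connects to that standalone master (address taken from the excerpt):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://10.20.30.50:7077")   # standalone master from the question
         .appName("word-count")
         .getOrCreate())

words = spark.sparkContext.parallelize(["a", "b", "a"])
print(words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y).collect())
```
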
16 votes, 3 answers

Apache Spark: java.lang.NoSuchMethodError .rddToPairRDDFunctions

sbt package runs just fine, but after spark-submit I get the error: Exception in thread "main" java.lang.NoSuchMethodError: …
Daniel Kats • 5,141 • 15 • 65 • 102
16 votes, 3 answers

Loading compressed gzipped csv file in Spark 2.0

How can I load a gzip-compressed csv file in PySpark on Spark 2.0? I know that an uncompressed csv file can be loaded as follows: spark.read.format("csv").option("header", "true").load("myfile.csv") or…
femibyte • 3,317 • 7 • 34 • 59
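
Spark's text-based sources pick the compression codec from the file extension, so a gzipped CSV loads with the same reader call as an uncompressed one; a sketch using the option style from the excerpt (the file name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The .gz extension is detected automatically; no codec option is needed.
df = (spark.read
      .format("csv")
      .option("header", "true")
      .load("myfile.csv.gz"))
df.show()

# Note that gzip is not splittable, so each .gz file is read by a single task.
```
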
16 votes, 2 answers

In Spark, is it possible to share data between two executors?

I have some really big read-only data that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Does it, under the hood, share data between executors…
pythonic • 20,589 • 43 • 136 • 219
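
For completeness, this is what the broadcast mechanism mentioned in the question looks like in PySpark; the value is shipped to each executor once rather than with every task, but it is not a shared-memory segment between separate executor processes on the same node:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# A read-only lookup structure (tiny here, purely for illustration).
lookup = {i: i * i for i in range(1000)}
b_lookup = sc.broadcast(lookup)

rdd = sc.parallelize(range(10))
# Tasks read the broadcast value instead of capturing the dict in the closure.
print(rdd.map(lambda x: b_lookup.value.get(x, -1)).collect())
```
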
16 votes, 2 answers

How can I save an RDD into HDFS and later read it back?

I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD into the HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
pythonic • 20,589 • 43 • 136 • 219
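
One way to round-trip an RDD in PySpark is saveAsPickleFile plus pickleFile (the HDFS path below is a placeholder); the analogous pair on the Scala side would be saveAsObjectFile and objectFile, or saveAsTextFile with explicit parsing:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([(1, "one"), (2, "two")])   # stand-in for the (Long, String) RDD

# Write the RDD to HDFS as a directory of serialized partition files.
rdd.saveAsPickleFile("hdfs:///user/demo/my_rdd")   # placeholder path

# Later, possibly in another Spark application, read it back.
restored = sc.pickleFile("hdfs:///user/demo/my_rdd")
print(restored.collect())
```
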
16 votes, 3 answers

How to connect to remote hive server from spark

I'm running Spark locally and want to access Hive tables that are located in a remote Hadoop cluster. I'm able to access the Hive tables by launching beeline under SPARK_HOME [ml@master spark-2.0.0]$./bin/beeline Beeline version 1.2.1.spark2…
April • 819 • 2 • 12 • 23
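
A common way to do this from a Spark application rather than through beeline is to enable Hive support and point the session at the remote metastore; the thrift URI, database, and table below are placeholders, and the warehouse storage (e.g. HDFS) must also be reachable from where Spark runs:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("remote-hive")
         # Placeholder URI; use the remote cluster's metastore host and port.
         .config("hive.metastore.uris", "thrift://metastore-host:9083")
         .enableHiveSupport()
         .getOrCreate())

# Tables registered in the remote metastore are now visible to Spark SQL.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM some_db.some_table LIMIT 10").show()
```
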
16 votes, 2 answers

Add one more StructField to schema

My PySpark data frame has the following schema: schema = spark_df.printSchema() root |-- field_1: double (nullable = true) |-- field_2: double (nullable = true) |-- field_3 (nullable = true) |-- field_4: double (nullable = true) |-- field_5:…
Edamame • 23,718 • 73 • 186 • 320
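
A DataFrame schema is just a StructType, so one more field can be appended either with its add method or by rebuilding it from the existing field list; a sketch with hypothetical field names standing in for the ones in the question:

```python
from pyspark.sql.types import StructType, StructField, DoubleType

# Hypothetical existing schema (note: printSchema() only prints; df.schema holds the StructType).
schema = StructType([
    StructField("field_1", DoubleType(), True),
    StructField("field_2", DoubleType(), True),
])

# Option 1: build a fresh StructType from the old fields plus the new one.
extended = StructType(schema.fields + [StructField("field_6", DoubleType(), True)])
print(extended.simpleString())

# Option 2: add() appends in place and returns the schema, so calls can be chained.
schema.add(StructField("field_6", DoubleType(), True))
print(schema.simpleString())
```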