Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive workloads as well as iterative algorithms in machine learning and graph computing.
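For illustration, a minimal PySpark sketch of the "load once, query repeatedly" pattern described above (the input path and column name are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    events = spark.read.parquet("/data/events")      # hypothetical input path
    events.cache()                                    # ask Spark to keep the data in cluster memory
    print(events.count())                             # first action materializes the cache
    print(events.filter(events["status"] == "error").count())  # later queries are served from memory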

Spark can be used to tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries and windowing over data and over streams, applying ML libraries to learn from streamed data, and so on.
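As a hedged sketch of the micro-batch approach, the snippet below uses the built-in rate source (which generates timestamp/value rows for testing) together with watermarking and a windowed aggregation; it assumes an active SparkSession named spark:

    from pyspark.sql import functions as F

    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    windowed = (stream
                .withWatermark("timestamp", "1 minute")
                .groupBy(F.window("timestamp", "30 seconds"))
                .count())

    query = (windowed.writeStream
             .outputMode("update")
             .format("console")
             .start())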

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (also known as an MVCE) and, when applicable, to specify the Spark version you're using (behaviour can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
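A minimal reproducible example usually looks something like the sketch below: it states the version, builds a local session, and inlines a tiny dataset instead of referring to private files (all names here are illustrative):

    import pyspark
    from pyspark.sql import SparkSession

    print(pyspark.__version__)                       # state the exact Spark version

    spark = SparkSession.builder.master("local[2]").appName("mvce").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "value"])
    df.groupBy("id").count().show()                  # the smallest code that reproduces the issue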

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
15 votes · 2 answers

Spark Driver memory and Application Master memory

Am I understanding the documentation for client mode correctly? Client mode is the opposite of cluster mode, where the driver runs within the application master? In client mode the driver and the application master are separate processes and therefore…
user782220
15 votes · 1 answer

Why is predicate pushdown not used in typed Dataset API (vs untyped DataFrame API)?

I always thought that the Dataset/DataFrame APIs are the same, and that the only difference is that the Dataset API gives you compile-time safety. Right? So I have a very simple case: case class Player (playerID: String, birthYear: Int) val playersDs:…
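Whether a predicate was pushed down can be verified from the physical plan. A hedged PySpark sketch (source path and column name borrowed from the question) that checks an untyped Column filter:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    players = spark.read.parquet("/data/players")          # hypothetical Parquet source
    filtered = players.filter(F.col("birthYear") > 1990)   # untyped Column expression

    # The physical plan lists PushedFilters when the predicate reaches the data source.
    filtered.explain(True)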
15 votes · 4 answers

How to convert rows into a list of dictionaries in pyspark?

I have a DataFrame (df) in PySpark, created by reading from a Hive table: df=spark.sql('select * from ') | Name | URL visited | …
user8946942
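A commonly used approach for this conversion is Row.asDict(); a small sketch, assuming df is the DataFrame from the question:

    # Collect to the driver (fine for small results) and convert each Row to a dict.
    dict_rows = [row.asDict() for row in df.collect()]

    # Distributed variant that converts on the executors instead:
    dict_rdd = df.rdd.map(lambda row: row.asDict())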
15 votes · 9 answers

Invalid Spark URL in local spark session

Since updating to Spark 2.3.0, tests which are run in my CI (Semaphore) fail due to an allegedly invalid Spark URL when creating the (local) Spark context: 18/03/07 03:07:11 ERROR SparkContext: Error initializing…
Lorenz Bernauer
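A workaround often suggested for this error is to pin the driver host name so the RPC layer does not build a URL it later rejects; a hedged sketch (the config value is an assumption, adjust to your CI environment):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")
             .config("spark.driver.host", "localhost")   # avoid host names Spark's URL parser rejects
             .appName("ci-tests")
             .getOrCreate())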
15 votes · 2 answers

Spark + Parquet + Snappy: overall compression ratio drops after Spark shuffles data

Community! Please help me understand how to get a better compression ratio with Spark. Let me describe the case: I have a dataset, let's call it product, on HDFS which was imported using Sqoop ImportTool as-parquet-file using codec snappy. As result of…
Mikhail Dubkov
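One frequently tried remedy is to restore data locality after the shuffle by sorting within partitions before writing, which tends to help Parquet encoding and Snappy compression; a sketch with hypothetical column and path names:

    (df.repartition("product_id")                 # hypothetical key column
       .sortWithinPartitions("product_id")        # group similar values together again
       .write.mode("overwrite")
       .option("compression", "snappy")
       .parquet("hdfs:///warehouse/product_sorted"))   # hypothetical output path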
15 votes · 2 answers

Saving contents of df.show() as a string in spark-scala app

I need to save the output of df.show() as a string so that I can email it directly. For example, for the snippet below taken from the official Spark docs: val df = spark.read.json("examples/src/main/resources/people.json") // Displays the content of the…
Omkar
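The question targets a Scala app, but for reference, in PySpark show() prints from the Python side, so its output can be captured by redirecting stdout; a minimal sketch assuming df is the DataFrame in question:

    import io
    import contextlib

    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.show(20, truncate=False)        # prints the usual ASCII table
    table_as_string = buf.getvalue()       # now usable as an e-mail body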
15 votes · 3 answers

Error while exploding a struct column in Spark

I have a dataframe whose schema looks like this: event: struct (nullable = true) | | event_category: string (nullable = true) | | event_name: string (nullable = true) | | properties: struct (nullable = true) | | | ErrorCode: string…
shiva.n404
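explode() applies to array and map columns, not to structs; a struct is flattened by selecting its fields instead. A sketch using the field names visible in the schema above:

    from pyspark.sql import functions as F

    flat = df.select(
        F.col("event.event_category"),
        F.col("event.event_name"),
        F.col("event.properties.ErrorCode"),
    )

    # Or flatten every field of the struct at once:
    flat_all = df.select("event.*")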
15 votes · 2 answers

Spark Dataset unpersist behaviour

Recently I saw some strange behaviour of Spark. I have a pipeline in my application in which I'm manipulating one big Dataset - pseudocode: val data = spark.read (...) data.join(df1, "key") //etc, more transformations data.cache(); // used to not…
T. Gawęda
15 votes · 1 answer

Spark: monotonically increasing id not working as expected in dataframe?

I have a dataframe df in Spark which looks something like this: scala> df.show() +--------+--------+ |columna1|columna2| +--------+--------+ | 0.1| 0.4| | 0.2| 0.5| | 0.1| 0.3| | 0.3| 0.6| | 0.2| 0.7| | …
antonioACR1
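monotonically_increasing_id() guarantees unique, increasing values, but not consecutive ones, because the partition id is encoded in the upper bits. A hedged sketch that derives consecutive indices on top of it (note the single-partition window):

    from pyspark.sql import functions as F, Window

    with_id = df.withColumn("mono_id", F.monotonically_increasing_id())

    w = Window.orderBy("mono_id")
    indexed = with_id.withColumn("row_idx", F.row_number().over(w) - 1)   # 0..n-1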
15 votes · 3 answers

What is the relationship between tasks and partitions?

Can I say that: the number of Spark tasks is equal to the number of Spark partitions? One run of an executor (one batch inside an executor) is equal to one task? Every task produces only one partition? (duplicate of 1.)
cdhit
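As a rough illustration, one task is launched per partition of the data a stage processes, so the partition count is also the task count for that stage; a sketch assuming an active SparkSession named spark:

    rdd = spark.sparkContext.parallelize(range(1000), 8)
    print(rdd.getNumPartitions())      # 8 partitions -> 8 tasks when an action runs

    df = spark.range(0, 1000).repartition(4)
    print(df.rdd.getNumPartitions())   # 4 partitions -> 4 tasks per stage over this data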
15 votes · 2 answers

How to get the output from console streaming sink in Zeppelin?

I'm struggling to get the console sink working with PySpark Structured Streaming when run from Zeppelin. Basically, I'm not seeing any results printed to the screen, or to any logfiles I've found. My question: Does anyone have a working example of…
m01
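For reference, a minimal console-sink sketch (stream_df stands in for the streaming DataFrame from the question); note that the console sink writes to the driver's stdout, which under Zeppelin typically ends up in the interpreter log rather than the notebook cell:

    query = (stream_df.writeStream
             .format("console")
             .outputMode("append")
             .option("truncate", False)
             .start())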
15 votes · 1 answer

--files option in pyspark not working

I tried sc.addFile option (working without any issues) and --files option from the command line (failed). Run 1 : spark_distro.py from pyspark import SparkContext, SparkConf from pyspark import SparkFiles def import_my_special_package(x): from…
goks
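For context, files shipped with --files are resolved on the driver and executors through SparkFiles.get(); a hedged sketch with a hypothetical file name:

    # Submitted with something like:
    #   spark-submit --files /local/path/config.json spark_distro.py
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="files-demo")
    path_on_node = SparkFiles.get("config.json")   # refer to the file by name only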
15 votes · 2 answers

Dataframe transpose with pyspark in Apache Spark

I have a dataframe df that has the following structure: +-----+-----+-----+-------+ | s |col_1|col_2|col_...| +-----+-----+-----+-------+ | f1 | 0.0| 0.6| ... | | f2 | 0.6| 0.7| ... | | f3 | 0.5| 0.9| ... | | ...| ...| ...| ... …
Mehdi Ben Hamida
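One common way to transpose is to unpivot the metric columns with stack() and then pivot on the original key column; a sketch using the column names from the excerpt (extend the stack() call for more columns):

    from pyspark.sql import functions as F

    long_df = df.selectExpr(
        "s",
        "stack(2, 'col_1', col_1, 'col_2', col_2) as (metric, value)"
    )
    transposed = long_df.groupBy("metric").pivot("s").agg(F.first("value"))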
15 votes · 1 answer

How to do count(*) within a spark dataframe groupBy

My intention is to do the equivalent of the basic SQL: select shipgrp, shipstatus, count(*) cnt from shipstatus group by shipgrp, shipstatus. The examples that I have seen for Spark dataframes include rollups by other columns:…
WestCoastProjects
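A sketch of the DataFrame equivalent of that SQL, assuming df holds the shipstatus data:

    from pyspark.sql import functions as F

    counts = (df.groupBy("shipgrp", "shipstatus")
                .agg(F.count("*").alias("cnt")))

    # Shorthand that produces a column literally named "count":
    counts_short = df.groupBy("shipgrp", "shipstatus").count()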
15 votes · 4 answers

How to read streaming dataset once and output to multiple sinks?

I have a Spark Structured Streaming job that reads from S3, transforms the data, and then stores it to one S3 sink and one Elasticsearch sink. Currently, I am doing readStream once and then writeStream.format("").start() twice. When doing so it seems…
s11230
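One approach often suggested for writing a single stream to several sinks is foreachBatch (Spark 2.4+); a hedged sketch where the paths, the index name, and the Elasticsearch format string (elasticsearch-hadoop connector) are assumptions:

    def write_both_sinks(batch_df, batch_id):
        batch_df.persist()                                                 # reuse the batch for both writes
        batch_df.write.mode("append").parquet("s3a://my-bucket/output/")   # hypothetical path
        (batch_df.write.format("org.elasticsearch.spark.sql")
                 .mode("append")
                 .save("my-index"))                                        # hypothetical index
        batch_df.unpersist()

    query = stream_df.writeStream.foreachBatch(write_both_sinks).start()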