Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction, which can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning and graph computing.
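
As a rough illustration of this model, here is a minimal PySpark sketch (the input path and column name are hypothetical) that caches a dataset in memory and queries it repeatedly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    df = spark.read.parquet("events.parquet")  # hypothetical input file
    df.cache()                                 # keep the data in cluster memory

    # Repeated queries reuse the cached data instead of re-reading it from disk.
    print(df.count())
    print(df.filter(df["value"] > 100).count())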

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
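
For example, a minimal Structured Streaming sketch in the default micro-batch mode might look like this (it assumes a text socket source on localhost:9999; Kafka, file, and other supported sources follow the same pattern):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Read a stream of lines from a socket source (hypothetical host/port).
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split the lines into words and maintain a running count per word.
    words = lines.select(explode(split(lines["value"], " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()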

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
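
A minimal reproducible example usually needs no more than a tiny inline dataset and the version in use, e.g. (hypothetical data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()
    print(spark.version)                  # state the Spark version in the question

    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, None)],  # small, inline sample data
        ["id", "label"],
    )
    df.show()                             # include the actual and expected output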

Recommended reference sources:

Latest version
Release Notes for Stable Releases
Apache Spark GitHub Repository

81095 questions
67 votes · 3 answers

Apache Spark vs Akka

Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant to program distributed and parallel computations, yet I don't see the link or the difference between them. Moreover, I would like to get the…
user4658980
66 votes · 8 answers

Pyspark: Pass multiple columns in UDF

I am writing a User Defined Function which will take all the columns except the first one in a dataframe and do sum (or any other operation). Now the dataframe can sometimes have 3 columns or 4 columns or more. It will vary. I know I can hard code…
sjishan
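
One common approach (a sketch with hypothetical column names) is to pack the columns into a single array column and pass that to the UDF:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1.0, 2.0, 3.0)], ["id", "c1", "c2", "c3"])

    # The UDF receives one Python list per row, however many columns there are.
    sum_udf = F.udf(lambda values: float(sum(values)), DoubleType())

    df.withColumn("total", sum_udf(F.array(*df.columns[1:]))).show()
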
66 votes · 3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…
user2205916
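
A sketch using when/otherwise, with the column names taken from the question's pseudocode (the null test comes first, because an equality comparison involving null never evaluates to true):

    from pyspark.sql import functions as F

    df = df.withColumn(
        "new_column",
        F.when(F.col("fruit1").isNull() | F.col("fruit2").isNull(), 3)
         .when(F.col("fruit1") == F.col("fruit2"), 1)
         .otherwise(0),
    )
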
66 votes · 5 answers

Spark sql how to explode without losing null values

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance, id | name |…
alexgbelov
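
In more recent Spark versions, explode_outer keeps rows whose array column is null or empty; a minimal sketch with a hypothetical array column name:

    from pyspark.sql import functions as F

    # Rows where "values" is null/empty survive with a null in the exploded column.
    flattened = df.withColumn("value", F.explode_outer(F.col("values")))
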
66 votes · 9 answers

spark dataframe drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes? Pandas: df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime',…
ad_s
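
One common equivalent is a window with row_number, keeping the newest row per key (only the column names visible in the excerpt are used; the remaining key columns are elided there):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Partition by the "subset" columns, order so the row to keep comes first.
    w = (Window.partitionBy("scheduled_datetime")  # plus the other key columns
               .orderBy(F.col("actual_datetime").desc()))

    deduped = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))
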
66 votes · 3 answers

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported…
clay
66 votes · 3 answers

How to aggregate values into collection after groupBy?

I have a dataframe with schema as such: [visitorId: string, trackingIds: array, emailIds: array] Looking for a way to group (or maybe rollup?) this dataframe by visitorid where the trackingIds and emailIds columns would append…
Eric Patterson
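
In the DataFrame API (sketched here in PySpark; the Scala calls are analogous), collect_list gathers grouped values into arrays, and flatten (Spark 2.4+) merges the per-row arrays:

    from pyspark.sql import functions as F

    result = df.groupBy("visitorId").agg(
        F.flatten(F.collect_list("trackingIds")).alias("trackingIds"),
        F.flatten(F.collect_list("emailIds")).alias("emailIds"),
    )
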
66 votes · 3 answers

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0),…
curtisp
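
The built-in date functions extract the parts directly (column name taken from the question):

    from pyspark.sql import functions as F

    df = (df.withColumn("year", F.year("date"))
            .withColumn("month", F.month("date"))
            .withColumn("day", F.dayofmonth("date")))
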
66 votes · 3 answers

How to convert a DataFrame back to normal RDD in pyspark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data? Note: this is…
WestCoastProjects
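
df.rdd exposes the underlying RDD of Row objects; mapping it to key/value tuples gives a pair RDD that partitionBy accepts (the key name is hypothetical):

    # Turn the DataFrame into a pair RDD keyed by one of its columns.
    pair_rdd = df.rdd.map(lambda row: (row["key"], row))

    # Optionally pass a custom partitioning function as the second argument.
    repartitioned = pair_rdd.partitionBy(10)
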
66 votes · 8 answers

How do I log from my Python Spark script

I have a Python Spark program which I run with spark-submit. I want to put logging statements in it. logging.info("This is an informative message.") logging.debug("This is a debug message.") I want to use the same logger that Spark is using so that…
W.P. McNeill
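
One commonly used approach (driver-side only, and relying on the internal _jvm gateway) is to reuse Spark's own log4j logger:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Grab a log4j logger through the JVM gateway so messages land in Spark's logs.
    log4j = spark.sparkContext._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("my_app")

    logger.info("This is an informative message.")
    logger.debug("This is a debug message.")
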
65 votes · 8 answers

get datatype of column using pyspark

We are reading data from a MongoDB collection. A collection column has two different values (e.g.: (bson.Int64, int), (int, float)). I am trying to get the datatype using pyspark. My problem is that some columns have a different datatype. Assume quantity and…
Sreenuvasulu
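
dtypes and schema both expose the column types; a sketch with a hypothetical column name:

    # dtypes is a list of (name, type-string) pairs; schema gives the full StructField.
    print(dict(df.dtypes)["quantity"])        # e.g. 'bigint'
    print(df.schema["quantity"].dataType)     # e.g. LongType()
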
65 votes · 6 answers

How to melt Spark DataFrame?

Is there an equivalent of Pandas Melt function in Apache Spark in PySpark or at least in Scala? I was running a sample dataset till now in Python and now I want to use Spark for the entire dataset.
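
There is no built-in melt before Spark 3.4 (which adds DataFrame.unpivot/melt); a common workaround is to explode an array of structs. A sketch with hypothetical id/value columns:

    from pyspark.sql import functions as F

    def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
        # One struct per value column, exploded into one row per (id, column) pair.
        pairs = F.explode(F.array(*[
            F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
            for c in value_vars
        ])).alias("pair")
        return df.select(*id_vars, pairs).select(
            *id_vars, f"pair.{var_name}", f"pair.{value_name}"
        )

    long_df = melt(df, id_vars=["id"], value_vars=["a", "b", "c"])
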
65 votes · 4 answers

Fetching distinct values on a column using Spark DataFrame

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger. I understand that doing a distinct.collect() will…
Kazhiyur
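
distinct() on the single column keeps the computation distributed; collect() is only needed (and only safe) when the result is known to be small. A sketch with a hypothetical column name:

    distinct_values = df.select("category").distinct()
    distinct_values.show(20)

    # Only pull to the driver if the distinct set is small enough to fit in memory.
    values = [row["category"] for row in distinct_values.collect()]
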
65 votes · 14 answers

Automatically and Elegantly flatten DataFrame in Spark SQL

All, is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType? For example, if my schema is: foo |_bar |_baz x y z, how do I select it into a flattened tabular form without resorting to…
echen
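
One possible sketch walks the schema and builds a flat select list (shown in PySpark; the separator and naming scheme are arbitrary choices):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType

    def flat_columns(schema, prefix=None):
        # Recursively expand StructType fields into fully qualified column references.
        cols = []
        for field in schema.fields:
            name = field.name if prefix is None else f"{prefix}.{field.name}"
            if isinstance(field.dataType, StructType):
                cols += flat_columns(field.dataType, name)
            else:
                cols.append(F.col(name).alias(name.replace(".", "_")))
        return cols

    flat_df = df.select(flat_columns(df.schema))
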
65 votes · 11 answers

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select Example JSON schema: { "a": { "b": 1, "c": 2 } } This is what I want to do: potential_columns = Seq("b", "c",…
ben
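
In PySpark, top-level columns are just a Python list, and nested fields can be probed with a try/except around the select (names taken from the question's JSON):

    from pyspark.sql.utils import AnalysisException

    # Top-level column check.
    if "a" in df.columns:
        df.select("a").show()

    # Nested field check: attempt the select and catch the failure.
    try:
        df.select("a.b")
    except AnalysisException:
        pass  # the column does not exist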