Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala. It provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction, which can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning and graph computing.
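
As a rough illustration of this model, here is a minimal PySpark sketch (the input path and column name are hypothetical) that caches a dataset in memory and queries it repeatedly:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    df = spark.read.parquet("events.parquet")  # hypothetical input file
    df.cache()                                 # keep the data in cluster memory

    # Repeated queries reuse the cached data instead of re-reading it from disk.
    print(df.count())
    print(df.filter(df["value"] > 100).count())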

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
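
For example, a minimal Structured Streaming sketch in the default micro-batch mode might look like this (it assumes a text socket source on localhost:9999; Kafka, file, and other supported sources follow the same pattern):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Read a stream of lines from a socket source (hypothetical host/port).
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Split the lines into words and maintain a running count per word.
    words = lines.select(explode(split(lines["value"], " ")).alias("word"))
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()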

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
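
A minimal reproducible example usually needs no more than a tiny inline dataset and the version in use, e.g. (hypothetical data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()
    print(spark.version)                  # state the Spark version in the question

    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, None)],  # small, inline sample data
        ["id", "label"],
    )
    df.show()                             # include the actual and expected output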

Recommended reference sources:

Latest version
Release Notes for Stable Releases
Apache Spark GitHub Repository

81095 questions
67 votes · 3 answers

Apache Spark vs Akka

Could you please tell me the difference between Apache Spark and Akka? I know that both frameworks are meant to program distributed and parallel computations, yet I don't see the link or the difference between them. Moreover, I would like to get the…
user4658980
66 votes · 8 answers

Pyspark: Pass multiple columns in UDF

I am writing a User Defined Function which will take all the columns except the first one in a dataframe and do sum (or any other operation). Now the dataframe can sometimes have 3 columns or 4 columns or more. It will vary. I know I can hard code…
sjishan
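
One common approach (a sketch with hypothetical column names) is to pack the columns into a single array column and pass that to the UDF:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1.0, 2.0, 3.0)], ["id", "c1", "c2", "c3"])

    # The UDF receives one Python list per row, however many columns there are.
    sum_udf = F.udf(lambda values: float(sum(values)), DoubleType())

    df.withColumn("total", sum_udf(F.array(*df.columns[1:]))).show()
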
66 votes · 3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…
user2205916
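
A sketch using when/otherwise, with the column names taken from the question's pseudocode (the null test comes first, because an equality comparison involving null never evaluates to true):

    from pyspark.sql import functions as F

    df = df.withColumn(
        "new_column",
        F.when(F.col("fruit1").isNull() | F.col("fruit2").isNull(), 3)
         .when(F.col("fruit1") == F.col("fruit2"), 1)
         .otherwise(0),
    )
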
66 votes · 5 answers

Spark sql how to explode without losing null values

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance, id | name |…
alexgbelov
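
In more recent Spark versions, explode_outer keeps rows whose array column is null or empty; a minimal sketch with a hypothetical array column name:

    from pyspark.sql import functions as F

    # Rows where "values" is null/empty survive with a null in the exploded column.
    flattened = df.withColumn("value", F.explode_outer(F.col("values")))
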
66 votes · 9 answers

spark dataframe drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes? Pandas: df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime',…
ad_s
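
One common equivalent is a window with row_number, keeping the newest row per key (only the column names visible in the excerpt are used; the remaining key columns are elided there):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Partition by the "subset" columns, order so the row to keep comes first.
    w = (Window.partitionBy("scheduled_datetime")  # plus the other key columns
               .orderBy(F.col("actual_datetime").desc()))

    deduped = (df.withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))
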
66 votes · 3 answers

Why is "Unable to find encoder for type stored in a Dataset" when creating a dataset of custom case class?

Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported…
clay
66 votes · 3 answers

How to aggregate values into collection after groupBy?

I have a dataframe with schema as such: [visitorId: string, trackingIds: array, emailIds: array] Looking for a way to group (or maybe rollup?) this dataframe by visitorid where the trackingIds and emailIds columns would append…
Eric Patterson
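
In the DataFrame API (sketched here in PySpark; the Scala calls are analogous), collect_list gathers grouped values into arrays, and flatten (Spark 2.4+) merges the per-row arrays:

    from pyspark.sql import functions as F

    result = df.groupBy("visitorId").agg(
        F.flatten(F.collect_list("trackingIds")).alias("trackingIds"),
        F.flatten(F.collect_list("emailIds")).alias("emailIds"),
    )
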
66 votes · 3 answers

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0),…
curtisp
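
The built-in date functions extract the parts directly (column name taken from the question):

    from pyspark.sql import functions as F

    df = (df.withColumn("year", F.year("date"))
            .withColumn("month", F.month("date"))
            .withColumn("day", F.dayofmonth("date")))
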
66 votes · 3 answers

How to convert a DataFrame back to normal RDD in pyspark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data? Note: this is…
WestCoastProjects
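
df.rdd exposes the underlying RDD of Row objects; mapping it to key/value tuples gives a pair RDD that partitionBy accepts (the key name is hypothetical):

    # Turn the DataFrame into a pair RDD keyed by one of its columns.
    pair_rdd = df.rdd.map(lambda row: (row["key"], row))

    # Optionally pass a custom partitioning function as the second argument.
    repartitioned = pair_rdd.partitionBy(10)
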
66 votes · 8 answers

How do I log from my Python Spark script

I have a Python Spark program which I run with spark-submit. I want to put logging statements in it. logging.info("This is an informative message.") logging.debug("This is a debug message.") I want to use the same logger that Spark is using so that…
W.P. McNeill
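
One commonly used approach (driver-side only, and relying on the internal _jvm gateway) is to reuse Spark's own log4j logger:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Grab a log4j logger through the JVM gateway so messages land in Spark's logs.
    log4j = spark.sparkContext._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("my_app")

    logger.info("This is an informative message.")
    logger.debug("This is a debug message.")
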
65 votes · 8 answers

get datatype of column using pyspark

We are reading data from a MongoDB collection. A collection column has two different values (e.g.: (bson.Int64, int), (int, float)). I am trying to get the datatype using pyspark. My problem is that some columns have a different datatype. Assume quantity and…
Sreenuvasulu
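
dtypes and schema both expose the column types; a sketch with a hypothetical column name:

    # dtypes is a list of (name, type-string) pairs; schema gives the full StructField.
    print(dict(df.dtypes)["quantity"])        # e.g. 'bigint'
    print(df.schema["quantity"].dataType)     # e.g. LongType()
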
65 votes · 6 answers

How to melt Spark DataFrame?

Is there an equivalent of Pandas Melt function in Apache Spark in PySpark or at least in Scala? I was running a sample dataset till now in Python and now I want to use Spark for the entire dataset.
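
There is no built-in melt before Spark 3.4 (which adds DataFrame.unpivot/melt); a common workaround is to explode an array of structs. A sketch with hypothetical id/value columns:

    from pyspark.sql import functions as F

    def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
        # One struct per value column, exploded into one row per (id, column) pair.
        pairs = F.explode(F.array(*[
            F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
            for c in value_vars
        ])).alias("pair")
        return df.select(*id_vars, pairs).select(
            *id_vars, f"pair.{var_name}", f"pair.{value_name}"
        )

    long_df = melt(df, id_vars=["id"], value_vars=["a", "b", "c"])
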
65 votes · 4 answers

Fetching distinct values on a column using Spark DataFrame

Using Spark 1.6.1 version I need to fetch distinct values on a column and then perform some specific transformation on top of it. The column contains more than 50 million records and can grow larger. I understand that doing a distinct.collect() will…
Kazhiyur
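
distinct() on the single column keeps the computation distributed; collect() is only needed (and only safe) when the result is known to be small. A sketch with a hypothetical column name:

    distinct_values = df.select("category").distinct()
    distinct_values.show(20)

    # Only pull to the driver if the distinct set is small enough to fit in memory.
    values = [row["category"] for row in distinct_values.collect()]
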
65 votes · 14 answers

Automatically and Elegantly flatten DataFrame in Spark SQL

All, is there an elegant and accepted way to flatten a Spark SQL table (Parquet) with columns that are of nested StructType? For example, if my schema is: foo |_bar |_baz x y z, how do I select it into a flattened tabular form without resorting to…
echen
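
One possible sketch walks the schema and builds a flat select list (shown in PySpark; the separator and naming scheme are arbitrary choices):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType

    def flat_columns(schema, prefix=None):
        # Recursively expand StructType fields into fully qualified column references.
        cols = []
        for field in schema.fields:
            name = field.name if prefix is None else f"{prefix}.{field.name}"
            if isinstance(field.dataType, StructType):
                cols += flat_columns(field.dataType, name)
            else:
                cols.append(F.col(name).alias(name.replace(".", "_")))
        return cols

    flat_df = df.select(flat_columns(df.schema))
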
65 votes · 11 answers

How do I detect if a Spark DataFrame has a column

When I create a DataFrame from a JSON file in Spark SQL, how can I tell if a given column exists before calling .select Example JSON schema: { "a": { "b": 1, "c": 2 } } This is what I want to do: potential_columns = Seq("b", "c",…
ben
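
In PySpark, top-level columns are just a Python list, and nested fields can be probed with a try/except around the select (names taken from the question's JSON):

    from pyspark.sql.utils import AnalysisException

    # Top-level column check.
    if "a" in df.columns:
        df.select("a").show()

    # Nested field check: attempt the select and catch the failure.
    try:
        df.select("a.b")
    except AnalysisException:
        pass  # the column does not exist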