Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Typical use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
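For illustration, a minimal PySpark sketch of that load-once, query-repeatedly pattern (the file path and the country column are invented for the sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once, keep the data in cluster memory, then query it repeatedly
events = spark.read.parquet("/data/events")  # hypothetical path
events.cache()

events.filter(events.country == "US").count()  # first action materializes the cache
events.groupBy("country").count().show()       # later queries read from memory
```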

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries and windowing over streams, running ML libraries to learn from streamed data, and so on.
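As one concrete example of the micro-batch approach, a minimal Structured Streaming sketch (the socket source host and port are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Micro-batch Structured Streaming over a socket source
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

counts = lines.groupBy("value").count()  # running counts over the stream

query = (counts.writeStream
               .outputMode("complete")   # re-emit the full aggregate each batch
               .format("console")
               .start())
query.awaitTermination()  # blocks; stop with query.stop() or Ctrl-C
```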

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior can differ between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
90 votes · 10 answers

collect_list by preserving order based on another variable

I am trying to create a new column of lists in Pyspark using a groupby aggregation on an existing set of columns. An example input data frame is provided below: ------------------------ id | date | value ------------------------ 1 |2014-01-03 …
Ravi
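One commonly used workaround, as a sketch (column names taken from the excerpt): collect (date, value) structs and sort the resulting array, since collect_list alone gives no ordering guarantee:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2014-01-03", 10), (1, "2014-01-04", 5), (2, "2014-01-03", 7)],
    ["id", "date", "value"],
)

# Structs sort by their first field, so the values end up ordered by date
result = (df.groupBy("id")
            .agg(F.sort_array(F.collect_list(F.struct("date", "value"))).alias("s"))
            .select("id", F.col("s.value").alias("values")))
result.show(truncate=False)
```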
90 votes · 10 answers

How to pivot Spark DataFrame?

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality for that in Scalding, and I believe in Pandas in Python, but I can't find…
J Calbreath
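A sketch of the usual approach, DataFrame.pivot (available since Spark 1.6); the sample columns here are invented:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "x", 1), ("A", "y", 2), ("B", "x", 3)],
    ["id", "key", "value"],
)

# pivot() turns the distinct values of `key` into columns
pivoted = df.groupBy("id").pivot("key").agg(F.sum("value"))
pivoted.show()  # columns: id, x, y (missing combinations become null)
```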
89 votes · 2 answers

Spark - SELECT WHERE or filtering?

What's the difference between selecting with a where clause and filtering in Spark? Are there use cases in which one is more appropriate than the other? When do I use DataFrame newdf =…
lte__
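In short, where is an alias of filter in the DataFrame API, so the two are interchangeable. A quick sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

# All three produce the same logical plan
df.filter(df.value > 1).show()
df.where(df.value > 1).show()
df.where("value > 1").show()  # a SQL expression string works too
```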
89 votes · 17 answers

How to link PyCharm with PySpark?

I'm new to Apache Spark and apparently installed apache-spark with Homebrew on my MacBook: Last login: Fri Jan 8 12:52:04 on console user@MacBook-Pro-de-User-2:~$ pyspark Python 2.7.10 (default, Jul 13 2015, 12:05:58) [GCC 4.2.1 Compatible…
tumbleweed
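One common approach is to point the IDE's interpreter at the Spark install with the findspark package (pip install findspark); the path below is hypothetical:

```python
import findspark
findspark.init("/usr/local/opt/apache-spark/libexec")  # hypothetical SPARK_HOME

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.version)
```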
88 votes · 8 answers

How to pass -D parameter or environment variable to Spark job?

I want to change the Typesafe config of a Spark job in dev/prod environments. It seems to me that the easiest way to accomplish this is to pass -Dconfig.resource=ENVNAME to the job; then the Typesafe config library will do the job for me. Is there a way to…
kopiczko
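A sketch of one way to do this: forward the -D system property to the driver and executor JVMs through extraJavaOptions (the config file name comes from the question; the same settings can be passed as --conf flags to spark-submit):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.extraJavaOptions", "-Dconfig.resource=dev.conf")
         .config("spark.executor.extraJavaOptions", "-Dconfig.resource=dev.conf")
         .getOrCreate())
```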
87 votes · 4 answers

What is the relationship between workers, worker instances, and executors?

In Spark Standalone mode, there are master and worker nodes. Here are a few questions: Do 2 worker instances mean one worker node with 2 worker processes? Does every worker instance hold an executor for a specific application (which manages storage,…
edwardsbean
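As a rough sketch of the knobs involved (master URL and sizes are placeholders): in standalone mode, SPARK_WORKER_INSTANCES in conf/spark-env.sh controls worker processes per node, while executors are sized per application:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")     # hypothetical standalone master
         .config("spark.executor.memory", "2g")  # memory per executor
         .config("spark.executor.cores", "2")    # cores per executor
         .config("spark.cores.max", "8")         # total cores for this application
         .getOrCreate())
```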
86 votes · 4 answers

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as…
Steve
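One common approach on Spark 2.4+, as a sketch (column names are invented): zip the arrays by position with arrays_zip, then explode once:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [1, 2, 3], ["a", "b", "c"])],
    ["id", "nums", "letters"],
)

# arrays_zip pairs elements by position; explode then yields one row per pair
result = (df
          .withColumn("z", F.explode(F.arrays_zip("nums", "letters")))
          .select("id",
                  F.col("z.nums").alias("num"),
                  F.col("z.letters").alias("letter")))
result.show()
```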
86 votes · 12 answers

How to filter out a null value from a Spark dataframe

I created a dataframe in spark with the following schema: root |-- user_id: long (nullable = false) |-- event_id: long (nullable = false) |-- invited: integer (nullable = false) |-- day_diff: long (nullable = true) |-- interested: integer…
Steven Li
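A minimal sketch of the usual answers, using the day_diff column from the schema in the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (2, 5)], "user_id long, day_diff long")

df.filter(F.col("day_diff").isNotNull()).show()  # keep rows where day_diff is set
df.na.drop(subset=["day_diff"]).show()           # same result via na.drop
```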
86 votes · 3 answers

How does HashPartitioner work?

I read up on the documentation of HashPartitioner. Unfortunately nothing much was explained except for the API calls. I am under the assumption that HashPartitioner partitions the distributed set based on the hash of the keys. For example if my data…
Sohaib
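Roughly, a hash partitioner assigns each key to partition hash(key) mod numPartitions, so equal keys always land in the same partition. A PySpark sketch (PySpark's partitionBy hashes with portable_hash by default; Scala's HashPartitioner behaves analogously):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("c", 1)])

# glom() exposes partition contents; both ("a", ...) records share a partition
print(pairs.partitionBy(2).glom().collect())
```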
85 votes · 22 answers

How to perform union on two DataFrames with different amounts of columns in Spark?

I have 2 DataFrames and I need a union like this: The unionAll function doesn't work because the number and the names of the columns are different. How can I do this?
Allan Feliph
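A sketch of two common approaches (the allowMissingColumns flag requires Spark 3.1+; the sample frames are invented):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "a")], ["id", "x"])
df2 = spark.createDataFrame([(2, 3.0)], ["id", "y"])

# Spark 3.1+: fill columns missing on either side with nulls
combined = df1.unionByName(df2, allowMissingColumns=True)

# Older versions: add the missing columns explicitly, then union by name
combined_old = (df1.withColumn("y", F.lit(None).cast("double"))
                   .unionByName(df2.withColumn("x", F.lit(None).cast("string"))))
combined_old.show()
```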
85 votes · 9 answers

How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect to find the median. This question is similar to this question: How can I…
pr338
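A sketch using approxQuantile (Spark 2.0+, DataFrame API), which computes quantiles without collecting the data to the driver:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(700000).withColumnRenamed("id", "value")

# [0.5] is the median; the third argument is the relative error
# (0.0 is exact but more expensive)
median = df.approxQuantile("value", [0.5], 0.01)[0]
print(median)
```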
84 votes · 13 answers

Provide schema while reading csv file as a dataframe in Scala Spark

I am trying to read a csv file into a dataframe. I know what the schema of my dataframe should be since I know my csv file. I am also using the spark-csv package to read the file. I am trying to specify the schema like below: val pagecount =…
Pa1
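The same idea as a PySpark sketch (the question itself uses Scala; field names and the path are invented): pass an explicit StructType so inference is skipped:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("project", StringType(), True),
    StructField("page", StringType(), True),
    StructField("count", LongType(), True),
])

pagecount = (spark.read
             .schema(schema)
             .option("header", "false")
             .csv("/data/pagecounts.csv"))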
84 votes · 16 answers

How to check the Spark version

As titled, how do I know which version of Spark has been installed on CentOS? The current system has CDH 5.1.0 installed.
HappyCoding
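A couple of common ways to check, as a sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # version of the running session
# From a shell: spark-submit --version  (or pyspark --version)
```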
83 votes · 4 answers

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them…
pault
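As a sketch of what that guide asks for: tiny inline data, the exact code that was run, and the expected output (the names here are invented for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "label"])
df.groupBy("id").agg(F.collect_list("label")).show()
# Expected: id=1 -> [a, b], id=2 -> [c]
```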
83 votes · 5 answers

What is the concept of application, job, stage and task in Spark?

Is my understanding right? Application: one spark-submit. Job: once a lazy evaluation happens, there is a job. Stage: it is related to the shuffle and the transformation type; it is hard for me to understand the boundary of a stage. Task: it…
cdhit
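A small sketch that maps those terms onto code (the numbers are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))
pairs = rdd.map(lambda x: (x % 10, 1))          # narrow: stays in the same stage
counts = pairs.reduceByKey(lambda a, b: a + b)  # shuffle: starts a new stage
counts.collect()                                # action: triggers one job (two stages)
# Each stage runs one task per partition; the Spark UI (port 4040) shows it all
```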