Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as iterative algorithms in machine learning or graph computing.
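For instance, a cached DataFrame can be queried repeatedly without re-reading the source. A minimal PySpark sketch with synthetic data (nothing here reflects a particular workload):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # load (or here, generate) the data once and keep it in memory
    df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)
    df.cache()

    # the first action materializes the cache; later queries reuse it
    print(df.filter("bucket = 3").count())
    print(df.groupBy("bucket").count().count())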

Spark can tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
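As one hedged illustration of the micro-batch route, a minimal Structured Streaming job reading a text stream from a socket source (the host and port below are placeholders) and counting lines per time window:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # text stream from a socket; windowed count per 10-second window
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    counts = (lines.withColumn("ts", F.current_timestamp())
                   .groupBy(F.window("ts", "10 seconds"))
                   .count())

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()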

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository


81095 questions
165 votes, 4 answers

Apache Spark: map vs mapPartitions?

What's the difference between an RDD's map and mapPartitions method? And does flatMap behave like map or like mapPartitions? Thanks. (edit) i.e. what is the difference (either semantically or in terms of execution) between def map[A, B](rdd:…
Nicholas White
  • 2,702
  • 3
  • 24
  • 28
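A minimal PySpark sketch of the distinction (synthetic RDD): map calls the function once per element, mapPartitions once per partition with an iterator over that partition, and flatMap behaves like map per element but flattens each returned iterable:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(range(10), 2)

    # map: called once per element
    doubled = rdd.map(lambda x: x * 2)

    # mapPartitions: called once per partition, receives an iterator, so
    # per-partition setup (e.g. opening a connection) happens only once
    def double_partition(rows):
        # expensive setup could go here, once per partition
        for x in rows:
            yield x * 2

    doubled_too = rdd.mapPartitions(double_partition)

    print(doubled.collect())      # same results,
    print(doubled_too.collect())  # different call pattern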
163 votes, 18 answers

How to check if spark dataframe is empty?

Right now, I have to use df.count > 0 to check if the DataFrame is empty or not. But it is kind of inefficient. Is there any better way to do that? PS: I want to check if it's empty so that I only save the DataFrame if it's not empty
auxdx
  • 2,313
  • 3
  • 21
  • 25
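A hedged sketch of the usual cheaper checks, on a tiny synthetic DataFrame (recent Spark versions also expose df.isEmpty() directly):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,)], ["id"]).filter("id > 1")   # empty on purpose

    # head(1) only has to produce one row, unlike a full count
    if len(df.head(1)) == 0:
        print("empty, skipping the save")

    # equivalent older idiom
    print(df.rdd.isEmpty())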
156 votes, 7 answers

How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with column as String. I wanted to change the column type to Double type in PySpark. Following is the way, I did: toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType()) changedTypedf =…
Abhishek Choudhary
  • 8,255
  • 19
  • 69
  • 128
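A minimal sketch of the cast-based approach (the column name is made up), which avoids the UDF round-trip entirely:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col
    from pyspark.sql.types import DoubleType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1.5",), ("2.25",)], ["amount"])

    # cast the existing column in place; "double" as a string also works
    df = df.withColumn("amount", col("amount").cast(DoubleType()))
    df.printSchema()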
155 votes, 12 answers

How to convert rdd object to dataframe in spark

How can I convert an RDD (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]) to a Dataframe org.apache.spark.sql.DataFrame. I converted a dataframe to rdd using .rdd. After processing it I want it back in dataframe. How can I do this ?
user568109
  • 47,225
  • 17
  • 99
  • 123
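The question is Scala, where rdd.toDF or spark.createDataFrame(rdd, schema) are the usual routes; as a hedged PySpark sketch of the same round trip with made-up data:

    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([Row(id=1, name="a"), Row(id=2, name="b")])

    rdd = df.rdd                                            # DataFrame -> RDD[Row]
    processed = rdd.map(lambda r: Row(id=r.id, name=r.name.upper()))

    # back to a DataFrame; pass an explicit schema if inference is not enough
    df2 = spark.createDataFrame(processed)
    df2.show()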
152 votes, 9 answers

How to delete columns in pyspark dataframe

>>> a DataFrame[id: bigint, julian_date: string, user_id: bigint] >>> b DataFrame[id: bigint, quan_created_money: decimal(10,0), quan_created_cnt: bigint] >>> a.join(b, a.id==b.id, 'outer') DataFrame[id: bigint, julian_date: string, user_id: bigint,…
xjx0524
  • 1,531
  • 2
  • 10
  • 5
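A minimal sketch using drop() on made-up frames; dropping b.id by reference sidesteps the ambiguous-column problem after the outer join:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    a = spark.createDataFrame([(1, "2015001", 10)], ["id", "julian_date", "user_id"])
    b = spark.createDataFrame([(1, 5)], ["id", "quan_created_cnt"])

    # drop() returns a new DataFrame without the named columns
    joined = a.join(b, a.id == b.id, "outer").drop(b.id)
    slim = joined.drop("julian_date", "user_id")   # dropping by name also works
    slim.show()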
145 votes, 12 answers

Spark Dataframe distinguish columns with duplicated name

So as I know in Spark Dataframe, that for multiple columns can have the same name as shown in below dataframe snapshot: [ Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0,…
resec
  • 2,091
  • 3
  • 13
  • 22
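A hedged sketch (synthetic frames) of the common alias-before-join pattern for telling the duplicated columns apart:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(107831, 0.1)], ["a", "f"])
    df2 = spark.createDataFrame([(107831, 0.2)], ["a", "f"])

    # alias each side, then refer to columns through the alias prefix
    joined = df1.alias("l").join(df2.alias("r"), col("l.a") == col("r.a"))
    joined.select(col("l.f").alias("f_left"), col("r.f").alias("f_right")).show()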
145 votes, 5 answers

How to define partitioning of DataFrame?

I've started using Spark SQL and DataFrames in Spark 1.4.0. I'm wanting to define a custom partitioner on DataFrames, in Scala, but not seeing how to do this. One of the data tables I'm working with contains a list of transactions, by account,…
rake
  • 2,348
  • 3
  • 15
  • 11
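A minimal sketch of the usual levers on recent versions (column names and output path are made up): repartition by column controls the in-memory partitioning, while partitionBy controls the on-disk layout when writing:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "acct-1", 10.0), (2, "acct-2", 20.0)],
        ["txn_id", "account", "amount"],
    )

    # hash-partition by account so a given account's rows share a partition
    by_account = df.repartition(200, col("account"))

    # one output directory per account value when writing
    by_account.write.mode("overwrite").partitionBy("account").parquet("/tmp/txns")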
139 votes, 8 answers

Sort in descending order in PySpark

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter & sort in the descending order. Trying to achieve it via this piece of code. group_by_dataframe.count().filter("`count` >= 10").sort('count',…
rclakmal
  • 1,872
  • 3
  • 17
  • 19
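A minimal sketch with a synthetic DataFrame; either desc("count") or col("count").desc() gives the descending order:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, desc

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("a",), ("b",)], ["k"])

    counts = df.groupBy("k").count().filter("count >= 1")
    counts.orderBy(desc("count")).show()
    counts.orderBy(col("count").desc()).show()   # equivalent spelling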
137 votes, 10 answers

How to print the contents of RDD?

I'm attempting to print the contents of a collection to the Spark console. I have a type: linesWithSessionId: org.apache.spark.rdd.RDD[String] = FilteredRDD[3] And I use the command: scala> linesWithSessionId.map(line => println(line)) But this is…
blue-sky
  • 51,962
  • 152
  • 427
  • 752
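The question is Scala, but the issue is the same in any API: map(println) runs on the executors, so nothing reaches the driver console. A hedged PySpark sketch of bringing a bounded sample back to the driver first:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    rdd = spark.sparkContext.parallelize(["line 1", "line 2", "line 3"])

    # collect a bounded sample to the driver, then print locally
    for line in rdd.take(10):        # rdd.collect() only if the RDD is small
        print(line)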
136 votes, 5 answers

How to kill a running Spark application?

I have a running Spark application where it occupies all the cores where my other applications won't be allocated any resource. I did some quick research and people suggested using YARN kill or /bin/spark-class to kill the command. However, I am…
B.Mr.W.
  • 18,910
  • 35
  • 114
  • 178
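For clusters managed by YARN, the standard route is the YARN CLI; a sketch (the application id is a placeholder):

    # list running applications to find the id
    yarn application -list

    # kill the one hogging the cores
    yarn application -kill <applicationId>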
134 votes, 9 answers

How to overwrite the output directory in spark

I have a spark streaming application which produces a dataset for every minute. I need to save/overwrite the results of the processed data. When I tried to overwrite the dataset org.apache.hadoop.mapred.FileAlreadyExistsException stops the…
Vijay Innamuri
  • 4,242
  • 7
  • 42
  • 67
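A hedged sketch (the output path is made up): the DataFrame writer has an overwrite mode, while the RDD saveAsTextFile path does not, hence the FileAlreadyExistsException:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "v"])

    # replaces the existing directory instead of failing
    df.write.mode("overwrite").parquet("/tmp/minute-output")

    # For RDD save* APIs, older setups disable the existence check with
    # spark.hadoop.validateOutputSpecs=false, or delete the directory first.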
134 votes, 20 answers

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736) I have Spark installed properly on my…
Glenn Strycker
  • 4,816
  • 6
  • 31
  • 51
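Two common fixes, sketched below: install pyspark into the interpreter with pip, or point an existing Python at an existing Spark install (findspark reads SPARK_HOME, or you can extend PYTHONPATH by hand):

    # pip install pyspark   <- simplest; the plain `python` shell can then import it

    # or reuse an existing Spark installation
    import findspark
    findspark.init()         # uses SPARK_HOME, or pass the install path explicitly

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    print(spark.version)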
131 votes, 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon
  • 2,409
  • 3
  • 20
  • 22
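A minimal sketch on the question's own toy data: aggregate on the executors and collect a single row, rather than collecting the whole column:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

    max_a = df.agg(F.max("A")).collect()[0][0]   # or .first()[0]
    print(max_a)                                 # 3.0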
130 votes, 6 answers

Convert pyspark string to date format

I have a date pyspark dataframe with a string column in the format of MM-dd-yyyy and I am attempting to convert this into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() And I get a string of nulls. Can anyone…
Jenks
  • 1,950
  • 3
  • 20
  • 27
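The nulls usually mean to_date was left to guess the pattern; on Spark 2.2+ it accepts the format explicitly. A sketch with a made-up value:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import to_date, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("02-28-2015",)], ["STRING_COLUMN"])

    # tell to_date the incoming pattern instead of relying on the default
    df.select(to_date(col("STRING_COLUMN"), "MM-dd-yyyy").alias("new_date")).show()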
129 votes, 14 answers

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them: from pyspark.sql.functions import randn, rand df_1 = sqlContext.range(0, 10) +--+ |id| +--+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| |…
Ivan
  • 19,560
  • 31
  • 97
  • 141
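A hedged sketch with synthetic frames: on Spark 3.1+ unionByName can fill the columns missing on either side with nulls; on older versions you add them yourself before the union:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()
    df_1 = spark.range(0, 3).withColumn("x", lit(1.0))
    df_2 = spark.range(3, 6).withColumn("y", lit(2.0))

    # Spark 3.1+: missing columns become nulls
    out = df_1.unionByName(df_2, allowMissingColumns=True)

    # older versions: add the missing columns explicitly first
    out_old = (df_1.withColumn("y", lit(None).cast("double"))
                   .unionByName(df_2.withColumn("x", lit(None).cast("double"))))
    out.show()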