Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
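
For illustration, a minimal Structured Streaming sketch of the micro-batch approach; the socket source on localhost:9999 is purely a stand-in, not something the tag wiki prescribes:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Read a stream of text lines from a (hypothetical) local socket source
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Maintain a running word count across micro-batches
    counts = (lines
              .select(F.explode(F.split("value", " ")).alias("word"))
              .groupBy("word")
              .count())

    # Print the updated counts to the console after each micro-batch
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()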

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior can differ between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
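
A reproducible example usually boils down to a tiny DataFrame built inline, the code that misbehaves, the expected output, and the Spark version. A minimal sketch of that pattern (the data and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()

    # Small inline dataset that anyone can paste and run
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])
    df.show()

    # State the Spark version alongside the example, since behavior can differ between releases
    print(spark.version)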

Recommended reference sources:

  • Latest version: Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
60 votes, 4 answers

GroupBy column and filter rows with maximum value in Pyspark

I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really…
Thomas (4,696 reputation; 5 gold, 36 silver, 71 bronze badges)
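
One common approach to this first question (keeping only the row with the per-group maximum) is a window function; a minimal sketch with made-up column names, not necessarily the accepted answer:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

    # Rank rows within each group by value (descending), then keep only the top-ranked rows
    w = Window.partitionBy("group").orderBy(F.col("value").desc())
    df.withColumn("rn", F.rank().over(w)).filter("rn = 1").drop("rn").show()
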
60 votes, 4 answers

Pyspark: Convert column to lowercase

I want to convert the values inside a column to lowercase. Currently if I use the lower() method, it complains that column objects are not callable. Since there's a function called lower() in SQL, I assume there's a native Spark solution that…
wlad (2,073 reputation; 2 gold, 18 silver, 29 bronze badges)
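
For the lowercase question, the usual route is the column function lower() from pyspark.sql.functions rather than a method on the Column object; a minimal sketch with a hypothetical column name:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice",), ("BOB",)], ["name"])

    # F.lower takes a Column (or column name) and returns a new Column expression
    df.withColumn("name_lower", F.lower(F.col("name"))).show()
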
60 votes, 6 answers

Spark dataframe: collect() vs select()

Calling collect() on an RDD will return the entire dataset to the driver, which can cause out-of-memory errors, so we should avoid that. Will collect() behave the same way if called on a dataframe? What about the select() method? Does it also work the same…
Mrinal (1,826 reputation; 2 gold, 19 silver, 31 bronze badges)
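
A short sketch of the distinction the question is after: select() is a lazy transformation returning a new DataFrame, while collect() is an action that materializes every row on the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    # select() is lazy: nothing executes yet and the data stays distributed
    projected = df.select("id")

    # collect() is an action: all resulting rows are pulled into driver memory,
    # so it should only be used on results known to be small
    rows = projected.collect()
    print(rows)
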
60 votes, 6 answers

PySpark create new column with mapping from a dict

Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict here below. How do I map this? (e.g. 'A' needs to be mapped…
ad_s (1,560 reputation; 4 gold, 15 silver, 16 bronze badges)
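
One widely used technique for the dict-mapping question (not necessarily the one the answers settled on) is to build a literal map expression from the dict; the mapping below is made up for illustration:

    from itertools import chain
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("A",), ("DS",), ("Z",)], ["col1"])

    # Hypothetical mapping; keys absent from it produce NULL in col2
    mapping = {"A": "small", "DS": "double", "E": "other"}

    # Flatten the dict into alternating key/value literals and build a map column,
    # then look each col1 value up in that map
    mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
    df.withColumn("col2", mapping_expr[F.col("col1")]).show()
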
60 votes, 6 answers

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort it descending instead of the default ascending. Here is my working code: from pyspark import HiveContext from pyspark.sql.types import * from…
jKraut (2,325 reputation; 6 gold, 35 silver, 48 bronze badges)
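
For the descending row_number() question, the ordering is flipped inside the window definition; a minimal sketch with hypothetical column names:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["grp", "score"])

    # orderBy(F.col(...).desc()) replaces the default ascending sort within each partition
    w = Window.partitionBy("grp").orderBy(F.col("score").desc())
    df.withColumn("row_number", F.row_number().over(w)).show()
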
60 votes, 8 answers

Unpacking a list to select multiple columns from a spark data frame

I have a spark data frame df. Is there a way of sub selecting a few columns using a list of these columns? scala> df.columns res0: Array[String] = Array("a", "b", "c", "d") I know I can do something like df.select("b", "c"). But suppose I have a…
Ben (4,774 reputation; 5 gold, 22 silver, 26 bronze badges)
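
That question is asked in Scala (where the answers typically expand the list as varargs); purely for contrast, a PySpark analogue that unpacks a Python list of column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3, 4)], ["a", "b", "c", "d"])

    # Unpacking with * passes each name as a separate argument to select()
    wanted = ["b", "c"]
    df.select(*wanted).show()
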
60 votes, 7 answers

Filtering a spark dataframe based on date

I have a dataframe of (date, string, string) and I want to select dates before a certain period. I have tried the following with no luck: data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime)) I'm getting an error stating the…
Steve (21,163 reputation; 21 gold, 69 silver, 92 bronze badges)
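
The code in that question is Scala; a hedged PySpark sketch of the same idea, comparing the date column against a literal cast to a date (column names and cut-off are illustrative):

    import datetime

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(datetime.date(2015, 3, 10), "x"), (datetime.date(2015, 3, 20), "y")],
        ["date", "value"],
    )

    # Compare against a literal cast to DateType instead of building java.sql.Date objects
    df.filter(F.col("date") < F.lit("2015-03-14").cast("date")).show()
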
60 votes, 2 answers

Which operations preserve RDD order?

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply. Now, which operations preserve that order? E.g., is it guaranteed that (after…
sds (58,617 reputation; 29 gold, 161 silver, 278 bronze badges)
60 votes, 5 answers

Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

I'm trying to implement in Spark a Hadoop Map/Reduce job that worked fine before. The Spark app definition is the following: val data = spark.textFile(file, 2).cache() val result = data .map(//some pre-processing) .map(docWeightPar =>…
Augusto (988 reputation; 1 gold, 11 silver, 19 bronze badges)
59 votes, 5 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
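
A commonly suggested mitigation for this one (one option among several, depending on skew) is to repartition by the partition column before writing, so each key's directory ends up with far fewer files; paths and column names below are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

    # Repartitioning by "key" first means each partition directory is written by a single
    # task, which keeps the parquet file count per directory low (at the cost of a shuffle)
    (data.repartition("key")
         .write
         .mode("overwrite")
         .partitionBy("key")
         .parquet("/tmp/partitioned_output"))  # hypothetical output path
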
59 votes, 7 answers

Pyspark: Parse a column of json strings

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of json. I'd like to parse each row and return a new dataframe where each row is the parsed json. # Sample Data Frame jstr1 =…
Steve (2,401 reputation; 3 gold, 24 silver, 28 bronze badges)
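
A minimal sketch of one way to do this, using from_json with an explicit schema; the field names are invented, and the answers also cover schema-inference variants:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('{"name": "a", "count": 1}',)], ["json"])

    # Parse each JSON string into a struct column, then flatten the struct into top-level columns
    schema = StructType([
        StructField("name", StringType()),
        StructField("count", LongType()),
    ])
    df.withColumn("parsed", F.from_json("json", schema)).select("parsed.*").show()
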
59 votes, 6 answers

Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array with only distinct values in it. See example below: Ex: [24,23,27,23] should get converted to [24,…
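
This error usually means a UDF is returning NumPy scalar types that Spark's serializer cannot map to SQL types; a hedged sketch of the usual remedy, converting to plain Python ints before returning:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([24, 23, 27, 23],)], ["values"])

    # If the UDF body produced numpy.int64 values, casting each element to a built-in
    # int before returning keeps the declared ArrayType(IntegerType()) serializable
    @F.udf(ArrayType(IntegerType()))
    def distinct_values(arr):
        return [int(x) for x in set(arr)]

    df.withColumn("distinct", distinct_values("values")).show(truncate=False)
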
59 votes, 4 answers

Generate a Spark StructType / Schema from a case class

If I wanted to create a StructType (i.e. a DataFrame.schema) out of a case class, is there a way to do it without creating a DataFrame? I can easily do: case class TestCase(id: Long) val schema = Seq[TestCase]().toDF.schema But it seems overkill to…
David Griffin (13,677 reputation; 5 gold, 47 silver, 65 bronze badges)
59 votes, 3 answers

Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs, since they appear to be more high-level than RDDs and tend to produce more readable code. In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to IDs by two…
Quentin Pradet (4,691 reputation; 2 gold, 29 silver, 41 bronze badges)
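
Besides the window-function sketch shown for the similar groupBy-maximum question above, joining against the per-group maximum also works; a minimal sketch with hypothetical columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["name", "count"])

    # Compute each group's maximum, then join it back to keep only the matching rows
    maxima = df.groupBy("name").agg(F.max("count").alias("count"))
    df.join(maxima, on=["name", "count"]).show()
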
58 votes, 4 answers

Python Spark Cumulative Sum by Group Using DataFrame

How do I compute the cumulative sum per group, specifically using the DataFrame abstraction, in PySpark? With an example dataset as follows: df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], …
mr kw (1,977 reputation; 2 gold, 14 silver, 12 bronze badges)
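
A minimal sketch of one way to do this, using a window that spans from the start of each group up to the current row; the column names are placeholders for the unnamed columns in the example data:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
        ["time", "value", "class"],
    )

    # Sum "value" over all rows from the start of each "class" partition up to the current row
    w = (Window.partitionBy("class")
               .orderBy("time")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df.withColumn("cumsum", F.sum("value").over(w)).show()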