Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
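
For illustration, a minimal Structured Streaming sketch of the micro-batch approach; the socket source on localhost:9999 is purely a stand-in, not something the tag wiki prescribes:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Read a stream of text lines from a (hypothetical) local socket source
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Maintain a running word count across micro-batches
    counts = (lines
              .select(F.explode(F.split("value", " ")).alias("word"))
              .groupBy("word")
              .count())

    # Print the updated counts to the console after each micro-batch
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()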

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior can differ between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
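
A reproducible example usually boils down to a tiny DataFrame built inline, the code that misbehaves, the expected output, and the Spark version. A minimal sketch of that pattern (the data and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()

    # Small inline dataset that anyone can paste and run
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])
    df.show()

    # State the Spark version alongside the example, since behavior can differ between releases
    print(spark.version)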

Recommended reference sources:

  • Latest version: Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
60 votes, 4 answers

GroupBy column and filter rows with maximum value in Pyspark

I am almost certain this has been asked before, but a search through stackoverflow did not answer my question. Not a duplicate of [2] since I want the maximum value, not the most frequent item. I am new to pyspark and trying to do something really…
Thomas (4,696 reputation; 5 gold, 36 silver, 71 bronze badges)
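
One common approach to this first question (keeping only the row with the per-group maximum) is a window function; a minimal sketch with made-up column names, not necessarily the accepted answer:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["group", "value"])

    # Rank rows within each group by value (descending), then keep only the top-ranked rows
    w = Window.partitionBy("group").orderBy(F.col("value").desc())
    df.withColumn("rn", F.rank().over(w)).filter("rn = 1").drop("rn").show()
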
60 votes, 4 answers

Pyspark: Convert column to lowercase

I want to convert the values inside a column to lowercase. Currently if I use the lower() method, it complains that column objects are not callable. Since there's a function called lower() in SQL, I assume there's a native Spark solution that…
wlad (2,073 reputation; 2 gold, 18 silver, 29 bronze badges)
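
For the lowercase question, the usual route is the column function lower() from pyspark.sql.functions rather than a method on the Column object; a minimal sketch with a hypothetical column name:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice",), ("BOB",)], ["name"])

    # F.lower takes a Column (or column name) and returns a new Column expression
    df.withColumn("name_lower", F.lower(F.col("name"))).show()
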
60 votes, 6 answers

Spark dataframe: collect() vs select()

Calling collect() on an RDD will return the entire dataset to the driver, which can cause out-of-memory errors, so we should avoid that. Will collect() behave the same way if called on a dataframe? What about the select() method? Does it also work the same…
Mrinal (1,826 reputation; 2 gold, 19 silver, 31 bronze badges)
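
A short sketch of the distinction the question is after: select() is a lazy transformation returning a new DataFrame, while collect() is an action that materializes every row on the driver:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

    # select() is lazy: nothing executes yet and the data stays distributed
    projected = df.select("id")

    # collect() is an action: all resulting rows are pulled into driver memory,
    # so it should only be used on results known to be small
    rows = projected.collect()
    print(rows)
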
60 votes, 6 answers

PySpark create new column with mapping from a dict

Using Spark 1.6, I have a Spark DataFrame column (named, let's say, col1) with values A, B, C, DS, DNS, E, F, G and H. I want to create a new column (say col2) with the values from the dict here below. How do I map this? (e.g. 'A' needs to be mapped…
ad_s (1,560 reputation; 4 gold, 15 silver, 16 bronze badges)
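
One widely used technique for the dict-mapping question (not necessarily the one the answers settled on) is to build a literal map expression from the dict; the mapping below is made up for illustration:

    from itertools import chain
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("A",), ("DS",), ("Z",)], ["col1"])

    # Hypothetical mapping; keys absent from it produce NULL in col2
    mapping = {"A": "small", "DS": "double", "E": "other"}

    # Flatten the dict into alternating key/value literals and build a map column,
    # then look each col1 value up in that map
    mapping_expr = F.create_map([F.lit(x) for x in chain(*mapping.items())])
    df.withColumn("col2", mapping_expr[F.col("col1")]).show()
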
60 votes, 6 answers

Spark SQL Row_number() PartitionBy Sort Desc

I've successfully created a row_number() partitionBy in Spark using Window, but would like to sort it descending instead of the default ascending. Here is my working code: from pyspark import HiveContext from pyspark.sql.types import * from…
jKraut (2,325 reputation; 6 gold, 35 silver, 48 bronze badges)
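
For the descending row_number() question, the ordering is flipped inside the window definition; a minimal sketch with hypothetical column names:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["grp", "score"])

    # orderBy(F.col(...).desc()) replaces the default ascending sort within each partition
    w = Window.partitionBy("grp").orderBy(F.col("score").desc())
    df.withColumn("row_number", F.row_number().over(w)).show()
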
60 votes, 8 answers

Unpacking a list to select multiple columns from a spark data frame

I have a spark data frame df. Is there a way of sub selecting a few columns using a list of these columns? scala> df.columns res0: Array[String] = Array("a", "b", "c", "d") I know I can do something like df.select("b", "c"). But suppose I have a…
Ben (4,774 reputation; 5 gold, 22 silver, 26 bronze badges)
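
That question is asked in Scala (where the answers typically expand the list as varargs); purely for contrast, a PySpark analogue that unpacks a Python list of column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3, 4)], ["a", "b", "c", "d"])

    # Unpacking with * passes each name as a separate argument to select()
    wanted = ["b", "c"]
    df.select(*wanted).show()
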
60 votes, 7 answers

Filtering a spark dataframe based on date

I have a dataframe of (date, string, string) and I want to select dates before a certain period. I have tried the following with no luck: data.filter(data("date") < new java.sql.Date(format.parse("2015-03-14").getTime)) I'm getting an error stating the…
Steve (21,163 reputation; 21 gold, 69 silver, 92 bronze badges)
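
The code in that question is Scala; a hedged PySpark sketch of the same idea, comparing the date column against a literal cast to a date (column names and cut-off are illustrative):

    import datetime

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(datetime.date(2015, 3, 10), "x"), (datetime.date(2015, 3, 20), "y")],
        ["date", "value"],
    )

    # Compare against a literal cast to DateType instead of building java.sql.Date objects
    df.filter(F.col("date") < F.lit("2015-03-14").cast("date")).show()
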
60 votes, 2 answers

Which operations preserve RDD order?

An RDD has a meaningful order (as opposed to some random order imposed by the storage model) if it was processed by sortBy(), as explained in this reply. Now, which operations preserve that order? E.g., is it guaranteed that (after…
sds (58,617 reputation; 29 gold, 161 silver, 278 bronze badges)
60 votes, 5 answers

Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

I'm trying to implement in Spark a Hadoop Map/Reduce job that worked fine before. The Spark app definition is the following: val data = spark.textFile(file, 2).cache() val result = data .map(//some pre-processing) .map(docWeightPar =>…
Augusto (988 reputation; 1 gold, 11 silver, 19 bronze badges)
59 votes, 5 answers

Spark parquet partitioning : Large number of files

I am trying to leverage Spark partitioning. I was trying to do something like data.write.partitionBy("key").parquet("/location") The issue here is that each partition creates a huge number of parquet files, which results in slow reads if I am trying to read from…
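
A commonly suggested mitigation for this one (one option among several, depending on skew) is to repartition by the partition column before writing, so each key's directory ends up with far fewer files; paths and column names below are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

    # Repartitioning by "key" first means each partition directory is written by a single
    # task, which keeps the parquet file count per directory low (at the cost of a shuffle)
    (data.repartition("key")
         .write
         .mode("overwrite")
         .partitionBy("key")
         .parquet("/tmp/partitioned_output"))  # hypothetical output path
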
59 votes, 7 answers

Pyspark: Parse a column of json strings

I have a pyspark dataframe consisting of one column, called json, where each row is a unicode string of json. I'd like to parse each row and return a new dataframe where each row is the parsed json. # Sample Data Frame jstr1 =…
Steve (2,401 reputation; 3 gold, 24 silver, 28 bronze badges)
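
A minimal sketch of one way to do this, using from_json with an explicit schema; the field names are invented, and the answers also cover schema-inference variants:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('{"name": "a", "count": 1}',)], ["json"])

    # Parse each JSON string into a struct column, then flatten the struct into top-level columns
    schema = StructType([
        StructField("name", StringType()),
        StructField("count", LongType()),
    ])
    df.withColumn("parsed", F.from_json("json", schema)).select("parsed.*").show()
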
59 votes, 6 answers

Spark Error: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I have a dataframe in Spark in which one of the columns contains an array. Now, I have written a separate UDF which converts the array to another array with only distinct values in it. See example below: Ex: [24,23,27,23] should get converted to [24,…
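
This error usually means a UDF is returning NumPy scalar types that Spark's serializer cannot map to SQL types; a hedged sketch of the usual remedy, converting to plain Python ints before returning:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([([24, 23, 27, 23],)], ["values"])

    # If the UDF body produced numpy.int64 values, casting each element to a built-in
    # int before returning keeps the declared ArrayType(IntegerType()) serializable
    @F.udf(ArrayType(IntegerType()))
    def distinct_values(arr):
        return [int(x) for x in set(arr)]

    df.withColumn("distinct", distinct_values("values")).show(truncate=False)
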
59 votes, 4 answers

Generate a Spark StructType / Schema from a case class

If I wanted to create a StructType (i.e. a DataFrame.schema) out of a case class, is there a way to do it without creating a DataFrame? I can easily do: case class TestCase(id: Long) val schema = Seq[TestCase]().toDF.schema But it seems overkill to…
David Griffin (13,677 reputation; 5 gold, 47 silver, 65 bronze badges)
59 votes, 3 answers

Find maximum row per group in Spark DataFrame

I'm trying to use Spark dataframes instead of RDDs, since they appear to be more high-level than RDDs and tend to produce more readable code. In a 14-node Google Dataproc cluster, I have about 6 million names that are translated to IDs by two…
Quentin Pradet (4,691 reputation; 2 gold, 29 silver, 41 bronze badges)
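
Besides the window-function sketch shown for the similar groupBy-maximum question above, joining against the per-group maximum also works; a minimal sketch with hypothetical columns:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["name", "count"])

    # Compute each group's maximum, then join it back to keep only the matching rows
    maxima = df.groupBy("name").agg(F.max("count").alias("count"))
    df.join(maxima, on=["name", "count"]).show()
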
58 votes, 4 answers

Python Spark Cumulative Sum by Group Using DataFrame

How do I compute the cumulative sum per group, specifically using the DataFrame abstraction, in PySpark? With an example dataset as follows: df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], …
mr kw (1,977 reputation; 2 gold, 14 silver, 12 bronze badges)
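
A minimal sketch of one way to do this, using a window that spans from the start of each group up to the current row; the column names are placeholders for the unnamed columns in the example data:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
        ["time", "value", "class"],
    )

    # Sum "value" over all rows from the start of each "class" partition up to the current row
    w = (Window.partitionBy("class")
               .orderBy("time")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df.withColumn("cumsum", F.sum("value").over(w)).show()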