Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive use as well as iterative algorithms in machine learning or graph computing.
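
A minimal PySpark sketch of that pattern, using invented sample data, caching a dataset and querying it repeatedly from memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    # Tiny invented dataset; in practice this would come from HDFS, S3, etc.
    df = spark.createDataFrame(
        [("user_1", 3), ("user_1", 1), ("user_2", 5)],
        ["user_id", "score"],
    )

    # Mark the DataFrame for in-memory caching; it is materialized on first use.
    df.cache()

    # Both queries reuse the cached data instead of recomputing from the source.
    print(df.count())
    df.groupBy("user_id").sum("score").show()

    spark.stop()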

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries on streams, windowing over batch and streaming data, applying ML libraries to streamed data, and so on).
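
As a hedged illustration of the micro-batch approach, here is a minimal Structured Streaming word count (the socket source on localhost:9999 is an assumption for the example, e.g. fed by nc -lk 9999):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-example").getOrCreate()

    # Read a stream of lines from a socket source; Kafka, files, etc. work the same way.
    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Running word count over the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Micro-batch execution, printing each updated result table to the console.
    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
    query.awaitTermination()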

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since different versions often behave differently). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
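
A rough sketch of what such a minimal reproducible example might look like (the sample data and transformation are invented placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()
    print(spark.version)  # state the Spark version you are running

    # Small inline sample data that reproduces the problem.
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])
    df.show()

    # The exact transformation that behaves unexpectedly, plus the output
    # you expected versus what you actually get.
    df.groupBy("id").count().show()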

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
70
votes
2 answers

Errors when using OFF_HEAP Storage with Spark 1.4.0 and Tachyon 0.6.4

I am trying to persist my RDD using off heap storage on spark 1.4.0 and tachyon 0.6.4 doing it like this : val a = sqlContext.parquetFile("a1.parquet") a.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) a.count() Afterwards I am getting the…
qwertz1123
  • 1,173
  • 10
  • 27
70
votes
7 answers

Apache Spark logging within Scala

I am looking for a solution to be able to log additional data when executing code on Apache Spark Nodes that could help investigate later some issues that might appear during execution. Trying to use a traditional solution like for example…
Bogdan N
  • 741
  • 1
  • 6
  • 9
69
votes
1 answer

Spark load data and add filename as dataframe column

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "false")\ .option("mode",…
yee379
  • 6,498
  • 10
  • 56
  • 101
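
For the question above, one common approach (a sketch, not necessarily the accepted answer) is the input_file_name function, which records each row's source file; the input path and options are illustrative placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.appName("filename-column").getOrCreate()

    # Read the files and attach each row's source file as a new column.
    df = (
        spark.read
        .option("delimiter", "\t")
        .option("header", "false")
        .csv("data/*.tsv")  # placeholder input path
        .withColumn("filename", input_file_name())
    )
    df.show(truncate=False)
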
69
votes
6 answers

Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below: user_id object_id score user_1 object_1 3 user_1 object_1 1 user_1 object_2 2 user_2 object_1 5 user_2 object_2 2 user_2 object_2 6 What I expect is returning 2 records in each group…
KAs
  • 1,818
  • 4
  • 19
  • 37
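
One common approach to the question above (a sketch, not necessarily the accepted answer) is a window function that ranks rows within each group and keeps the top 2:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, row_number

    spark = SparkSession.builder.appName("top-n-per-group").getOrCreate()

    df = spark.createDataFrame(
        [("user_1", "object_1", 3), ("user_1", "object_1", 1),
         ("user_1", "object_2", 2), ("user_2", "object_1", 5),
         ("user_2", "object_2", 2), ("user_2", "object_2", 6)],
        ["user_id", "object_id", "score"],
    )

    # Rank rows within each user_id by descending score, then keep the top 2.
    w = Window.partitionBy("user_id").orderBy(col("score").desc())
    df.withColumn("rank", row_number().over(w)).filter(col("rank") <= 2).show()
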
69
votes
5 answers

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I would only do it if another column (Survived) has the value 0 for the corresponding row where it is blank for Age. If it is 1 in the Survived…
sjishan
  • 3,392
  • 9
  • 29
  • 53
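
A sketch of one common answer to the question above: combine the conditions with & (each wrapped in parentheses) inside when; the fill value 30.0 is an arbitrary placeholder:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when

    spark = SparkSession.builder.appName("when-multiple-conditions").getOrCreate()

    df = spark.createDataFrame(
        [(0, 45.0), (0, None), (1, None)],
        ["Survived", "Age"],
    )

    # Only fill Age where it is blank AND Survived is 0; leave other rows unchanged.
    df = df.withColumn(
        "Age",
        when((col("Survived") == 0) & (col("Age").isNull()), 30.0)
        .otherwise(col("Age")),
    )
    df.show()
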
69
votes
2 answers

How DAG works under the covers in RDD?

The Spark research paper has prescribed a new distributed programming model over classic Hadoop MapReduce, claiming the simplification and vast performance boost in many cases, especially on Machine Learning. However, the material to uncover the…
sof
  • 9,113
  • 16
  • 57
  • 83
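
As a small, hedged illustration (not taken from the paper), the DAG/lineage that Spark builds from a chain of transformations can be inspected with toDebugString; map is a narrow dependency, while reduceByKey introduces a shuffle and hence a stage boundary:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-lineage").getOrCreate()
    sc = spark.sparkContext

    rdd = (
        sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
        .map(lambda kv: (kv[0], kv[1] * 10))   # narrow dependency
        .reduceByKey(lambda x, y: x + y)       # wide dependency (shuffle)
    )

    # Print the lineage / DAG Spark will execute for this RDD.
    lineage = rdd.toDebugString()
    print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
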
69
votes
10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs dependent on the key using Spark in a single Job. Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…
samthebest
  • 30,803
  • 25
  • 102
  • 142
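
With the DataFrame API (an alternative sketch, not the RDD-based approach the question asks about), a single job can write one directory per key via partitionBy; the output path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-by-key").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"])

    # Produces one sub-directory per distinct key, e.g. prefix/key=1/, prefix/key=2/.
    df.write.partitionBy("key").mode("overwrite").csv("prefix")
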
68
votes
3 answers

Convert a spark DataFrame to pandas DF

Is there a way to convert a Spark Df (not RDD) to pandas DF I tried the following: var some_df = Seq( ("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no") ).toDF( "user_id", "phone_number") Code: %pyspark pandas_df =…
data_person
  • 4,194
  • 7
  • 40
  • 75
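
A sketch of the usual approach (not necessarily the accepted answer), building the DataFrame on the Python side: toPandas() collects the whole Spark DataFrame to the driver as a pandas DataFrame, so it only suits data that fits in driver memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("to-pandas").getOrCreate()

    some_df = spark.createDataFrame(
        [("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")],
        ["user_id", "phone_number"],
    )

    # Collects all rows to the driver; only safe for reasonably small results.
    pandas_df = some_df.toPandas()
    print(pandas_df)
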
68
votes
13 answers

Convert date from String to Date format in Dataframes

I am trying to convert a column which is in String format to Date format using the to_date function but it's returning Null values. df.createOrReplaceTempView("incidents") spark.sql("select Date from incidents").show() +----------+ | …
Ishan Kumar
  • 1,941
  • 3
  • 20
  • 29
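
A common cause of the nulls in the question above is a date pattern mismatch; here is a sketch passing the source format explicitly (the pattern dd/MM/yyyy is an assumption, and the two-argument to_date needs Spark 2.2+):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("string-to-date").getOrCreate()

    df = spark.createDataFrame([("26/10/2016",), ("01/11/2016",)], ["Date"])

    # to_date returns null when the string does not match the expected pattern,
    # so pass the actual format of the source column explicitly.
    df = df.withColumn("Date", to_date(col("Date"), "dd/MM/yyyy"))
    df.show()
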
68
votes
7 answers

Spark Driver in Apache spark

I already have a cluster of 3 machines (ubuntu1,ubuntu2,ubuntu3 by VM virtualbox) running Hadoop 1.0.0. I installed spark on each of these machines. ub1 is my master node and the other nodes are working as slave. My question is what exactly a spark…
user3789843
  • 1,009
  • 2
  • 11
  • 18
67
votes
3 answers

Difference in Used, Committed and Max Heap Memory

I am monitoring a spark executor JVM of a OutOfMemoryException. I used Jconsole to connect to executor JVM. Following is the snapshot of Jconsole: In the image used memory is shown as 3.8G and committed memory is 8.6G and Max memory is also…
Alok
  • 1,374
  • 3
  • 18
  • 44
67
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5 node Spark cluster on AWS EMR each sized m3.xlarge (1 master 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108
  • 1,381
  • 1
  • 13
  • 22
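
A commonly suggested mitigation for that error (a hedged sketch, not necessarily the accepted fix) is to leave YARN more off-heap headroom by raising the executor memory overhead; the property name and values below are version-dependent placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("yarn-memory-overhead")
        .config("spark.executor.memory", "8g")
        # Spark <= 2.2 uses spark.yarn.executor.memoryOverhead (value in MB);
        # Spark 2.3+ renamed it to spark.executor.memoryOverhead.
        .config("spark.yarn.executor.memoryOverhead", "2048")
        .getOrCreate()
    )
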
67
votes
3 answers

Difference between == and === in Scala, Spark

I am from a Java background and new to Scala. I am using Scala and Spark. But I'm not able to understand where I use == and ===. Could anyone let me know in which scenario I need to use these two operators, and what's the difference between == and…
Avijit
  • 1,770
  • 5
  • 16
  • 34
67
votes
5 answers

Spark unionAll multiple dataframes

For a set of dataframes val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x") val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y") val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z") to union all of them…
echo
  • 1,241
  • 1
  • 13
  • 16
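
A sketch of one common approach in PySpark (the question itself is Scala, where Seq(df1, df2, df3).reduce(_ union _) plays the same role); note that union requires all inputs to share the same schema:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("union-many").getOrCreate()

    df1 = spark.createDataFrame([(i, i * 10) for i in range(1, 5)], ["id", "x"])
    df2 = spark.createDataFrame([(i, i * 100) for i in range(1, 5)], ["id", "x"])
    df3 = spark.createDataFrame([(i, i * 1000) for i in range(1, 5)], ["id", "x"])

    # Fold the list into a single DataFrame; unionAll was renamed to union in Spark 2.0.
    combined = reduce(DataFrame.union, [df1, df2, df3])
    combined.show()
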
67
votes
9 answers

DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API. Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic),…
Sim
  • 13,147
  • 9
  • 66
  • 95
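
A sketch of one order-insensitive comparison (an illustrative assumption, not the accepted answer; exceptAll requires Spark 2.4+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-equality").getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(2, "b"), (1, "a")], ["id", "value"])

    def frames_equal(a, b):
        """Order-insensitive, duplicate-sensitive equality of two DataFrames."""
        # exceptAll keeps duplicates, so both directions must come back empty.
        return (
            a.schema == b.schema
            and a.exceptAll(b).count() == 0
            and b.exceptAll(a).count() == 0
        )

    print(frames_equal(df1, df2))  # True: same rows, different order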