Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive use as well as iterative algorithms in machine learning or graph computing.
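
A minimal PySpark sketch of that pattern, using invented sample data, caching a dataset and querying it repeatedly from memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    # Tiny invented dataset; in practice this would come from HDFS, S3, etc.
    df = spark.createDataFrame(
        [("user_1", 3), ("user_1", 1), ("user_2", 5)],
        ["user_id", "score"],
    )

    # Mark the DataFrame for in-memory caching; it is materialized on first use.
    df.cache()

    # Both queries reuse the cached data instead of recomputing from the source.
    print(df.count())
    df.groupBy("user_id").sum("score").show()

    spark.stop()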

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries on streams, windowing over batch and streaming data, applying ML libraries to streamed data, and so on).
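
As a hedged illustration of the micro-batch approach, here is a minimal Structured Streaming word count (the socket source on localhost:9999 is an assumption for the example, e.g. fed by nc -lk 9999):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-example").getOrCreate()

    # Read a stream of lines from a socket source; Kafka, files, etc. work the same way.
    lines = (
        spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
    )

    # Running word count over the stream.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Micro-batch execution, printing each updated result table to the console.
    query = (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
    )
    query.awaitTermination()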

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (since different versions often behave differently). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
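
A rough sketch of what such a minimal reproducible example might look like (the sample data and transformation are invented placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()
    print(spark.version)  # state the Spark version you are running

    # Small inline sample data that reproduces the problem.
    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])
    df.show()

    # The exact transformation that behaves unexpectedly, plus the output
    # you expected versus what you actually get.
    df.groupBy("id").count().show()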

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
70
votes
2 answers

Errors when using OFF_HEAP Storage with Spark 1.4.0 and Tachyon 0.6.4

I am trying to persist my RDD using off heap storage on spark 1.4.0 and tachyon 0.6.4 doing it like this : val a = sqlContext.parquetFile("a1.parquet") a.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP) a.count() Afterwards I am getting the…
qwertz1123
  • 1,173
  • 10
  • 27
70
votes
7 answers

Apache Spark logging within Scala

I am looking for a solution to be able to log additional data when executing code on Apache Spark Nodes that could help investigate later some issues that might appear during execution. Trying to use a traditional solution like for example…
Bogdan N
  • 741
  • 1
  • 6
  • 9
69
votes
1 answer

Spark load data and add filename as dataframe column

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "false")\ .option("mode",…
yee379
  • 6,498
  • 10
  • 56
  • 101
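
For the question above, one common approach (a sketch, not necessarily the accepted answer) is the input_file_name function, which records each row's source file; the input path and options are illustrative placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import input_file_name

    spark = SparkSession.builder.appName("filename-column").getOrCreate()

    # Read the files and attach each row's source file as a new column.
    df = (
        spark.read
        .option("delimiter", "\t")
        .option("header", "false")
        .csv("data/*.tsv")  # placeholder input path
        .withColumn("filename", input_file_name())
    )
    df.show(truncate=False)
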
69
votes
6 answers

Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below: user_id object_id score user_1 object_1 3 user_1 object_1 1 user_1 object_2 2 user_2 object_1 5 user_2 object_2 2 user_2 object_2 6 What I expect is returning 2 records in each group…
KAs
  • 1,818
  • 4
  • 19
  • 37
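
One common approach to the question above (a sketch, not necessarily the accepted answer) is a window function that ranks rows within each group and keeps the top 2:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql.functions import col, row_number

    spark = SparkSession.builder.appName("top-n-per-group").getOrCreate()

    df = spark.createDataFrame(
        [("user_1", "object_1", 3), ("user_1", "object_1", 1),
         ("user_1", "object_2", 2), ("user_2", "object_1", 5),
         ("user_2", "object_2", 2), ("user_2", "object_2", 6)],
        ["user_id", "object_id", "score"],
    )

    # Rank rows within each user_id by descending score, then keep the top 2.
    w = Window.partitionBy("user_id").orderBy(col("score").desc())
    df.withColumn("rank", row_number().over(w)).filter(col("rank") <= 2).show()
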
69
votes
5 answers

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where currently it is blank and I would only do it if another column (Survived) has the value 0 for the corresponding row where it is blank for Age. If it is 1 in the Survived…
sjishan
  • 3,392
  • 9
  • 29
  • 53
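
A sketch of one common answer to the question above: combine the conditions with & (each wrapped in parentheses) inside when; the fill value 30.0 is an arbitrary placeholder:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when

    spark = SparkSession.builder.appName("when-multiple-conditions").getOrCreate()

    df = spark.createDataFrame(
        [(0, 45.0), (0, None), (1, None)],
        ["Survived", "Age"],
    )

    # Only fill Age where it is blank AND Survived is 0; leave other rows unchanged.
    df = df.withColumn(
        "Age",
        when((col("Survived") == 0) & (col("Age").isNull()), 30.0)
        .otherwise(col("Age")),
    )
    df.show()
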
69
votes
2 answers

How DAG works under the covers in RDD?

The Spark research paper has prescribed a new distributed programming model over classic Hadoop MapReduce, claiming the simplification and vast performance boost in many cases, especially on Machine Learning. However, the material to uncover the…
sof
  • 9,113
  • 16
  • 57
  • 83
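
As a small, hedged illustration (not taken from the paper), the DAG/lineage that Spark builds from a chain of transformations can be inspected with toDebugString; map is a narrow dependency, while reduceByKey introduces a shuffle and hence a stage boundary:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dag-lineage").getOrCreate()
    sc = spark.sparkContext

    rdd = (
        sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
        .map(lambda kv: (kv[0], kv[1] * 10))   # narrow dependency
        .reduceByKey(lambda x, y: x + y)       # wide dependency (shuffle)
    )

    # Print the lineage / DAG Spark will execute for this RDD.
    lineage = rdd.toDebugString()
    print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)
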
69
votes
10 answers

Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs dependent on the key using Spark in a single Job. Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job E.g. sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"))) .writeAsMultiple(prefix,…
samthebest
  • 30,803
  • 25
  • 102
  • 142
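
With the DataFrame API (an alternative sketch, not the RDD-based approach the question asks about), a single job can write one directory per key via partitionBy; the output path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-by-key").getOrCreate()

    df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "value"])

    # Produces one sub-directory per distinct key, e.g. prefix/key=1/, prefix/key=2/.
    df.write.partitionBy("key").mode("overwrite").csv("prefix")
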
68
votes
3 answers

Convert a spark DataFrame to pandas DF

Is there a way to convert a Spark Df (not RDD) to pandas DF I tried the following: var some_df = Seq( ("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no") ).toDF( "user_id", "phone_number") Code: %pyspark pandas_df =…
data_person
  • 4,194
  • 7
  • 40
  • 75
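
A sketch of the usual approach (not necessarily the accepted answer), building the DataFrame on the Python side: toPandas() collects the whole Spark DataFrame to the driver as a pandas DataFrame, so it only suits data that fits in driver memory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("to-pandas").getOrCreate()

    some_df = spark.createDataFrame(
        [("A", "no"), ("B", "yes"), ("B", "yes"), ("B", "no")],
        ["user_id", "phone_number"],
    )

    # Collects all rows to the driver; only safe for reasonably small results.
    pandas_df = some_df.toPandas()
    print(pandas_df)
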
68
votes
13 answers

Convert date from String to Date format in Dataframes

I am trying to convert a column which is in String format to Date format using the to_date function but it's returning Null values. df.createOrReplaceTempView("incidents") spark.sql("select Date from incidents").show() +----------+ | …
Ishan Kumar
  • 1,941
  • 3
  • 20
  • 29
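
A common cause of the nulls in the question above is a date pattern mismatch; here is a sketch passing the source format explicitly (the pattern dd/MM/yyyy is an assumption, and the two-argument to_date needs Spark 2.2+):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("string-to-date").getOrCreate()

    df = spark.createDataFrame([("26/10/2016",), ("01/11/2016",)], ["Date"])

    # to_date returns null when the string does not match the expected pattern,
    # so pass the actual format of the source column explicitly.
    df = df.withColumn("Date", to_date(col("Date"), "dd/MM/yyyy"))
    df.show()
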
68
votes
7 answers

Spark Driver in Apache spark

I already have a cluster of 3 machines (ubuntu1,ubuntu2,ubuntu3 by VM virtualbox) running Hadoop 1.0.0. I installed spark on each of these machines. ub1 is my master node and the other nodes are working as slave. My question is what exactly a spark…
user3789843
  • 1,009
  • 2
  • 11
  • 18
67
votes
3 answers

Difference in Used, Committed and Max Heap Memory

I am monitoring a spark executor JVM of a OutOfMemoryException. I used Jconsole to connect to executor JVM. Following is the snapshot of Jconsole: In the image used memory is shown as 3.8G and committed memory is 8.6G and Max memory is also…
Alok
  • 1,374
  • 3
  • 18
  • 44
67
votes
5 answers

"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

I'm running a 5 node Spark cluster on AWS EMR each sized m3.xlarge (1 master 4 slaves). I successfully ran through a 146Mb bzip2 compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5GB bzip2 CSV file on…
lauri108
  • 1,381
  • 1
  • 13
  • 22
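
A commonly suggested mitigation for that error (a hedged sketch, not necessarily the accepted fix) is to leave YARN more off-heap headroom by raising the executor memory overhead; the property name and values below are version-dependent placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("yarn-memory-overhead")
        .config("spark.executor.memory", "8g")
        # Spark <= 2.2 uses spark.yarn.executor.memoryOverhead (value in MB);
        # Spark 2.3+ renamed it to spark.executor.memoryOverhead.
        .config("spark.yarn.executor.memoryOverhead", "2048")
        .getOrCreate()
    )
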
67
votes
3 answers

Difference between == and === in Scala, Spark

I am from a Java background and new to Scala. I am using Scala and Spark. But I'm not able to understand where I use == and ===. Could anyone let me know in which scenario I need to use these two operators, and what's the difference between == and…
Avijit
  • 1,770
  • 5
  • 16
  • 34
67
votes
5 answers

Spark unionAll multiple dataframes

For a set of dataframes val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x") val df2 = sc.parallelize(1 to 4).map(i => (i,i*100)).toDF("id","y") val df3 = sc.parallelize(1 to 4).map(i => (i,i*1000)).toDF("id","z") to union all of them…
echo
  • 1,241
  • 1
  • 13
  • 16
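
A sketch of one common approach in PySpark (the question itself is Scala, where Seq(df1, df2, df3).reduce(_ union _) plays the same role); note that union requires all inputs to share the same schema:

    from functools import reduce
    from pyspark.sql import DataFrame, SparkSession

    spark = SparkSession.builder.appName("union-many").getOrCreate()

    df1 = spark.createDataFrame([(i, i * 10) for i in range(1, 5)], ["id", "x"])
    df2 = spark.createDataFrame([(i, i * 100) for i in range(1, 5)], ["id", "x"])
    df3 = spark.createDataFrame([(i, i * 1000) for i in range(1, 5)], ["id", "x"])

    # Fold the list into a single DataFrame; unionAll was renamed to union in Spark 2.0.
    combined = reduce(DataFrame.union, [df1, df2, df3])
    combined.show()
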
67
votes
9 answers

DataFrame equality in Apache Spark

Assume df1 and df2 are two DataFrames in Apache Spark, computed using two different mechanisms, e.g., Spark SQL vs. the Scala/Java/Python API. Is there an idiomatic way to determine whether the two data frames are equivalent (equal, isomorphic),…
Sim
  • 13,147
  • 9
  • 66
  • 95
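
A sketch of one order-insensitive comparison (an illustrative assumption, not the accepted answer; exceptAll requires Spark 2.4+):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-equality").getOrCreate()

    df1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df2 = spark.createDataFrame([(2, "b"), (1, "a")], ["id", "value"])

    def frames_equal(a, b):
        """Order-insensitive, duplicate-sensitive equality of two DataFrames."""
        # exceptAll keeps duplicates, so both directions must come back empty.
        return (
            a.schema == b.schema
            and a.exceptAll(b).count() == 0
            and b.exceptAll(a).count() == 0
        )

    print(frames_equal(df1, df2))  # True: same rows, different order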