Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
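
As a small illustration of that pattern (the SparkContext sc and the file path below are assumed purely for the example), an RDD can be cached so that repeated queries are served from memory:

    // Load a text file, cache the filtered RDD, then query it several times.
    val lines  = sc.textFile("hdfs:///logs/app.log")                  // illustrative path
    val errors = lines.filter(_.contains("ERROR")).cache()            // kept in memory after the first action

    val totalErrors   = errors.count()                                // materializes and caches the RDD
    val timeoutErrors = errors.filter(_.contains("timeout")).count()  // answered from the cached data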

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
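
For example, a minimal Structured Streaming word count (the socket source and its host/port are placeholders chosen only for illustration) processes the stream in micro-batches:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
    import spark.implicits._

    // Read lines from a socket source; host and port are placeholders.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split each line into words and keep a running count per word.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Print the updated counts to the console after every micro-batch.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()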

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
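
For instance, bin/spark-shell starts a Scala REPL with a SparkContext already bound to sc, so an exploratory query takes only a couple of lines (the file path is illustrative):

    // Inside spark-shell: sc is predefined.
    val requests = sc.textFile("hdfs:///logs/access.log")   // illustrative path
    requests.filter(_.contains(" 404 ")).count()            // count 404 responses interactively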

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
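
Whatever the storage system, the read API looks the same; as a sketch (the paths and the availability of an S3 connector are assumptions, not working settings), a single session can pull from several sources:

    // The unified read API selects the storage system through the URI scheme.
    val logs   = spark.read.textFile("hdfs:///data/logs/")        // HDFS
    val events = spark.read.json("s3a://my-bucket/events/")       // S3, assuming the hadoop-aws connector is on the classpath
    val users  = spark.read.parquet("/mnt/shared/users.parquet")  // local or mounted file system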

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
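
In practice a reproducible example means small hard-coded input data, the exact code you ran, and the output you got versus the output you expected; a sketch of that shape (the column names and values are made up):

    // Self-contained input instead of a reference to private data.
    import spark.implicits._
    val df = Seq(("a", 1), ("b", 2), ("b", 3)).toDF("key", "value")

    // The exact transformation being asked about.
    df.groupBy("key").sum("value").show()
    // Observed/expected output (row order may vary):
    // +---+----------+
    // |key|sum(value)|
    // +---+----------+
    // |  a|         1|
    // |  b|         5|
    // +---+----------+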

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

81095 questions
17 votes, 2 answers

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting…
17 votes, 1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created Term Frequency using HashingTF in Spark. I have got the term frequencies using tf.transform for each word. But the results are showing in this format. [,
Srini
17 votes, 3 answers

Spark Launcher waiting for job completion infinitely

I am trying to submit a JAR with Spark job into the YARN cluster from Java code. I am using SparkLauncher to submit SparkPi example: Process spark = new SparkLauncher() …
TomaszGuzialek
17 votes, 1 answer

Usage of spark DataFrame "as" method

I am looking at spark.sql.DataFrame documentation. There is def as(alias: String): DataFrame Returns a new DataFrame with an alias set. Since 1.3.0 What is the purpose of this method? How is it used? Can there be an example? I have…
Prikso NAI
17 votes, 3 answers

Automatically including jars to PySpark classpath

I'm trying to automatically include jars to my PySpark classpath. Right now I can type the following command and it works: $ pyspark --jars /path/to/my.jar I'd like to have that jar included by default so that I can only type pyspark and also use…
Kamil Sindi
17 votes, 3 answers

Spark off heap memory leak on Yarn with Kafka direct stream

I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with java 1.8.0_45 and also Kafka direct stream. I am also using spark with scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually…
17 votes, 1 answer

How do you perform basic joins of two RDD tables in Spark using Python?

How would you perform basic joins in Spark using python? In R you could use merge() to do this. What is the syntax using python on spark for: Inner Join Left Outer Join Cross Join With two tables (RDD) with a single column in each that has a…
invoketheshell
17 votes, 2 answers

How to use spark Java API to read binary file stream from HDFS?

I am writing a component which needs to get the new binary file in a specific HDFS path, so that I can do some online learning based on this data. So, I want to read binary file created by Flume from HDFS in stream. I found several functions…
Ying Tan
17 votes, 2 answers

Apache Spark: get elements of Row by name

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff: def foo(r: Row) = { val ix = (0 until…
Ken Williams
17 votes, 2 answers

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

I'm writing a spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm. For example, there is one LogisticRegression in org.apache.spark.ml.classification also a…
ailzhang
17 votes, 1 answer

How to use constant value in UDF of Spark SQL(DataFrame)

I have a dataframe which includes timestamp. To aggregate by time(minute, hour, or day), I have tried as: val toSegment = udf((timestamp: String) => { val asLong = timestamp.toLong asLong - asLong % 3600000 // period = 1 hour }) val df:…
emesday
17 votes, 5 answers

PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature…
Bryan
17 votes, 4 answers

How to get the number of elements in partition?

Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition. Something like this: Rdd.partitions().get(index).size() Except I don't see such an API for spark. Any ideas?…
Geo
17 votes, 1 answer

How does Apache Spark know about HDFS data nodes?

Imagine I do some Spark operations on a file hosted in HDFS. Something like this: var file = sc.textFile("hdfs://...") val items = file.map(_.split('\t')) ... Because in the Hadoop world the code should go where the data is, right? So my question…
Frizz
17 votes, 7 answers

java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags

I am a beginner in Spark streaming and Scala. For a project requirement I was trying to run TwitterPopularTags example present in github. As SBT assembly was not working for me and I was not familiar with SBT I am trying to use Maven for building.…
vpv