Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark often are related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimizing arbitrary long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like hadoop.

Spark is not tied to the two-stage mapreduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on ...).

To make programming faster, Spark provides clean, concise APIs in scala, java, python and r. You can also use Spark interactively from the scala, python and r shells to rapidly query big datasets.

Spark runs on yarn, mesos, kubernetes, standalone, or in the cloud. It can access diverse data sources including hdfs, cassandra, hbase, amazon-s3 and google-cloud-platform.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since different versions can often disagree). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

Spark Documentation
Spark Programming Guide - Shows each of these features in each of Spark’s supported languages (Python, Scala, and Java)
Spark-Summit Past Events Online materials of spark training courses and keynotes (please refer to the "PAST EVENTS" tab at the top)
Awesome Spark - Awesome collection of resources by GitHub Apache Spark Community
Mastering Apache Spark 2 - Notes on the internals of Apache Spark, Spark SQL and Spark MLlib
Learning Spark - Lightning-Fast big data analysis
AMP Camp 6 (Berkeley, CA, November 19-20, 2015)
AMP Camp 5 (Berkeley, CA, November 20-21, 2014)
AMP Camp 4 (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, and Tachyon
AMP Camp 3 (Berkeley, CA, Aug 2013)
AMP Camp 2 (Strata Santa Clara, Feb 2013)
AMP Camp 1 (Berkeley, CA, Aug 2012)

81095 questions

votes

3 answers

Spark, Scala, DataFrame: create feature vectors

I have a DataFrame that looks like follow: userID, category, frequency 1,cat1,1 1,cat2,3 1,cat9,5 2,cat4,6 2,cat9,2 2,cat10,1 3,cat1,5 3,cat7,16 3,cat8,2 The number of distinct categories is 10, and I would like to create a feature vector for each…

scala apache-spark apache-spark-sql apache-spark-ml

asked Nov 23 '15 at 08:45

Rami

8,044
18
66
108

votes

7 answers

How to import pyspark in anaconda

I am trying to import and use pyspark with anaconda. After installing spark, and setting the $SPARK_HOME variable I tried: $ pip install pyspark This won't work (of course) because I discovered that I need to tel python to look for pyspark under…

python apache-spark anaconda pyspark

asked Nov 19 '15 at 20:43

farhawa

10,120
16
49
91

votes

5 answers

Spark - Container is running beyond physical memory limits

I have a cluster of two worker nodes. Worker_Node_1 - 64GB RAM Worker_Node_2 - 32GB RAM Background Summery : I am trying to execute spark-submit on yarn-cluster to run Pregel on a Graph to calculate the shortest path distances from one source…

hadoop apache-spark spark-graphx

asked Nov 17 '15 at 14:34

mn0102

votes

1 answer

when to use mapParitions and mapPartitionsWithIndex?

The PySpark documentation describes two functions: mapPartitions(f, preservesPartitioning=False) Return a new RDD by applying a function to each partition of this RDD. >>> rdd = sc.parallelize([1, 2, 3, 4], 2) >>> def f(iterator): yield…

apache-spark pyspark

asked Nov 11 '15 at 17:09

Chris Snow

23,813
35
144
309

votes

1 answer

Preserve index-string correspondence spark string indexer

Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondences between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using…

python apache-spark apache-spark-sql pyspark apache-spark-ml

asked Nov 10 '15 at 18:24

moustachio

2,924
3
36
68

votes

1 answer

Why spark.ml don't implement any of spark.mllib algorithms?

Following the Spark MLlib Guide we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs. spark.ml, built on top of Dataframes. According to this and this question on StackOverflow, Dataframes are better (and…

machine-learning apache-spark pyspark apache-spark-mllib apache-spark-ml

asked Oct 20 '15 at 12:47

Paladini

4,522
15
53
96

votes

1 answer

Is it possible to access estimator attributes in spark.ml pipelines?

I have a spark.ml pipeline in Spark 1.5.1 which consists of a series of transformers followed by a k-means estimator. I want to be able to access the KMeansModel.clusterCenters after fitting the pipeline, but can't figure out how. Is there a…

scala apache-spark pipeline apache-spark-ml

asked Oct 19 '15 at 17:04

hilarious

votes

4 answers

Where is the Spark UI on Google Dataproc?

What port should I use to access the Spark UI on Google Dataproc? I tried port 4040 and 7077 as well as a bunch of other ports I found using netstat -pln Firewall is properly configured.

apache-spark google-cloud-dataproc

asked Oct 18 '15 at 00:35

BAR

15,909
27
97
185

votes

3 answers

Extract document-topic matrix from Pyspark LDA Model

I have successfully trained an LDA model in spark, via the Python API: from pyspark.mllib.clustering import LDA model=LDA.train(corpus,k=10) This works completely fine, but I now need the document-topic matrix for the LDA model, but as far as I can…

python apache-spark pyspark lda

asked Oct 12 '15 at 02:37

moustachio

2,924
3
36
68

votes

7 answers

How to randomly sample from a Scala list or array?

I want to randomly sample from a Scala list or array (not an RDD), the sample size can be much longer than the length of the list or array, how can I do this efficiently? Because the sample size can be very big and the sampling (on different…

arrays list scala apache-spark sample

asked Oct 04 '15 at 09:59

Carter

1,563
8
23
32

votes

4 answers

How to zip two (or more) DataFrame in Spark

scala apache-spark dataframe apache-spark-sql

asked Oct 01 '15 at 08:08

worldterminator

2,968
6
33
52

votes

2 answers

How to filter one spark dataframe against another dataframe

I'm trying to filter one dataframe against another: scala> val df1 = sc.parallelize((1 to 100).map(a=>(s"user $a", a*0.123, a))).toDF("name", "score", "user_id") scala> val df2 = sc.parallelize(List(2,3,4,5,6)).toDF("valid_id") Now I want to filter…

scala apache-spark apache-spark-sql

asked Sep 18 '15 at 23:46

polo

1,352
2
16
35

votes

3 answers

How to change SparkContext properties in Interactive PySpark session

How can I change spark.driver.maxResultSize in pyspark interactive shell? I have used the following code from pyspark import SparkConf, SparkContext conf = (SparkConf() .set("spark.driver.maxResultSize",…

python apache-spark pyspark

asked Sep 02 '15 at 20:45

MARK

2,302
4
25
44

votes

3 answers

Spark - Creating Nested DataFrame

I'm starting with PySpark and I'm having troubles with creating DataFrames with nested objects. This is my example. I have users. $ cat user.json {"id":1,"name":"UserA"} {"id":2,"name":"UserB"} Users have orders. $ cat…

python apache-spark dataframe pyspark apache-spark-sql

asked Aug 10 '15 at 12:18

Maciek Bryński

votes

4 answers

What is the correct way to start/stop spark streaming jobs in yarn?

I have been experimenting and googling for many hours, with no luck. I have a spark streaming app that runs fine in a local spark cluster. Now I need to deploy it on cloudera 5.4.4. I need to be able to start it, have it run in the background…

hadoop apache-spark spark-streaming hadoop-yarn cloudera

asked Jul 28 '15 at 18:25

Kevin Pauli

8,577
15
49
70

Prev 1 2 3

…

99 100 Next