Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed data sets for both batch and streaming processing. Typical use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction, which can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
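
As a minimal sketch of that pattern in the Scala shell (spark-shell predefines the SparkContext as `sc`; the input path here is hypothetical):

    // Load once into cluster memory, then query repeatedly without re-reading from disk.
    val events = sc.textFile("hdfs:///data/events.log") // hypothetical path
      .map(_.toLowerCase)
      .cache()                                          // keep this RDD in memory across actions

    val total  = events.count()                              // first action materializes the cache
    val errors = events.filter(_.contains("error")).count()  // served from the cached partitions

Only the first action pays the disk read; every later query runs against the cached partitions, which is what makes iterative algorithms cheap.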

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on.
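
As one hedged illustration of the micro-batch approach, here is a minimal Structured Streaming word count over Spark's built-in socket source and console sink (both ship with Spark as testing utilities; the host and port are placeholder values):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("stream-sketch").getOrCreate()
    import spark.implicits._

    // Read lines from a test socket; each micro-batch is processed incrementally.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost") // placeholder test endpoint
      .option("port", 9999)
      .load()

    // Running word count, updated as new data arrives.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    counts.writeStream
      .outputMode("complete") // emit the full updated table each micro-batch
      .format("console")
      .start()
      .awaitTermination()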

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
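
The canonical word count gives a feel for how concise the core API is; this sketch assumes the Scala shell, where `sc` is predefined, and a hypothetical input path:

    // Word count in four transformations.
    val counts = sc.textFile("hdfs:///data/book.txt") // hypothetical input
      .flatMap(_.split("\\s+"))                       // split lines into words
      .map(word => (word, 1))                         // pair each word with a count of 1
      .reduceByKey(_ + _)                             // sum the counts per word

    counts.take(10).foreach(println)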

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
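
For illustration, switching storage back-ends is usually just a change of URI scheme (the hostname and bucket below are made up, and the s3a scheme additionally needs the hadoop-aws package on the classpath):

    // Same API, different storage back-ends; `spark` is the shell's SparkSession (Spark 2+).
    val fromHdfs = spark.read.text("hdfs://namenode:8020/logs/app.log") // hypothetical HDFS path
    val fromS3   = spark.read.text("s3a://my-bucket/logs/app.log")      // hypothetical S3 bucket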

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (behavior often differs across versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
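
A sketch of what such a minimal example might look like: inline data small enough to paste, the exact transformation in question, and the version stated up front (all names here are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("mvce").master("local[*]").getOrCreate()
    import spark.implicits._

    println(spark.version) // always state the Spark version you are running

    // A tiny, self-contained dataset that reproduces the question.
    val df = Seq((1, "a"), (2, "b"), (2, "c")).toDF("id", "value")

    // The exact operation being asked about, plus its observed output.
    df.groupBy("id").count().show()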

Recommended reference sources:

Latest version
Release Notes for Stable Releases
Apache Spark GitHub Repository

81095 questions
16 votes, 5 answers

Spark UI on AWS EMR

I am running an AWS EMR cluster with Spark (1.3.1) installed via the EMR console dropdown. Spark is current and processing data, but I am trying to find which port has been assigned to the WebUI. I've tried port forwarding both 4040 and 8080 with no…
gallamine
16 votes, 1 answer

What is rank in the ALS machine learning algorithm in Apache Spark MLlib

I wanted to try an example of the ALS machine learning algorithm, and my code works fine. However, I do not understand the parameter rank used in the algorithm. I have the following code in Java: // Build the recommendation model using ALS int rank = 10; …
hard coder
16 votes, 2 answers

Hadoop “Unable to load native-hadoop library for your platform” error on docker-spark?

I am using docker-spark. After starting spark-shell, it outputs: 15/05/21 04:28:22 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path 15/05/21 04:28:22 DEBUG…
Nan Xiao
16 votes, 3 answers

Apache Drill vs Spark

I have some experience with Apache Spark and Spark SQL. Recently I found the Apache Drill project. Could you describe the most significant advantages/differences between them? I've already read Fast Hadoop Analytics (Cloudera Impala vs…
Matzz
16 votes, 4 answers

How to read Avro file in PySpark

I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files. This is the closest solution that I have found in Spark's example folder. However, you need to submit this Python script using spark-submit. In the…
B.Mr.W.
16 votes, 3 answers

Why does a Spark RDD partition have a 2GB limit for HDFS?

I got an error when using MLlib RandomForest to train data. As my dataset is huge and the default partitions are relatively small, an exception was thrown indicating "Size exceeds Integer.MAX_VALUE"; the original stack trace is as follows,…
bourneli
16 votes, 7 answers

Spark: Obtaining file name in RDDs

I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody is trying to search for an invoice number, I should give them the list of files which have it. I was able to map and reduce the values…
Vipin Bhaskaran
16 votes, 3 answers

How to transpose an RDD in Spark

I have an RDD like this: 1 2 3 4 5 6 7 8 9 It is a matrix. Now I want to transpose the RDD like this: 1 4 7 2 5 8 3 6 9 How can I do this?
赵祥宇
16 votes, 4 answers

In Apache Spark, how to set worker/executor environment variables?

My Spark program on EMR constantly gets this error: Caused by: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated at sun.security.ssl.SSLSessionImpl.getPeerCertificates(SSLSessionImpl.java:421) at…
16 votes, 4 answers

What is the right way to save/load models in Spark/PySpark

I'm working with Spark 1.3.0 using PySpark and MLlib and I need to save and load my models. I use code like this (taken from the official documentation): from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating data =…
artemdevel
16 votes, 3 answers

Why would I want .union over .unionAll in Spark for SchemaRDDs?

I'm trying to wrap my head around these two functions in the Spark SQL documentation: def union(other: RDD[Row]): RDD[Row] Return the union of this RDD and another one. def unionAll(otherPlan: SchemaRDD): SchemaRDD Combines the tuples of two RDDs…
duber
16 votes, 3 answers

Spark Java Error: Size exceeds Integer.MAX_VALUE

I am trying to use Spark for a simple machine learning task. I used PySpark and Spark 1.2.0 to do a simple logistic regression problem. I have 1.2 million records for training, and I hashed the features of the records. When I set the number of…
16 votes, 1 answer

Co-partitioned joins in Spark SQL

Are there any implementations of Spark SQL DataSources that offer co-partitioned joins, most likely via CoGroupRDD? I did not see any uses within the existing Spark codebase. The motivation would be to greatly reduce the shuffle traffic in the…
WestCoastProjects
16 votes, 2 answers

Why does foreach not bring anything to the driver program?

I wrote this program in the Spark shell: val array = sc.parallelize(List(1, 2, 3, 4)) array.foreach(x => println(x)) This prints some debug statements but not the actual numbers. The code below works fine: for (num <- array.take(4)) { println(num) } I…
Knows Not Much
16 votes, 2 answers

How can I calculate exact median with Apache Spark?

This page contains some statistics functions (mean, stdev, variance, etc.) but it does not contain the median. How can I calculate the exact median?
pckmn