Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning and graph computing.
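
For example, a dataset can be loaded and cached once and then queried repeatedly from memory. A minimal Scala sketch using the DataFrame API (the file path and column name are hypothetical; rdd.cache() works the same way for RDDs):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CachedQueries").getOrCreate()

    // Load once and keep the data in cluster memory.
    val events = spark.read.parquet("/data/events")   // hypothetical path
    events.cache()

    // Repeated queries hit the in-memory copy instead of re-reading from disk.
    val total = events.count()
    events.groupBy("country").count().show()          // hypothetical column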

Spark can tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
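
As an illustration of the micro-batch approach, a minimal Structured Streaming sketch using the built-in rate source (so no external system is assumed), with a 5-second trigger and a 10-second event-time window:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("RateStreamDemo").getOrCreate()

    // The "rate" source generates (timestamp, value) rows for testing.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Windowed count over event time, recomputed every 5-second micro-batch.
    val counts = stream.groupBy(window(col("timestamp"), "10 seconds")).count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
      .awaitTermination()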

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16 votes, 2 answers
Spark Streaming mapWithState seems to rebuild complete state periodically
I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches. The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc…
Lawrence Benson

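For context, a minimal mapWithState sketch in the spirit of the question above (the tracking function simply keeps a running count per key; the input DStream is assumed to be built elsewhere):

    import org.apache.spark.streaming.{State, StateSpec}

    // Keeps a running count per key across batches.
    def trackStateFunc(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    val spec = StateSpec.function(trackStateFunc _).numPartitions(20)
    // Given a DStream[(String, Int)] named stream (e.g. from Kafka or a socket):
    // val stateful = stream.mapWithState(spec)
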
16 votes, 4 answers
What is a glom? How is it different from mapPartitions?
I've come across the glom() method on RDD. As per the documentation, "Return an RDD created by coalescing all elements within each partition into an array". Does glom shuffle the data across the partitions, or does it only return the partition data as…
nagendra

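For the glom question above, a minimal spark-shell sketch of the difference; neither operation shuffles data, both work per partition:

    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    // glom: each partition becomes a single Array, giving one array per partition.
    rdd.glom().collect()
    // Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))

    // mapPartitions: you receive an iterator per partition and return a new iterator.
    rdd.mapPartitions(iter => Iterator(iter.sum)).collect()
    // Array(15, 40)
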
16 votes, 5 answers
Spark: check your cluster UI to ensure that workers are registered
I have a simple program in Spark: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val conf = new…
vineet sinha

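That error usually means the application requested more memory or cores than any worker can offer, or that the master URL is wrong. A minimal configuration sketch (the host name and sizes are hypothetical; compare them against what the cluster UI reports):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
      .set("spark.executor.memory", "1g")      // keep below what each worker advertises
      .set("spark.cores.max", "2")             // keep below the cluster's free cores
    val sc = new SparkContext(conf)
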
16 votes, 1 answer
Use collect_list and collect_set in Spark SQL
According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image. I'm trying to do this in Scala: import…
JFX

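For reference, a spark-shell sketch of the two functions through the DataFrame API; in Spark 1.6 they required a HiveContext, while later versions work with a plain SparkSession. Column names are hypothetical:

    import org.apache.spark.sql.functions.{collect_list, collect_set}

    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    df.groupBy("key")
      .agg(
        collect_list("value").as("all_values"),      // keeps duplicates: [1, 1, 2]
        collect_set("value").as("distinct_values")   // drops duplicates: [1, 2]
      )
      .show()
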
16 votes, 1 answer
How do I collect a single column in Spark?
I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot be collected. Here is an example: df =…
Michal

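The usual fix is to select the transformed column back into a DataFrame and collect that, rather than trying to collect the Column object itself. A spark-shell Scala sketch (the question uses PySpark; the column name and transformation are hypothetical):

    import org.apache.spark.sql.functions.upper

    val df = Seq("alice", "bob").toDF("name")

    // A Column on its own is just an expression; select() evaluates it against a DataFrame.
    df.select(upper($"name")).as[String].collect()
    // Array("ALICE", "BOB")
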
16 votes, 1 answer
Applying function to Spark Dataframe Column
Coming from R, I am used to easily doing operations on columns. Is there an easy way to take this function that I've written in Scala def round_tenths_place( un_rounded:Double ) : Double = { val rounded = BigDecimal(un_rounded).setScale(1,…

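The usual pattern is to wrap the Scala function in a UDF and apply it with withColumn. A spark-shell sketch based on the rounding function in the excerpt (the DataFrame and column names are hypothetical):

    import org.apache.spark.sql.functions.{col, udf}

    def round_tenths_place(un_rounded: Double): Double =
      BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble

    val round_tenths = udf(round_tenths_place _)

    val df = Seq(1.26, 3.14159).toDF("value")
    df.withColumn("value_rounded", round_tenths(col("value"))).show()
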
16 votes, 5 answers
How to compute cumulative sum using Spark
I have an rdd of (String, Int) which is sorted by key: val data = Array(("c1",6), ("c2",3), ("c3",4)) val rdd = sc.parallelize(data).sortByKey Now I want the value for the first key to start at zero and each subsequent key to be the sum of the previous…
Knight71

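A spark-shell sketch using a window function over the key ordering; it sums all previous rows, so the first key starts at zero as the question asks (Window.unboundedPreceding needs Spark 2.1 or later; older versions can pass Long.MinValue instead):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{coalesce, lit, sum}

    val df = Seq(("c1", 6), ("c2", 3), ("c3", 4)).toDF("key", "value")

    // Sum of all *previous* rows ordered by key: 0 for c1, 6 for c2, 9 for c3.
    val w = Window.orderBy("key").rowsBetween(Window.unboundedPreceding, -1)
    df.withColumn("running_total", coalesce(sum("value").over(w), lit(0))).show()
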
16 votes, 1 answer
Spark ML indexer cannot resolve DataFrame column name with dots?
I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, I get an AnalysisException with the message "cannot resolve 'a.b' given input columns a.b". I'm using Spark 1.6.0. I'm aware that older versions of…
Joshua Taylor

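A common workaround is to rename the dotted column before handing it to the indexer, so nothing downstream has to escape the name. A spark-shell sketch (the column name comes from the question; the data is hypothetical):

    import org.apache.spark.ml.feature.StringIndexer

    val df = Seq("x", "y", "x").toDF("a.b")

    // Rename the dotted column, then index the renamed copy.
    val renamed = df.withColumnRenamed("a.b", "a_b")
    val indexer = new StringIndexer().setInputCol("a_b").setOutputCol("a_b_indexed")
    indexer.fit(renamed).transform(renamed).show()
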
16 votes, 3 answers
Spark 1.6: java.lang.IllegalArgumentException: spark.sql.execution.id is already set
I'm using Spark 1.6 and run into the issue above when I run the following code: // Imports import org.apache.spark.sql.hive.HiveContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SaveMode import…
sparknoob

16 votes, 2 answers
How do I convert a WrappedArray column in a Spark dataframe to Strings?
I am trying to convert a column which contains Array[String] to String, but I consistently get this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 78.0 failed 4 times, most recent failure: Lost task 0.3…

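If the array holds strings, concat_ws can join the elements into one delimited string without a UDF. A spark-shell sketch (column names are hypothetical):

    import org.apache.spark.sql.functions.{col, concat_ws}

    val df = Seq((1, Seq("a", "b", "c"))).toDF("id", "tags")

    // concat_ws joins the array elements into a single comma-separated string.
    df.withColumn("tags_str", concat_ws(",", col("tags"))).show()
    // tags = [a, b, c]  ->  tags_str = "a,b,c"
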
16 votes, 2 answers
What is happening when Spark is calling ShuffleBlockFetcherIterator?
My Spark job seems to spend a lot of time getting blocks. Sometimes it will do this for an hour or two. I have 1 partition for my dataset, so I'm not sure why it's doing so much shuffling. Does anyone know what exactly is happening here? 15/12/16 18:05:27…
Instinct

16 votes, 6 answers
Flatten Nested Spark Dataframe
Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType,…
John

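A sketch of the schema-driven approach for nested structs: walk the schema recursively and alias each leaf as outer_inner. It handles StructType only; ArrayType and MapType columns would additionally need explode or similar handling:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Turn nested struct fields into top-level columns named "outer_inner".
    def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
      schema.fields.flatMap { field =>
        val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
        field.dataType match {
          case st: StructType => flattenSchema(st, name)
          case _              => Seq(col(name).alias(name.replace(".", "_")))
        }
      }

    def flatten(df: DataFrame): DataFrame = df.select(flattenSchema(df.schema): _*)
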
16 votes, 4 answers
PySpark 1.5: How to Truncate Timestamp to Nearest Minute from Seconds
I am using PySpark. I have a column ('dt') in a dataframe ('canon_evt') that is a timestamp. I am trying to remove seconds from a DateTime value. It is originally read in from parquet as a String. I then try to convert it to Timestamp…
PR102012

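In newer Spark versions date_trunc("minute", ...) does this directly; for older versions one option is to floor the epoch seconds to a 60-second boundary. A spark-shell Scala sketch of the latter (the question uses PySpark; the sample value is hypothetical):

    import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

    val df = Seq("2015-12-16 18:05:27").toDF("dt_str")
      .withColumn("dt", unix_timestamp(col("dt_str")).cast("timestamp"))

    // Drop the seconds by flooring the epoch time to the nearest minute.
    df.withColumn(
      "dt_minute",
      from_unixtime(unix_timestamp(col("dt")) - unix_timestamp(col("dt")) % 60).cast("timestamp")
    ).show(false)
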
16 votes, 2 answers
Output from Dataproc Spark job in Google Cloud Logging
Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two…

16 votes, 2 answers
Does SparkSQL support subquery?
I am running this query in the Spark shell but it gives me an error: sqlContext.sql( "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)" ).collect().foreach(println) error: java.lang.RuntimeException: [1.47] failure: ``)'' expected…
Rinku Buragohain
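
Older Spark versions did not support subqueries in the WHERE clause (newer ones do for uncorrelated subqueries like this one). A sketch of the classic workaround, computing the scalar first and substituting it into the outer query (table and column names taken from the question):

    // Compute the maximum salary as a separate query.
    val maxSal = sqlContext.sql("select MAX(sal) from samplecsv").first().get(0)

    // Use the precomputed value in place of the subquery.
    sqlContext.sql(s"select sal from samplecsv where sal < $maxSal")
      .collect()
      .foreach(println)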