Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning and graph computing.
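
For example, a dataset can be loaded and cached once and then queried repeatedly from memory. A minimal Scala sketch using the DataFrame API (the file path and column name are hypothetical; rdd.cache() works the same way for RDDs):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CachedQueries").getOrCreate()

    // Load once and keep the data in cluster memory.
    val events = spark.read.parquet("/data/events")   // hypothetical path
    events.cache()

    // Repeated queries hit the in-memory copy instead of re-reading from disk.
    val total = events.count()
    events.groupBy("country").count().show()          // hypothetical column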

Spark can tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
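
As an illustration of the micro-batch approach, a minimal Structured Streaming sketch using the built-in rate source (so no external system is assumed), with a 5-second trigger and a 10-second event-time window:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, window}
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("RateStreamDemo").getOrCreate()

    // The "rate" source generates (timestamp, value) rows for testing.
    val stream = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

    // Windowed count over event time, recomputed every 5-second micro-batch.
    val counts = stream.groupBy(window(col("timestamp"), "10 seconds")).count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
      .awaitTermination()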

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16 votes, 2 answers
Spark Streaming mapWithState seems to rebuild complete state periodically
I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches. The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc…
Lawrence Benson

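For context, a minimal mapWithState sketch in the spirit of the question above (the tracking function simply keeps a running count per key; the input DStream is assumed to be built elsewhere):

    import org.apache.spark.streaming.{State, StateSpec}

    // Keeps a running count per key across batches.
    def trackStateFunc(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    val spec = StateSpec.function(trackStateFunc _).numPartitions(20)
    // Given a DStream[(String, Int)] named stream (e.g. from Kafka or a socket):
    // val stateful = stream.mapWithState(spec)
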
16 votes, 4 answers
What is a glom? How is it different from mapPartitions?
I've come across the glom() method on RDD. As per the documentation, "Return an RDD created by coalescing all elements within each partition into an array". Does glom shuffle the data across the partitions, or does it only return the partition data as…
nagendra

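For the glom question above, a minimal spark-shell sketch of the difference; neither operation shuffles data, both work per partition:

    val rdd = sc.parallelize(1 to 10, numSlices = 2)

    // glom: each partition becomes a single Array, giving one array per partition.
    rdd.glom().collect()
    // Array(Array(1, 2, 3, 4, 5), Array(6, 7, 8, 9, 10))

    // mapPartitions: you receive an iterator per partition and return a new iterator.
    rdd.mapPartitions(iter => Iterator(iter.sum)).collect()
    // Array(15, 40)
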
16 votes, 5 answers
Spark: check your cluster UI to ensure that workers are registered
I have a simple program in Spark: /* SimpleApp.scala */ import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf object SimpleApp { def main(args: Array[String]) { val conf = new…
vineet sinha

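That error usually means the application requested more memory or cores than any worker can offer, or that the master URL is wrong. A minimal configuration sketch (the host name and sizes are hypothetical; compare them against what the cluster UI reports):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SimpleApp")
      .setMaster("spark://master-host:7077")   // hypothetical standalone master URL
      .set("spark.executor.memory", "1g")      // keep below what each worker advertises
      .set("spark.cores.max", "2")             // keep below the cluster's free cores
    val sc = new SparkContext(conf)
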
16 votes, 1 answer
Use collect_list and collect_set in Spark SQL
According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image. I'm trying to do this in Scala: import…
JFX

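For reference, a spark-shell sketch of the two functions through the DataFrame API; in Spark 1.6 they required a HiveContext, while later versions work with a plain SparkSession. Column names are hypothetical:

    import org.apache.spark.sql.functions.{collect_list, collect_set}

    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("key", "value")

    df.groupBy("key")
      .agg(
        collect_list("value").as("all_values"),      // keeps duplicates: [1, 1, 2]
        collect_set("value").as("distinct_values")   // drops duplicates: [1, 2]
      )
      .show()
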
16 votes, 1 answer
How do I collect a single column in Spark?
I would like to perform an action on a single column. Unfortunately, after I transform that column, it is no longer part of the dataframe it came from but a Column object. As such, it cannot be collected. Here is an example: df =…
Michal

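The usual fix is to select the transformed column back into a DataFrame and collect that, rather than trying to collect the Column object itself. A spark-shell Scala sketch (the question uses PySpark; the column name and transformation are hypothetical):

    import org.apache.spark.sql.functions.upper

    val df = Seq("alice", "bob").toDF("name")

    // A Column on its own is just an expression; select() evaluates it against a DataFrame.
    df.select(upper($"name")).as[String].collect()
    // Array("ALICE", "BOB")
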
16 votes, 1 answer
Applying function to Spark Dataframe Column
Coming from R, I am used to easily doing operations on columns. Is there an easy way to take this function that I've written in Scala def round_tenths_place( un_rounded:Double ) : Double = { val rounded = BigDecimal(un_rounded).setScale(1,…

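The usual pattern is to wrap the Scala function in a UDF and apply it with withColumn. A spark-shell sketch based on the rounding function in the excerpt (the DataFrame and column names are hypothetical):

    import org.apache.spark.sql.functions.{col, udf}

    def round_tenths_place(un_rounded: Double): Double =
      BigDecimal(un_rounded).setScale(1, BigDecimal.RoundingMode.HALF_UP).toDouble

    val round_tenths = udf(round_tenths_place _)

    val df = Seq(1.26, 3.14159).toDF("value")
    df.withColumn("value_rounded", round_tenths(col("value"))).show()
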
16 votes, 5 answers
How to compute cumulative sum using Spark
I have an rdd of (String, Int) which is sorted by key: val data = Array(("c1",6), ("c2",3), ("c3",4)) val rdd = sc.parallelize(data).sortByKey Now I want the value for the first key to start at zero and each subsequent key to be the sum of the previous…
Knight71

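A spark-shell sketch using a window function over the key ordering; it sums all previous rows, so the first key starts at zero as the question asks (Window.unboundedPreceding needs Spark 2.1 or later; older versions can pass Long.MinValue instead):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{coalesce, lit, sum}

    val df = Seq(("c1", 6), ("c2", 3), ("c3", 4)).toDF("key", "value")

    // Sum of all *previous* rows ordered by key: 0 for c1, 6 for c2, 9 for c3.
    val w = Window.orderBy("key").rowsBetween(Window.unboundedPreceding, -1)
    df.withColumn("running_total", coalesce(sum("value").over(w), lit(0))).show()
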
16 votes, 1 answer
Spark ML indexer cannot resolve DataFrame column name with dots?
I have a DataFrame with a column named a.b. When I specify a.b as the input column name to a StringIndexer, I get an AnalysisException with the message "cannot resolve 'a.b' given input columns a.b". I'm using Spark 1.6.0. I'm aware that older versions of…
Joshua Taylor

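A common workaround is to rename the dotted column before handing it to the indexer, so nothing downstream has to escape the name. A spark-shell sketch (the column name comes from the question; the data is hypothetical):

    import org.apache.spark.ml.feature.StringIndexer

    val df = Seq("x", "y", "x").toDF("a.b")

    // Rename the dotted column, then index the renamed copy.
    val renamed = df.withColumnRenamed("a.b", "a_b")
    val indexer = new StringIndexer().setInputCol("a_b").setOutputCol("a_b_indexed")
    indexer.fit(renamed).transform(renamed).show()
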
16 votes, 3 answers
Spark 1.6: java.lang.IllegalArgumentException: spark.sql.execution.id is already set
I'm using Spark 1.6 and run into the issue above when I run the following code: // Imports import org.apache.spark.sql.hive.HiveContext import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql.SaveMode import…
sparknoob

16 votes, 2 answers
How do I convert a WrappedArray column in a Spark dataframe to Strings?
I am trying to convert a column which contains Array[String] to String, but I consistently get this error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 78.0 failed 4 times, most recent failure: Lost task 0.3…

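If the array holds strings, concat_ws can join the elements into one delimited string without a UDF. A spark-shell sketch (column names are hypothetical):

    import org.apache.spark.sql.functions.{col, concat_ws}

    val df = Seq((1, Seq("a", "b", "c"))).toDF("id", "tags")

    // concat_ws joins the array elements into a single comma-separated string.
    df.withColumn("tags_str", concat_ws(",", col("tags"))).show()
    // tags = [a, b, c]  ->  tags_str = "a,b,c"
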
16 votes, 2 answers
What is happening when Spark is calling ShuffleBlockFetcherIterator?
My Spark job seems to spend a lot of time getting blocks. Sometimes it will do this for an hour or two. I have 1 partition for my dataset, so I'm not sure why it's doing so much shuffling. Does anyone know what exactly is happening here? 15/12/16 18:05:27…
Instinct

16 votes, 6 answers
Flatten Nested Spark Dataframe
Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType,…
John

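A sketch of the schema-driven approach for nested structs: walk the schema recursively and alias each leaf as outer_inner. It handles StructType only; ArrayType and MapType columns would additionally need explode or similar handling:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Turn nested struct fields into top-level columns named "outer_inner".
    def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] =
      schema.fields.flatMap { field =>
        val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
        field.dataType match {
          case st: StructType => flattenSchema(st, name)
          case _              => Seq(col(name).alias(name.replace(".", "_")))
        }
      }

    def flatten(df: DataFrame): DataFrame = df.select(flattenSchema(df.schema): _*)
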
16 votes, 4 answers
PySpark 1.5: How to Truncate Timestamp to Nearest Minute from Seconds
I am using PySpark. I have a column ('dt') in a dataframe ('canon_evt') that is a timestamp. I am trying to remove seconds from a DateTime value. It is originally read in from parquet as a String. I then try to convert it to Timestamp…
PR102012

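In newer Spark versions date_trunc("minute", ...) does this directly; for older versions one option is to floor the epoch seconds to a 60-second boundary. A spark-shell Scala sketch of the latter (the question uses PySpark; the sample value is hypothetical):

    import org.apache.spark.sql.functions.{col, from_unixtime, unix_timestamp}

    val df = Seq("2015-12-16 18:05:27").toDF("dt_str")
      .withColumn("dt", unix_timestamp(col("dt_str")).cast("timestamp"))

    // Drop the seconds by flooring the epoch time to the nearest minute.
    df.withColumn(
      "dt_minute",
      from_unixtime(unix_timestamp(col("dt")) - unix_timestamp(col("dt")) % 60).cast("timestamp")
    ).show(false)
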
16 votes, 2 answers
Output from Dataproc Spark job in Google Cloud Logging
Is there a way to have the output from Dataproc Spark jobs sent to Google Cloud Logging? As explained in the Dataproc docs, the output from the job driver (the master for a Spark job) is available under Dataproc->Jobs in the console. There are two…

16 votes, 2 answers
Does SparkSQL support subquery?
I am running this query in the Spark shell but it gives me an error: sqlContext.sql( "select sal from samplecsv where sal < (select MAX(sal) from samplecsv)" ).collect().foreach(println) error: java.lang.RuntimeException: [1.47] failure: ``)'' expected…
Rinku Buragohain
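
Older Spark versions did not support subqueries in the WHERE clause (newer ones do for uncorrelated subqueries like this one). A sketch of the classic workaround, computing the scalar first and substituting it into the outer query (table and column names taken from the question):

    // Compute the maximum salary as a separate query.
    val maxSal = sqlContext.sql("select MAX(sal) from samplecsv").first().get(0)

    // Use the precomputed value in place of the subquery.
    sqlContext.sql(s"select sal from samplecsv where sal < $maxSal")
      .collect()
      .foreach(println)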