Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark often are related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimizing arbitrary long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like hadoop.

Spark is not tied to the two-stage mapreduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on ...).

To make programming faster, Spark provides clean, concise APIs in scala, java, python and r. You can also use Spark interactively from the scala, python and r shells to rapidly query big datasets.

Spark runs on yarn, mesos, kubernetes, standalone, or in the cloud. It can access diverse data sources including hdfs, cassandra, hbase, amazon-s3 and google-cloud-platform.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since different versions can often disagree). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

Spark Documentation
Spark Programming Guide - Shows each of these features in each of Spark’s supported languages (Python, Scala, and Java)
Spark-Summit Past Events Online materials of spark training courses and keynotes (please refer to the "PAST EVENTS" tab at the top)
Awesome Spark - Awesome collection of resources by GitHub Apache Spark Community
Mastering Apache Spark 2 - Notes on the internals of Apache Spark, Spark SQL and Spark MLlib
Learning Spark - Lightning-Fast big data analysis
AMP Camp 6 (Berkeley, CA, November 19-20, 2015)
AMP Camp 5 (Berkeley, CA, November 20-21, 2014)
AMP Camp 4 (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, and Tachyon
AMP Camp 3 (Berkeley, CA, Aug 2013)
AMP Camp 2 (Strata Santa Clara, Feb 2013)
AMP Camp 1 (Berkeley, CA, Aug 2012)

81095 questions

votes

1 answer

Exception with Table identified via AWS Glue Crawler and stored in Data Catalog

I'm working to build the new data lake of the company and are trying to find the best and the most recent option to work here. So, I found a pretty nice solution to work with EMR + S3 + Athena + Glue. The process that I did was: 1 - Run Apache Spark…

amazon-web-services apache-spark amazon-s3 amazon-emr aws-glue

asked Aug 18 '17 at 04:55

Thiago Baldim

7,362
3
29
51

votes

3 answers

Connecting to remote master on standalone Spark

I launch Spark in standalone mode on my remote server via following next steps: cp spark-env.sh.template spark-env.sh append to spark-env.sh SPARK_MASTER_HOST=IP_OF_MY_REMOTE_SERVER and run next commands for standalone mode: sbin/start-master.sh…

scala apache-spark

asked Aug 15 '17 at 20:08

pacman

votes

1 answer

Executing separate streaming queries in spark structured streaming

I am trying to aggregate stream with two different windows and printing it into the console. However only the first streaming query is being printed. The tenSecsQ is not printed into the console. SparkSession spark = SparkSession .builder() …

apache-spark spark-structured-streaming

asked Aug 10 '17 at 16:01

atom

votes

2 answers

Spark Data frame search column starting with a string

I have a requirement to filter a data frame based on a condition that a column value should starts with a predefined string. I am trying following: val domainConfigJSON = sqlContext.read .jdbc(url, "CONFIG", prop) .select("DID", "CONF",…

apache-spark apache-spark-sql

asked Aug 07 '17 at 17:13

Anush

votes

9 answers

Scala Error: Could not find or load main class in both Scala IDE and Eclipse

Here is my problem, I know there are lots of answers for similar questions, however none of them worked after I tried. I'm using both Scala IDE 4.6 and eclipse Oxygen to run the code and all failed on this error. Here's my scala compiler…

java eclipse scala apache-spark scala-ide

asked Aug 07 '17 at 05:47

SKSKSKSK

votes

3 answers

Spark: Most efficient way to sort and partition data to be written as parquet

My data is in principle a table, which contains a column ID and a column GROUP_ID, besides other 'data'. In the first step I am reading CSV's into Spark, do some processing to prepare the data for the second step, and write the data as parquet. The…

apache-spark pyspark apache-spark-sql

asked Jul 20 '17 at 20:43

akoeltringer

1,671
3
19
34

votes

2 answers

How to use collect_set and collect_list functions in windowed aggregation in Spark 1.6?

In Spark 1.6.0 / Scala, is there an opportunity to get collect_list("colC") or collect_set("colC").over(Window.partitionBy("colA").orderBy("colB")?

scala apache-spark apache-spark-sql apache-spark-1.6

asked Jul 16 '17 at 17:27

Dzmitry Haikov

votes

3 answers

Spark SQL change format of the number

scala apache-spark apache-spark-sql

asked Jul 10 '17 at 08:51

Cherry

31,309
66
224
364

votes

2 answers

Empty output for Watermarked Aggregation Query in Append Mode

I use Spark 2.2.0-rc1. I've got a Kafka topic which I'm querying a running watermarked aggregation, with a 1 minute watermark, giving out to console with append output mode. import org.apache.spark.sql.types._ val schema =…

scala apache-spark spark-structured-streaming

asked Jun 07 '17 at 04:45

himanshuIIITian

5,985
6
50
70

votes

2 answers

How to run Scala script using spark-submit (similarly to Python script)?

I try to execute a simple Scala script using Spark as described in the Spark Quick Start Tutorial. I have not troubles to execute the following Python code: """SimpleApp.py""" from pyspark import SparkContext logFile = "tmp.txt" # Should be some…

scala apache-spark

asked Jun 03 '17 at 17:31

Roman

124,451
167
349
456

votes

2 answers

What is StringIndexer , VectorIndexer, and how to use them?

Dataset dataFrame = ... ; StringIndexerModel labelIndexer = new StringIndexer() .setInputCol("label") .setOutputCol("indexedLabel") .fit(dataFrame); VectorIndexerModel featureIndexer = new…

apache-spark dataset apache-spark-sql

asked May 26 '17 at 07:00

Manikandan Balasubramanian

1,079
4
14
27

votes

6 answers

Spark Scala Split dataframe into equal number of rows

I have a Dataframe and wish to divide it into an equal number of rows. In other words, I want a list of dataframes where each one is a disjointed subset of the original dataframe. Let's say the input dataframer is the following: …

scala apache-spark dataframe

asked May 23 '17 at 12:59

Alessandro La Corte

votes

2 answers

When are cache and persist executed (since they don't seem like actions)?

I am implementing a spark application, of which below is a sample snippet(Not the exact same code): val rdd1 = sc.textfile(HDFS_PATH) val rdd2 = rdd1.map(func) rdd2.persist(StorageLevel.MEMORY_AND_DISK) println(rdd2.count) On checking the…

scala apache-spark lazy-evaluation

asked May 16 '17 at 12:56

Ankit Khettry

votes

2 answers

How to split pipe-separated column into multiple rows?

I have a dataframe that contains the following: movieId / movieName / genre 1 example1 action|thriller|romance 2 example2 fantastic|action I would like to obtain a second dataframe (from the first one), that contains the…

apache-spark apache-spark-sql

asked May 14 '17 at 13:41

Lechucico

1,914
7
27
60

votes

1 answer

Apache Spark: User Memory vs Spark Memory

I'm building a Spark application where I have to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager introduced in Spark 1.6 here: https://0x0fff.com/spark-memory-management/ It shows also this picture: The author differs…

caching apache-spark memory memory-management rdd

asked May 03 '17 at 09:46

D. Müller

3,336
4
36
84

Prev 1 2 3

…

100