Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (behaviour and APIs can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

Latest version
Release Notes for Stable Releases
Apache Spark GitHub Repository

81095 questions
77 votes · 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca · 8,833 · 2 · 31 · 49
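
For the question above, a minimal PySpark sketch of one common approach (not necessarily the accepted answer; the "New York" temperatures are illustrative because the original snippet is truncated): wrap each Python list in a DenseVector with a UDF. Spark 3.1+ also ships pyspark.ml.functions.array_to_vector as a built-in alternative.

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.master("local[*]").getOrCreate()
source_data = [
    Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
    Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),  # illustrative values
]
df = spark.createDataFrame(source_data)

# Wrap each array<double> in a DenseVector so MLlib estimators accept the column
list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df_vec = df.withColumn("temperature_vector", list_to_vector("temperatures"))
df_vec.printSchema()
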
77 votes · 1 answer

Spark code organization and best practices

So, having spent many years in an object-oriented world with code reuse, design patterns and best practices always taken into account, I find myself struggling somewhat with code organization and code reuse in the world of Spark. If I try to write code…
76 votes · 2 answers

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame: a | b | c | 1 | 2 | 4 | 0 | null | null| null | 3 | 4 | And I want to replace null values only in the first 2 columns - Column "a" and "b": a | b | c | 1 | 2 | 4 …
Rakesh Adhikesavan · 11,966 · 18 · 51 · 76
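
For the question above, a short sketch of the usual fix: DataFrame.fillna takes a subset argument, so the replacement can be limited to chosen columns (assuming spark is an active SparkSession; the sample data mirrors the excerpt).

df = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)],
    ["a", "b", "c"],
)
# Replace nulls only in columns "a" and "b"; nulls in "c" are left untouched
df.fillna(0, subset=["a", "b"]).show()
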
76 votes · 4 answers

Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?

I am using Spark 1.5. I have two dataframes of the form: scala> libriFirstTable50Plus3DF res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int] scala> linkPersonItemLessThan500DF res2: org.apache.spark.sql.DataFrame =…
Christos Hadjinikolis · 2,099 · 3 · 20 · 46
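
This timeout usually comes from a broadcast join running past spark.sql.broadcastTimeout (300 seconds by default). A common mitigation, sketched here in PySpark with illustrative values, is to raise the timeout or to disable automatic broadcasting; in Spark 1.x the same keys can be set through sqlContext.setConf.

# Give slow broadcasts more time (value in seconds)
spark.conf.set("spark.sql.broadcastTimeout", "1200")

# Or disable automatic broadcast joins so Spark falls back to a shuffle join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
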
76 votes · 6 answers

What does setMaster `local[*]` mean in spark?

I found some code to start spark locally with: val conf = new SparkConf().setAppName("test").setMaster("local[*]") val ctx = new SparkContext(conf) What does the [*] mean?
Freewind · 193,756 · 157 · 432 · 708
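
In short, local[*] runs Spark in a single JVM with as many worker threads as there are logical cores; local[2] would use exactly two threads and local just one. A small PySpark equivalent of the snippet above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
# With local[*] this typically equals the number of logical cores on the machine
print(spark.sparkContext.defaultParallelism)
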
75 votes · 15 answers

How to flatten a struct in a Spark dataframe?

I have a dataframe with the following structure: |-- data: struct (nullable = true) | |-- id: long (nullable = true) | |-- keyNote: struct (nullable = true) | | |-- key: string (nullable = true) | | |-- note: string (nullable…
djWann · 2,017 · 4 · 31 · 36
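
One common approach, sketched in PySpark and assuming df has the schema shown in the question: selecting "struct.*" expands one level of nesting, and individual leaves can be pulled out and renamed explicitly.

from pyspark.sql import functions as F

# Expand the top-level fields of the "data" struct (nested structs remain structs)
flat = df.select("data.*")

# Or cherry-pick nested leaves and give them flat names
flat2 = df.select(
    F.col("data.id").alias("id"),
    F.col("data.keyNote.key").alias("keyNote_key"),
    F.col("data.keyNote.note").alias("keyNote_note"),
)
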
74 votes · 14 answers

How do I skip a header from CSV files in Spark?

Suppose I give three file paths to a Spark context to read and each file has a schema in the first row. How can we skip schema lines from headers? val rdd=sc.textFile("file1,file2,file3") Now, how can we skip header lines from this rdd?
Hafiz Mujadid · 1,565 · 1 · 15 · 27
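
Two usual answers, sketched in PySpark with placeholder file names and assuming spark/sc are already available: let the DataFrame reader consume the header, or, staying with RDDs, drop every line that equals the first line (this assumes all files share the same header).

# DataFrame API: the reader strips the header row of each file
df = spark.read.csv(["file1", "file2", "file3"], header=True, inferSchema=True)

# RDD API: filter out the header line
rdd = sc.textFile("file1,file2,file3")
header = rdd.first()
data = rdd.filter(lambda line: line != header)
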
73 votes · 10 answers

How to avoid duplicate columns after join?

I have two dataframes with the following columns: df1.columns // Array(ts, id, X1, X2) and df2.columns // Array(ts, id, Y1, Y2) After I do val df_combined = df1.join(df2, Seq(ts,id)) I end up with the following columns: Array(ts, id, X1, X2,…
Neel · 9,913 · 16 · 52 · 74
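
A PySpark sketch of the common remedies, assuming df1 and df2 are equivalents of the frames in the question: join on a list of column names so the keys appear only once, and rename (or drop) any other clashing columns.

# Joining on column names keeps a single ts/id pair in the result
joined = df1.join(df2, on=["ts", "id"], how="inner")

# If non-key columns also clash, rename them on one side before the join
joined2 = df1.join(df2.withColumnRenamed("Y1", "y1_right"), on=["ts", "id"])
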
73 votes · 7 answers

Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemRDDs to end up with only the different content from the first one val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) onlyNewData contains the rows in todaySchemRDD that do not…
Interfector · 1,868 · 1 · 23 · 43
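
The same idea carries over to DataFrames; a small PySpark sketch with hypothetical today_df and yesterday_df frames:

# Rows of today_df that are not present in yesterday_df (the result is de-duplicated)
only_new = today_df.subtract(yesterday_df)

# Spark 2.4+ also offers exceptAll, which preserves duplicate rows
only_new_all = today_df.exceptAll(yesterday_df)
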
72 votes · 6 answers

Converting Pandas dataframe into Spark dataframe error

I'm trying to convert Pandas DF into Spark one. DF…
Ivan Sudos · 1,423 · 2 · 13 · 25
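
The excerpt is truncated, but this error is very often a schema-inference problem (object-typed or mixed pandas columns). A hedged sketch of the usual fix, passing an explicit schema with illustrative column names, assuming spark is an active SparkSession:

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])
# An explicit schema avoids relying on type inference over the pandas dtypes
sdf = spark.createDataFrame(pdf, schema=schema)
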
72 votes · 6 answers

Spark : how to run spark file from spark shell

I am using CDH 5.2. I am able to use spark-shell to run the commands. How can I run the file(file.spark) which contain spark commands. Is there any way to run/compile the scala programs in CDH 5.2 without sbt?
Ramakrishna · 1,170 · 2 · 10 · 17
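
Two sbt-free options, assuming the file holds plain Scala/Spark statements (the path is a placeholder):

spark-shell -i /path/to/file.spark     # run the statements at startup, then keep the shell open

:load /path/to/file.spark              # from inside an already running spark-shell session
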
71 votes · 2 answers

Pyspark replace strings in Spark dataframe column

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this? In my current use case, I have a list of addresses that I want to normalize. For example this dataframe: id …
Luke · 6,699 · 13 · 50 · 88
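
A PySpark sketch of the usual approach with regexp_replace, assuming df has an address column; the patterns below are illustrative normalization rules.

from pyspark.sql import functions as F

# Chain one regexp_replace per rewrite rule
df_norm = df.withColumn(
    "address",
    F.regexp_replace(
        F.regexp_replace(F.col("address"), r"\bRoad\b", "Rd"),
        r"\bStreet\b", "St",
    ),
)
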
71 votes · 5 answers

How to query JSON data column using Spark DataFrames?

I have a Cassandra table that for simplicity looks something like: key: text jsonData: text blobData: blob I can create a basic data frame for this using spark and the spark-cassandra-connector using: val df = sqlContext.read …
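
One common approach, sketched in PySpark with illustrative JSON field names and df standing in for the frame read through the spark-cassandra-connector: parse the string column with from_json (Spark 2.1+) and a schema, or reach into it ad hoc with get_json_object.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

json_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

# Parse the whole column once, then query the resulting struct
parsed = df.withColumn("json", F.from_json(F.col("jsonData"), json_schema))
parsed.select("key", "json.id", "json.name").show()

# Or extract single values by JSON path without declaring a schema
df.select(F.get_json_object(F.col("jsonData"), "$.name").alias("name")).show()
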
71 votes · 5 answers

reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way reduceByKey function works in Spark. Suppose we have the following code: val lines = sc.textFile("data.txt") val pairs = lines.map(s => (s, 1)) val counts = pairs.reduceByKey((a, b) => a +…
user764186 · 951 · 2 · 9 · 12
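
In short: reduceByKey first merges the values for each key inside every partition (a map-side combine), and only those partial results are shuffled and merged again per key. A PySpark rendering of the snippet with that behaviour annotated (data.txt is a placeholder path, sc the SparkContext):

lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))

# 1. Within each partition, (a, b) -> a + b folds all values of a key into one partial sum.
# 2. Only the partial sums are shuffled, so each key ends up on a single reducer.
# 3. The same function then merges the partial sums into the final count per key.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.take(5))
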
70 votes · 4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector. Where the column type of "vector" is VectorUDT. An Example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
sedioben · 935 · 1 · 10 · 16
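
A hedged sketch of one way to do this on recent versions: Spark 3.0+ ships pyspark.ml.functions.vector_to_array (on older releases a UDF returning an array plays the same role). df is assumed to be the frame from the question and the element count is illustrative.

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# Convert the VectorUDT column to a plain array, then index the elements out
arr_df = df.withColumn("arr", vector_to_array(F.col("vector")))
num_elements = 6  # assumed width; use the real vector length here
split_df = arr_df.select(
    "word",
    *[F.col("arr")[i].alias(f"v{i + 1}") for i in range(num_elements)]
)
split_df.show()
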