Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using (behaviour and APIs can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

Latest version
Release Notes for Stable Releases
Apache Spark GitHub Repository

81095 questions
77 votes · 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca · 8,833 · 2 · 31 · 49
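
For the question above, a minimal PySpark sketch of one common approach (not necessarily the accepted answer; the "New York" temperatures are illustrative because the original snippet is truncated): wrap each Python list in a DenseVector with a UDF. Spark 3.1+ also ships pyspark.ml.functions.array_to_vector as a built-in alternative.

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

spark = SparkSession.builder.master("local[*]").getOrCreate()
source_data = [
    Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
    Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),  # illustrative values
]
df = spark.createDataFrame(source_data)

# Wrap each array<double> in a DenseVector so MLlib estimators accept the column
list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df_vec = df.withColumn("temperature_vector", list_to_vector("temperatures"))
df_vec.printSchema()
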
77 votes · 1 answer

Spark code organization and best practices

So, having spent many years in an object-oriented world with code reuse, design patterns and best practices always taken into account, I find myself struggling somewhat with code organization and code reuse in the world of Spark. If I try to write code…
76 votes · 2 answers

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame: a | b | c | 1 | 2 | 4 | 0 | null | null| null | 3 | 4 | And I want to replace null values only in the first 2 columns - Column "a" and "b": a | b | c | 1 | 2 | 4 …
Rakesh Adhikesavan · 11,966 · 18 · 51 · 76
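
For the question above, a short sketch of the usual fix: DataFrame.fillna takes a subset argument, so the replacement can be limited to chosen columns (assuming spark is an active SparkSession; the sample data mirrors the excerpt).

df = spark.createDataFrame(
    [(1, 2, 4), (0, None, None), (None, 3, 4)],
    ["a", "b", "c"],
)
# Replace nulls only in columns "a" and "b"; nulls in "c" are left untouched
df.fillna(0, subset=["a", "b"]).show()
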
76 votes · 4 answers

Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?

I am using Spark 1.5. I have two dataframes of the form: scala> libriFirstTable50Plus3DF res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int] scala> linkPersonItemLessThan500DF res2: org.apache.spark.sql.DataFrame =…
Christos Hadjinikolis · 2,099 · 3 · 20 · 46
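
This timeout usually comes from a broadcast join running past spark.sql.broadcastTimeout (300 seconds by default). A common mitigation, sketched here in PySpark with illustrative values, is to raise the timeout or to disable automatic broadcasting; in Spark 1.x the same keys can be set through sqlContext.setConf.

# Give slow broadcasts more time (value in seconds)
spark.conf.set("spark.sql.broadcastTimeout", "1200")

# Or disable automatic broadcast joins so Spark falls back to a shuffle join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
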
76 votes · 6 answers

What does setMaster `local[*]` mean in spark?

I found some code to start spark locally with: val conf = new SparkConf().setAppName("test").setMaster("local[*]") val ctx = new SparkContext(conf) What does the [*] mean?
Freewind · 193,756 · 157 · 432 · 708
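
In short, local[*] runs Spark in a single JVM with as many worker threads as there are logical cores; local[2] would use exactly two threads and local just one. A small PySpark equivalent of the snippet above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").master("local[*]").getOrCreate()
# With local[*] this typically equals the number of logical cores on the machine
print(spark.sparkContext.defaultParallelism)
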
75 votes · 15 answers

How to flatten a struct in a Spark dataframe?

I have a dataframe with the following structure: |-- data: struct (nullable = true) | |-- id: long (nullable = true) | |-- keyNote: struct (nullable = true) | | |-- key: string (nullable = true) | | |-- note: string (nullable…
djWann · 2,017 · 4 · 31 · 36
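
One common approach, sketched in PySpark and assuming df has the schema shown in the question: selecting "struct.*" expands one level of nesting, and individual leaves can be pulled out and renamed explicitly.

from pyspark.sql import functions as F

# Expand the top-level fields of the "data" struct (nested structs remain structs)
flat = df.select("data.*")

# Or cherry-pick nested leaves and give them flat names
flat2 = df.select(
    F.col("data.id").alias("id"),
    F.col("data.keyNote.key").alias("keyNote_key"),
    F.col("data.keyNote.note").alias("keyNote_note"),
)
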
74 votes · 14 answers

How do I skip a header from CSV files in Spark?

Suppose I give three file paths to a Spark context to read and each file has a schema in the first row. How can we skip schema lines from headers? val rdd=sc.textFile("file1,file2,file3") Now, how can we skip header lines from this rdd?
Hafiz Mujadid · 1,565 · 1 · 15 · 27
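
Two usual answers, sketched in PySpark with placeholder file names and assuming spark/sc are already available: let the DataFrame reader consume the header, or, staying with RDDs, drop every line that equals the first line (this assumes all files share the same header).

# DataFrame API: the reader strips the header row of each file
df = spark.read.csv(["file1", "file2", "file3"], header=True, inferSchema=True)

# RDD API: filter out the header line
rdd = sc.textFile("file1,file2,file3")
header = rdd.first()
data = rdd.filter(lambda line: line != header)
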
73 votes · 10 answers

How to avoid duplicate columns after join?

I have two dataframes with the following columns: df1.columns // Array(ts, id, X1, X2) and df2.columns // Array(ts, id, Y1, Y2) After I do val df_combined = df1.join(df2, Seq(ts,id)) I end up with the following columns: Array(ts, id, X1, X2,…
Neel · 9,913 · 16 · 52 · 74
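
A PySpark sketch of the common remedies, assuming df1 and df2 are equivalents of the frames in the question: join on a list of column names so the keys appear only once, and rename (or drop) any other clashing columns.

# Joining on column names keeps a single ts/id pair in the result
joined = df1.join(df2, on=["ts", "id"], how="inner")

# If non-key columns also clash, rename them on one side before the join
joined2 = df1.join(df2.withColumnRenamed("Y1", "y1_right"), on=["ts", "id"])
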
73 votes · 7 answers

Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemRDDs to end up with only the different content from the first one val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) onlyNewData contains the rows in todaySchemRDD that do not…
Interfector · 1,868 · 1 · 23 · 43
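
The same idea carries over to DataFrames; a small PySpark sketch with hypothetical today_df and yesterday_df frames:

# Rows of today_df that are not present in yesterday_df (the result is de-duplicated)
only_new = today_df.subtract(yesterday_df)

# Spark 2.4+ also offers exceptAll, which preserves duplicate rows
only_new_all = today_df.exceptAll(yesterday_df)
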
72 votes · 6 answers

Converting Pandas dataframe into Spark dataframe error

I'm trying to convert Pandas DF into Spark one. DF…
Ivan Sudos · 1,423 · 2 · 13 · 25
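
The excerpt is truncated, but this error is very often a schema-inference problem (object-typed or mixed pandas columns). A hedged sketch of the usual fix, passing an explicit schema with illustrative column names, assuming spark is an active SparkSession:

import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

pdf = pd.DataFrame({"name": ["a", "b"], "value": [1.0, 2.0]})

schema = StructType([
    StructField("name", StringType(), True),
    StructField("value", DoubleType(), True),
])
# An explicit schema avoids relying on type inference over the pandas dtypes
sdf = spark.createDataFrame(pdf, schema=schema)
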
72 votes · 6 answers

Spark : how to run spark file from spark shell

I am using CDH 5.2. I am able to use spark-shell to run the commands. How can I run the file(file.spark) which contain spark commands. Is there any way to run/compile the scala programs in CDH 5.2 without sbt?
Ramakrishna · 1,170 · 2 · 10 · 17
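
Two sbt-free options, assuming the file holds plain Scala/Spark statements (the path is a placeholder):

spark-shell -i /path/to/file.spark     # run the statements at startup, then keep the shell open

:load /path/to/file.spark              # from inside an already running spark-shell session
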
71 votes · 2 answers

Pyspark replace strings in Spark dataframe column

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this? In my current use case, I have a list of addresses that I want to normalize. For example this dataframe: id …
Luke · 6,699 · 13 · 50 · 88
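
A PySpark sketch of the usual approach with regexp_replace, assuming df has an address column; the patterns below are illustrative normalization rules.

from pyspark.sql import functions as F

# Chain one regexp_replace per rewrite rule
df_norm = df.withColumn(
    "address",
    F.regexp_replace(
        F.regexp_replace(F.col("address"), r"\bRoad\b", "Rd"),
        r"\bStreet\b", "St",
    ),
)
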
71 votes · 5 answers

How to query JSON data column using Spark DataFrames?

I have a Cassandra table that for simplicity looks something like: key: text jsonData: text blobData: blob I can create a basic data frame for this using spark and the spark-cassandra-connector using: val df = sqlContext.read …
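
One common approach, sketched in PySpark with illustrative JSON field names and df standing in for the frame read through the spark-cassandra-connector: parse the string column with from_json (Spark 2.1+) and a schema, or reach into it ad hoc with get_json_object.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

json_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
])

# Parse the whole column once, then query the resulting struct
parsed = df.withColumn("json", F.from_json(F.col("jsonData"), json_schema))
parsed.select("key", "json.id", "json.name").show()

# Or extract single values by JSON path without declaring a schema
df.select(F.get_json_object(F.col("jsonData"), "$.name").alias("name")).show()
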
71 votes · 5 answers

reduceByKey: How does it work internally?

I am new to Spark and Scala. I was confused about the way reduceByKey function works in Spark. Suppose we have the following code: val lines = sc.textFile("data.txt") val pairs = lines.map(s => (s, 1)) val counts = pairs.reduceByKey((a, b) => a +…
user764186 · 951 · 2 · 9 · 12
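
In short: reduceByKey first merges the values for each key inside every partition (a map-side combine), and only those partial results are shuffled and merged again per key. A PySpark rendering of the snippet with that behaviour annotated (data.txt is a placeholder path, sc the SparkContext):

lines = sc.textFile("data.txt")
pairs = lines.map(lambda s: (s, 1))

# 1. Within each partition, (a, b) -> a + b folds all values of a key into one partial sum.
# 2. Only the partial sums are shuffled, so each key ends up on a single reducer.
# 3. The same function then merges the partial sums into the final count per key.
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.take(5))
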
70 votes · 4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector. Where the column type of "vector" is VectorUDT. An Example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
sedioben · 935 · 1 · 10 · 16
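
A hedged sketch of one way to do this on recent versions: Spark 3.0+ ships pyspark.ml.functions.vector_to_array (on older releases a UDF returning an array plays the same role). df is assumed to be the frame from the question and the element count is illustrative.

from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# Convert the VectorUDT column to a plain array, then index the elements out
arr_df = df.withColumn("arr", vector_to_array(F.col("vector")))
num_elements = 6  # assumed width; use the real vector length here
split_df = arr_df.select(
    "word",
    *[F.col("arr")[i].alias(f"v{i + 1}") for i in range(num_elements)]
)
split_df.show()
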