Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark often are related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimizing arbitrary long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like hadoop.

Spark is not tied to the two-stage mapreduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allows user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on ...).

To make programming faster, Spark provides clean, concise APIs in scala, java, python and r. You can also use Spark interactively from the scala, python and r shells to rapidly query big datasets.

Spark runs on yarn, mesos, kubernetes, standalone, or in the cloud. It can access diverse data sources including hdfs, cassandra, hbase, amazon-s3 and google-cloud-platform.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since different versions can often disagree). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

Spark Documentation
Spark Programming Guide - Shows each of these features in each of Spark’s supported languages (Python, Scala, and Java)
Spark-Summit Past Events Online materials of spark training courses and keynotes (please refer to the "PAST EVENTS" tab at the top)
Awesome Spark - Awesome collection of resources by GitHub Apache Spark Community
Mastering Apache Spark 2 - Notes on the internals of Apache Spark, Spark SQL and Spark MLlib
Learning Spark - Lightning-Fast big data analysis
AMP Camp 6 (Berkeley, CA, November 19-20, 2015)
AMP Camp 5 (Berkeley, CA, November 20-21, 2014)
AMP Camp 4 (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, and Tachyon
AMP Camp 3 (Berkeley, CA, Aug 2013)
AMP Camp 2 (Strata Santa Clara, Feb 2013)
AMP Camp 1 (Berkeley, CA, Aug 2012)

81095 questions

votes

4 answers

Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be on which is faster, but I did some testing myself and found the spark functions to be about 10 times…

performance apache-spark pyspark apache-spark-sql user-defined-functions

asked Jul 10 '16 at 21:26

alfredox

4,082
6
21
29

votes

3 answers

How to convert column with string type to int form in pyspark data frame?

I have dataframe in pyspark. Some of its numerical columns contain nan so when I am reading the data and checking for the schema of dataframe, those columns will have string type. How I can change them to int type. I replaced the nan values with 0…

python dataframe apache-spark pyspark apache-spark-sql

asked Oct 26 '17 at 13:43

neha

1,858
5
21
35

votes

8 answers

How to list all cassandra tables

There are many tables in cassandra database, which contain column titled user_id. The values user_id are referred to user stored in table users. As some users are deleted, I would like to delete orphan records in all tables that contain column…

scala apache-spark cassandra spark-cassandra-connector

asked Aug 01 '16 at 10:16

Niko Gamulin

66,025
95
221
286

votes

5 answers

How to use Column.isin with list?

val items = List("a", "b", "c") sqlContext.sql("select c1 from table") .filter($"c1".isin(items)) .collect .foreach(println) The code above throws the following exception. Exception in thread "main"…

scala apache-spark apache-spark-sql

asked Sep 13 '15 at 16:32

Nabegh

3,249
6
25
26

votes

9 answers

how to make saveAsTextFile NOT split output into multiple file?

When using Scala in Spark, whenever I dump the results out using saveAsTextFile, it seems to split the output into multiple parts. I'm just passing a parameter(path) to it. val year =…

scala apache-spark

asked Jun 23 '14 at 16:52

user2773013

3,102
8
38
58

votes

9 answers

How to prevent java.lang.OutOfMemoryError: PermGen space at Scala compilation?

I have noticed a strange behavior of my scala compiler. It occasionally throws an OutOfMemoryError when compiling a class. Here's the error message: [info] Compiling 1 Scala source to…

scala apache-spark memory-management sbt scalatra-sbt

asked Nov 30 '11 at 18:41

BumbleGee

2,031
3
18
18

votes

7 answers

How to loop through each row of dataFrame in pyspark

E.g sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints theentire table on terminal. But I want to access each row in that table using for or while to perform further…

apache-spark dataframe for-loop pyspark apache-spark-sql

asked Apr 01 '16 at 06:15

Arti Berde

1,182
1
11
23

votes

3 answers

How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: steps required to read and write data using JDBC connections in PySpark possible issues with JDBC sources and know solutions With small changes these methods should work with other supported languages…

python scala apache-spark apache-spark-sql pyspark

asked Jun 22 '15 at 15:30

zero323

322,348
103
959
935

votes

4 answers

Which cluster type should I choose for Spark?

I am new to Apache Spark, and I just learned that Spark supports three types of cluster: Standalone - meaning Spark will manage its own cluster YARN - using Hadoop's YARN resource manager Mesos - Apache's dedicated resource manager project I think…

apache-spark hadoop-yarn mesos apache-spark-standalone

asked Feb 22 '15 at 23:44

David S.

10,578
12
62
104

votes

8 answers

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name. But how to do the same when it's a column of Spark dataframe? E.g. the calling program has a Spark dataframe: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function:…

dataframe apache-spark pyspark apache-spark-sql

asked Sep 28 '16 at 11:55

Kaushik Acharya

1,520
2
16
25

votes

8 answers

Median / quantiles within PySpark groupBy

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark…

apache-spark pyspark group-by apache-spark-sql median

asked Oct 20 '17 at 08:58

abeboparebop

7,396
6
37
46

votes

4 answers

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'. I have tried: import pyspark.sql.functions as…

python apache-spark pyspark apache-spark-sql

asked Jan 27 '17 at 08:49

gaatjeniksaan

1,412
2
12
17

votes

12 answers

Rename more than one column using withColumnRenamed

I want to change names of two columns using spark withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2']) data = (data .withColumnRenamed('x1','x3') .withColumnRenamed('x2',…

apache-spark pyspark apache-spark-sql rename

asked Aug 05 '16 at 22:30

user2280549

1,204
2
12
19

votes

4 answers

PySpark: java.lang.OutofMemoryError: Java heap space

I have been using PySpark with Ipython lately on my server with 24 CPUs and 32GB RAM. Its running only on one machine. In my process, I want to collect huge amount of data as is give in below code: train_dataRDD = (train.map(lambda…

java apache-spark out-of-memory heap-memory pyspark

asked Sep 01 '15 at 16:45

pg2455

5,039
14
51
78

votes

3 answers

Querying Spark SQL DataFrame with complex types

How Can I query an RDD with complex types such as maps/arrays? for example, when I was writing this test code: case class Test(name: String, map: Map[String, String]) val map = Map("hello" -> "world", "hey" -> "there") val map2 = Map("hello" ->…

sql scala apache-spark dataframe apache-spark-sql

asked Feb 04 '15 at 22:12

dvir

2,546
2
18
15

Prev 1 2 3

…

99 100 Next