Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning or graph computing.
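
A minimal illustration of this pattern in PySpark (a sketch; the Parquet file and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    # Load once and keep the data in cluster memory
    events = spark.read.parquet("events.parquet").cache()

    # Repeated queries reuse the cached data instead of re-reading from disk
    events.filter(events.status == "error").count()
    events.groupBy("user_id").count().show()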

Spark can be used to tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries over streams, windowing on data and on streams, running ML libraries to learn from streamed data, and so on.
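
A small Structured Streaming sketch of the micro-batch approach (the socket source on localhost:9999 is purely illustrative):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("streaming-example").getOrCreate()

    # Read lines from a socket as an unbounded, streaming DataFrame
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Ordinary DataFrame operations apply to the stream
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Micro-batch execution: the result table is updated as new data arrives
    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()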

To make programming faster, Spark provides clean, concise APIs in Scala, Java, Python, and R. You can also use Spark interactively from the Scala, Python, and R shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MCVE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
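
For example, a minimal reproducible question usually states the version, includes a tiny self-contained input, the attempted code, and the expected output (the data below is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)  # always state the version you tested on

    # Small, self-contained input data
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])

    # What was tried
    df.groupBy("label").count().show()

    # Paste the expected and actual output below the code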

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
112 votes, 3 answers

pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work: # define a dataframe rdd = sc.parallelize([(0,1), (0,1),…
user3133475
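
A common approach to this one is Column.isin (a sketch; the column names and the list of allowed values are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0, 1), (0, 2), (1, 3)], ["a", "b"])

    allowed = [1, 3]  # hypothetical list of values to keep

    # Keep only rows whose value of b appears in the list
    included = df.filter(col("b").isin(allowed))

    # Or exclude those rows instead
    excluded = df.filter(~col("b").isin(allowed))
    included.show()
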
112 votes, 10 answers

Extract column values of Dataframe as List in Apache Spark

I want to convert a string column of a data frame to a list. What I can find from the Dataframe API is RDD, so I tried converting it back to RDD first, and then applying the toArray function to the RDD. In this case, the length and SQL work just fine.…
SH Y.
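
One common way is to collect the single column and unpack the Row objects (a sketch assuming a string column named name):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # collect() returns a list of Rows; pull the field out of each one
    names = [row["name"] for row in df.select("name").collect()]

    # Equivalent route through the underlying RDD
    names_rdd = df.select("name").rdd.flatMap(lambda row: row).collect()
    print(names)  # ['alice', 'bob']
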
110 votes, 2 answers

How to tune spark executor number, cores and executor memory?

Where do you start to tune the above-mentioned params? Do we start with executor memory and get the number of executors, or do we start with cores and get the executor number? I followed the link. However, I got a high-level idea, but am still not sure how or…
Ramzy
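
These values are usually set through configuration rather than code; a sketch of passing them when building a session (the numbers are placeholders, not sizing advice):

    from pyspark.sql import SparkSession

    # Hypothetical sizing: 10 executors with 5 cores and 8 GB each.
    # The same keys can be passed to spark-submit via --conf.
    spark = (SparkSession.builder
             .appName("sizing-example")
             .config("spark.executor.instances", "10")
             .config("spark.executor.cores", "5")
             .config("spark.executor.memory", "8g")
             .getOrCreate())
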
110 votes, 1 answer

Is there a way to take the first 1000 rows of a Spark Dataframe?

I am using the randomSplit function to get a small amount of a dataframe to use in dev purposes and I end up just taking the first df that is returned by this function. val df_subset = data.randomSplit(Array(0.00000001, 0.01), seed = 12345)(0) If I…
Michael Discenza
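
For a deterministic slice rather than a random sample, limit is the usual tool (a sketch; spark.range stands in for the real data):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.range(100000).toDF("id")  # stand-in for the real DataFrame

    # First 1000 rows as a new DataFrame (stays distributed)
    df_subset = data.limit(1000)

    # Or bring them to the driver as a list of Row objects
    first_rows = data.take(1000)
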
110 votes, 9 answers

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This will give me: group SUM(money#2L) A …
cantdutchthis
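
A usual pattern is agg with column expressions plus alias (a sketch with made-up data):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("A", 10), ("A", 5), ("B", 7)], ["group", "money"])

    # alias() controls the name of the aggregated column
    df.groupBy("group").agg(F.sum("money").alias("money_sum")).show()
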
107 votes, 11 answers

Spark Error - Unsupported class file major version

I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my anaconda environment and am using PyCharm for development. I've exported to my bash profile: export SPARK_VERSION=`ls…
shbfy
107 votes, 5 answers

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very…
Peter Gaultney
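
One way to do this without changing the row count is functions.split plus getItem (a sketch assuming a space-separated string column named value):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a b c",), ("d e f",)], ["value"])

    parts = F.split(F.col("value"), " ")
    result = (df
              .withColumn("col1", parts.getItem(0))
              .withColumn("col2", parts.getItem(1))
              .withColumn("col3", parts.getItem(2)))
    result.show()
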
106 votes, 6 answers

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. As of now I have come up with the following code, which only replaces a single column name. for( i <- 0 to origCols.length - 1) { df.withColumnRenamed( df.columns(i),…
Sam
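
The same idea expressed in PySpark, since toDF accepts a full list of replacement names (a sketch; the new names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

    new_names = ["x", "y", "z"]  # hypothetical replacement headers

    # toDF with the complete list renames every column at once
    renamed = df.toDF(*new_names)
    renamed.printSchema()
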
105 votes, 11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can directly save the dataframe to Hive
Gourav
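
With Hive support enabled, saveAsTable writes a DataFrame to Hive directly (a sketch; the database and table names are made up, and the deployment must have Hive available):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-example")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Writes a managed Hive table; mode can be "overwrite", "append", etc.
    df.write.mode("overwrite").saveAsTable("mydb.my_table")
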
103 votes, 12 answers

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np data = [ (1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float("nan")), (1, 6, float("nan")), ] df = spark.createDataFrame(data, ("session", "timestamp1",…
GeorgeOfTheRF
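
A common single-pass approach builds one conditional count per column (a sketch with small made-up data; note that isnan only applies to numeric columns):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1.0, None), (1, 2.0, 5.0), (1, float("nan"), None)],
        ("session", "timestamp1", "id2"))

    # Count, per column, the rows that are NaN or null
    df.select([
        F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
        for c in df.columns
    ]).show()
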
102 votes, 14 answers

Overwrite specific partitions in spark dataframe write method

I want to overwrite specific partitions instead of all in spark. I am trying the following command: df.write.orc('maprfs:///hdfs-base-path','overwrite',partitionBy='col4') where df is dataframe having the incremental data to be…
yatin
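
Since Spark 2.3 this is commonly handled with dynamic partition overwrite mode (a sketch; the path and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Only partitions present in df are overwritten; other partitions are kept
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame([(1, "2024-01-01"), (2, "2024-01-02")], ["id", "col4"])

    (df.write
       .mode("overwrite")
       .partitionBy("col4")
       .orc("/tmp/base-path"))
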
101 votes, 1 answer

At what situation I can use Dask instead of Apache Spark?

I am currently using Pandas and Spark for data analysis. I found Dask provides parallelized NumPy array and Pandas DataFrame. Pandas is easy and intuitive for doing data analysis in Python. But I find difficulty in handling multiple bigger…
Hariprasad
101 votes, 14 answers

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext. If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
whisperstream
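
The settings known to the driver can be read back from the SparkContext's SparkConf (a sketch; this returns values that were explicitly set or injected by the launcher, not every built-in default):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # All key/value pairs currently held by the SparkConf
    for key, value in spark.sparkContext.getConf().getAll():
        print(key, "=", value)

    # Runtime SQL settings can be read individually
    print(spark.conf.get("spark.sql.shuffle.partitions"))
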
100 votes, 5 answers

Apache Spark: How to use pyspark with Python 3

I built Spark 1.4 from the GH development master, and the build went through fine. But when I do a bin/pyspark I get the Python 2.7.9 version. How can I change this?
tchakravarty
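
The interpreter Spark uses is controlled by environment variables that must be set before the session starts (a sketch; "python3" is whatever interpreter you want resolved on your machine):

    import os

    # Must be set before the JVM gateway / workers are launched
    os.environ["PYSPARK_PYTHON"] = "python3"          # worker-side interpreter
    os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"   # driver-side interpreter

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.sparkContext.pythonVer)  # e.g. "3.10"
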
99 votes, 3 answers

How does createOrReplaceTempView work in Spark?

I am new to Spark and Spark SQL. How does createOrReplaceTempView work in Spark? If we register an RDD of objects as a table, will Spark keep all the data in memory?
Abir Chokraborty
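
The view is only a name bound to the DataFrame's lazy logical plan; nothing is materialized or cached unless you ask for it explicitly (a small sketch):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Registers the name in the session catalog; no data is pulled into memory here
    df.createOrReplaceTempView("my_table")

    # Each query against the view is planned and executed lazily, like any DataFrame
    spark.sql("SELECT label, count(*) AS n FROM my_table GROUP BY label").show()

    # Caching is a separate, explicit step
    spark.catalog.cacheTable("my_table")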