Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since different versions can often disagree). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
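In that spirit, here is a minimal sketch of the kind of self-contained snippet that makes a good reproducible example: it creates its own session, builds its input data inline rather than reading a private file, and prints the Spark version being used. The app name, column names, and aggregation are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mvce").getOrCreate()
print("Spark version:", spark.version)  # always state this in the question

# Inline sample data instead of a path only the asker can read.
df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "letter"])

# The operation being asked about, so others can run it and see the output.
df.groupBy("id").agg(F.count("letter").alias("n")).show()
```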

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16
votes
3 answers

How to Setup SPARK_HOME variable?

Following the steps of Sparkling Water from the link http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.2/0/index.html. Running in terminal : ~/InstallFile/SparklingWater/sparkling-water-2.2.0$ bin/sparkling-shell --conf…
roshan_ray
  • 197
  • 1
  • 1
  • 9
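For the SPARK_HOME question above, besides exporting the variable in a shell profile, a common approach is to set it from Python before creating a session, using the third-party findspark helper. This is a sketch only; the installation path below is a placeholder for wherever Spark is actually unpacked.

```python
import os

# Point SPARK_HOME at the local Spark installation.
# The path is an example only; adjust it to your own install location.
os.environ["SPARK_HOME"] = "/opt/spark-2.2.0-bin-hadoop2.7"

# findspark adds Spark's Python libraries to sys.path so that
# `import pyspark` works outside of spark-submit / the bundled shells.
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-home-check").getOrCreate()
print(spark.version)
spark.stop()
```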
16
votes
2 answers

Write DataFrame to mysql table using pySpark

I am attempting to insert records into a MySql table. The table contains id and name as columns. I am doing like below in a pyspark shell. name = 'tester_1' id = '103' import pandas as pd l = [id,name] df =…
User12345
  • 5,180
  • 14
  • 58
  • 105
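A minimal sketch of the JDBC write path the MySQL question is about. The URL, table name, and credentials are placeholders, and the MySQL Connector/J jar is assumed to be on the classpath (for example via `--jars` or `spark.jars`).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-write-example").getOrCreate()

# Small DataFrame with explicit column names matching the target table.
df = spark.createDataFrame([(103, "tester_1")], ["id", "name"])

# Placeholder connection details; requires the MySQL JDBC driver jar.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/testdb")
   .option("dbtable", "users")
   .option("user", "username")
   .option("password", "password")
   # driver class for Connector/J 8.x; older versions use com.mysql.jdbc.Driver
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .mode("append")  # insert rows without dropping the existing table
   .save())
```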
16
votes
2 answers

Spark - StorageLevel (DISK_ONLY vs MEMORY_AND_DISK) and Out of memory Java heap space

Lately I've been running a memory-heavy spark job and started to wonder about storage levels of spark. I persisted one of my RDDs as it was used twice using StorageLevel.MEMORY_AND_DISK. I was getting OOM Java heap space during the job. Then, when I…
Matek
  • 641
  • 5
  • 16
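The storage-level question comes down to how persist() is called; a small sketch of the two levels being compared (the dataset is illustrative, and an RDD can only be persisted with one level at a time).

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000000))

# MEMORY_AND_DISK keeps partitions in memory and spills to disk only
# when they do not fit, so it can still contribute to heap pressure.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# DISK_ONLY stores every partition on disk and reads it back on each use,
# trading speed for a much smaller memory footprint.
# rdd.persist(StorageLevel.DISK_ONLY)

print(rdd.count())  # first action materializes and caches the data
print(rdd.count())  # second action reuses the persisted partitions
```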
16
votes
6 answers

Spark Shell "Failed to Initialize Compiler" Error on a mac

I just installed spark on my new machine and get the following error after installing Java, Scala and Apache-spark using homebrew. The install process is given below: $ brew cask install java $ brew install scala $ brew install apache-spark Once…
lordlabakdas
  • 1,163
  • 5
  • 18
  • 33
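This "Failed to Initialize Compiler" error is usually an environment mismatch (the spark-shell's Scala version not supporting the installed JDK) rather than anything in user code, so checking which Java the shell picks up is the first diagnostic step; no code sketch is given here because the fix is install-specific.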
16
votes
2 answers

Does Spark preserve record order when reading in ordered files?

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just…
Jason Evans
  • 1,197
  • 1
  • 13
  • 22
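When the input order matters but no column encodes it, one common workaround (not a guarantee offered by every data source) is to capture the position at read time, for example with zipWithIndex on the underlying RDD, and sort by it explicitly later. A sketch under that assumption; the file path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ordered-read-example").getOrCreate()
sc = spark.sparkContext

# Attach each line's position within the input. zipWithIndex numbers
# elements by partition order, which for a text file follows file order.
indexed = sc.textFile("data/events.csv").zipWithIndex()

# Keep the index as an explicit column so later shuffles cannot
# silently lose the original ordering.
df = indexed.map(lambda pair: (pair[1], pair[0])).toDF(["row_id", "line"])

# Downstream consumers that need the original order sort explicitly.
df.orderBy("row_id").show(5, truncate=False)
```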
16
votes
2 answers

How to execute spark submit on amazon EMR from Lambda function?

I want to execute spark submit job on AWS EMR cluster based on the file upload event on S3. I am using AWS Lambda function to capture the event but I have no idea how to submit spark submit job on EMR cluster from Lambda function. Most of the…
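The usual pattern for this is to have the Lambda handler call the EMR API through boto3 and add a spark-submit step to an already running cluster. A sketch only: the cluster id, region, bucket, and script path are placeholders.

```python
import boto3

def handler(event, context):
    """Triggered by an S3 upload; adds a spark-submit step to an EMR cluster."""
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder: id of the running cluster
        Steps=[
            {
                "Name": "spark-job-triggered-by-s3",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    # command-runner.jar lets EMR run spark-submit as a step.
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://my-bucket/jobs/process_upload.py",  # placeholder
                        event["Records"][0]["s3"]["object"]["key"],
                    ],
                },
            }
        ],
    )
    return response["StepIds"]
```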
16
votes
3 answers

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a spark column with a new column which is a binary flag. I tried directly overwriting the column id2 but why is it not working like a inplace operation in Pandas? How to do it without using withcolumn() to create new column and…
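In Spark, withColumn() with an existing column name replaces that column rather than adding a second one, which is the closest equivalent to an "in place" Pandas assignment. The column names below follow the question; the flag logic is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("overwrite-column-example").getOrCreate()

df = spark.createDataFrame([(1, 10), (2, 0), (3, 7)], ["id", "id2"])

# withColumn with an existing name *replaces* that column, so no temporary
# column or rename step is needed; here id2 becomes a 0/1 flag.
flagged = df.withColumn("id2", F.when(F.col("id2") > 0, 1).otherwise(0))
flagged.show()
```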
16
votes
1 answer

Why agg() in PySpark is only able to summarize one column of a DataFrame at a time?

For the below dataframe df = spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)], schema=['name','High']) When I try to find min & max I am only getting min value in…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
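The agg() limitation in this question typically comes from passing a Python dict, whose duplicate keys collapse to one entry; passing several column expressions in a single agg() call avoids it. A sketch using the DataFrame from the excerpt.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-example").getOrCreate()

df = spark.createDataFrame([("Alice", 4.300), ("Bob", 7.677)], ["name", "High"])

# A dict like {"High": "min", "High": "max"} keeps only the last entry
# because Python dict keys must be unique. Separate column expressions
# let one agg() call return both summaries at once.
df.agg(F.min("High").alias("min_high"),
       F.max("High").alias("max_high")).show()
```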
16
votes
5 answers

Retain keys with null values while writing JSON in spark

I am trying to write a JSON file using spark. There are some keys that have null as value. These show up just fine in the DataSet, but when I write the file, the keys get dropped. How do I ensure they are retained? code to write the…
Vaishak Suresh
  • 5,735
  • 10
  • 41
  • 66
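Older Spark versions drop keys whose value is null when writing JSON; Spark 3.0 added an ignoreNullFields option that keeps them. A minimal sketch assuming Spark 3.0+; the sample data and output path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-nulls-example").getOrCreate()

df = spark.createDataFrame([("a", None), ("b", 2)], "key string, value int")

# By default the JSON writer omits fields whose value is null;
# Spark 3.0+ exposes ignoreNullFields to retain them in the output.
(df.write
   .option("ignoreNullFields", "false")
   .mode("overwrite")
   .json("/tmp/with_nulls"))  # placeholder output path
```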
16
votes
6 answers

Apache spark error: not found: value sqlContext

I am trying to set up spark in Windows 10. Initially, I faced this error while starting and the solution in the link helped. Now I am still not able to run import sqlContext.sql as it still throws me an…
SoakingHummer
  • 562
  • 1
  • 7
  • 25
16
votes
1 answer

How to list all tables in database using Spark SQL?

I have a SparkSQL connection to an external database: from pyspark.sql import SparkSession spark = SparkSession \ .builder \ .appName("Python Spark SQL basic example") \ .getOrCreate() If I know the name of a table, it's easy to…
Abe
  • 22,738
  • 26
  • 82
  • 111
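Tables registered in Spark's own catalog can be listed through spark.catalog or plain SQL; for an external database reached over JDBC, one common approach is to query its metadata tables instead. A sketch with placeholder connection details.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("list-tables-example").getOrCreate()

# Tables known to Spark's catalog (current database by default).
for table in spark.catalog.listTables():
    print(table.name, table.isTemporary)

# Equivalent SQL form.
spark.sql("SHOW TABLES").show()

# For an external MySQL-style database, query its information_schema
# over JDBC; URL, credentials, and driver jar are placeholders.
tables = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/information_schema")
          .option("dbtable", "tables")
          .option("user", "username")
          .option("password", "password")
          .load())
tables.select("table_name").show(truncate=False)
```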
16
votes
3 answers

How to read gz compressed file by pyspark

I have line data in .gz compressed format. I have to read it in pyspark Following is the code snippet rdd = sc.textFile("data/label.gz").map(func) But I could not read the above file successfully. How do I read gz compressed file. I have found a…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
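textFile() reads gzip-compressed text transparently based on the .gz extension, so the call in the question is the right shape; the usual catch is that a single .gz file is not splittable and therefore lands in one partition. A sketch using the path from the excerpt.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-gz-example").getOrCreate()
sc = spark.sparkContext

# textFile decompresses .gz input transparently based on the extension.
rdd = sc.textFile("data/label.gz")
print(rdd.take(5))

# A single gzip file is not splittable, so it arrives as one partition;
# repartition after reading if downstream work needs parallelism.
rdd = rdd.repartition(8)

# The DataFrame reader handles the same compression transparently too.
df = spark.read.text("data/label.gz")
df.show(5, truncate=False)
```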
16
votes
1 answer

PySpark: when function with multiple outputs

I am trying to use a "chained when" function. In other words, I'd like to get more than two outputs. I tried using the same logic of the concatenate IF function in Excel: df.withColumn("device_id",…
Fede
  • 173
  • 1
  • 1
  • 6
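when() calls chain, so more than two outcomes are written as when(...).when(...).otherwise(...), much like nested IFs in Excel. The column names and device mapping below are illustrative, not taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chained-when-example").getOrCreate()

df = spark.createDataFrame(
    [("ios", 1), ("android", 2), ("web", 3)], ["platform", "id"]
)

# Each additional when() adds another branch; otherwise() supplies the
# value when no condition matches.
df = df.withColumn(
    "device_id",
    F.when(F.col("platform") == "ios", "apple")
     .when(F.col("platform") == "android", "google")
     .otherwise("other"),
)
df.show()
```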
16
votes
1 answer

How to interpret results of Spark OneHotEncoder

I read the OHE entry from Spark docs, One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to…
Maria
  • 195
  • 1
  • 11
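The encoder's output is a SparseVector whose length is the number of categories minus one, because the last category is dropped by default, which is what usually makes the results look odd. A small sketch assuming the Spark 3.x API (where OneHotEncoder is an Estimator); the sample data is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer

spark = SparkSession.builder.appName("ohe-example").getOrCreate()

df = spark.createDataFrame(
    [("a",), ("a",), ("a",), ("b",), ("b",), ("c",)], ["category"]
)

# StringIndexer assigns indices by descending frequency: here a=0, b=1, c=2.
indexer = StringIndexer(inputCol="category", outputCol="category_idx").fit(df)
indexed = indexer.transform(df)

# In Spark 3.x OneHotEncoder is an Estimator and must be fit first.
# With the default dropLast=True the vectors have length 2:
# a -> (2,[0],[1.0]), b -> (2,[1],[1.0]), c -> (2,[],[]).
encoder = OneHotEncoder(inputCols=["category_idx"],
                        outputCols=["category_vec"]).fit(indexed)
encoder.transform(indexed).show(truncate=False)

# The indexer's labels give the position-to-category mapping.
print(indexer.labels)  # e.g. ['a', 'b', 'c']
```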
16
votes
1 answer

spark dataframe groupby multiple times

val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"), (2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")). toDF("col1", "col2", "col3")) So I have a spark dataframe with 3…
Ramesh
  • 1,563
  • 9
  • 25
  • 39