Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since different versions can often disagree). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
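In that spirit, here is a minimal sketch of the kind of self-contained snippet that makes a good reproducible example: it creates its own session, builds its input data inline rather than reading a private file, and prints the Spark version being used. The app name, column names, and aggregation are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mvce").getOrCreate()
print("Spark version:", spark.version)  # always state this in the question

# Inline sample data instead of a path only the asker can read.
df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["id", "letter"])

# The operation being asked about, so others can run it and see the output.
df.groupBy("id").agg(F.count("letter").alias("n")).show()
```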

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16
votes
3 answers

How to Setup SPARK_HOME variable?

Following the steps of Sparkling Water from the link http://h2o-release.s3.amazonaws.com/sparkling-water/rel-2.2/0/index.html. Running in terminal : ~/InstallFile/SparklingWater/sparkling-water-2.2.0$ bin/sparkling-shell --conf…
roshan_ray
  • 197
  • 1
  • 1
  • 9
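For the SPARK_HOME question above, besides exporting the variable in a shell profile, a common approach is to set it from Python before creating a session, using the third-party findspark helper. This is a sketch only; the installation path below is a placeholder for wherever Spark is actually unpacked.

```python
import os

# Point SPARK_HOME at the local Spark installation.
# The path is an example only; adjust it to your own install location.
os.environ["SPARK_HOME"] = "/opt/spark-2.2.0-bin-hadoop2.7"

# findspark adds Spark's Python libraries to sys.path so that
# `import pyspark` works outside of spark-submit / the bundled shells.
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-home-check").getOrCreate()
print(spark.version)
spark.stop()
```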
16
votes
2 answers

Write DataFrame to mysql table using pySpark

I am attempting to insert records into a MySql table. The table contains id and name as columns. I am doing like below in a pyspark shell. name = 'tester_1' id = '103' import pandas as pd l = [id,name] df =…
User12345
  • 5,180
  • 14
  • 58
  • 105
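A minimal sketch of the JDBC write path the MySQL question is about. The URL, table name, and credentials are placeholders, and the MySQL Connector/J jar is assumed to be on the classpath (for example via `--jars` or `spark.jars`).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mysql-write-example").getOrCreate()

# Small DataFrame with explicit column names matching the target table.
df = spark.createDataFrame([(103, "tester_1")], ["id", "name"])

# Placeholder connection details; requires the MySQL JDBC driver jar.
(df.write
   .format("jdbc")
   .option("url", "jdbc:mysql://localhost:3306/testdb")
   .option("dbtable", "users")
   .option("user", "username")
   .option("password", "password")
   # driver class for Connector/J 8.x; older versions use com.mysql.jdbc.Driver
   .option("driver", "com.mysql.cj.jdbc.Driver")
   .mode("append")  # insert rows without dropping the existing table
   .save())
```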
16
votes
2 answers

Spark - StorageLevel (DISK_ONLY vs MEMORY_AND_DISK) and Out of memory Java heap space

Lately I've been running a memory-heavy spark job and started to wonder about storage levels of spark. I persisted one of my RDDs as it was used twice using StorageLevel.MEMORY_AND_DISK. I was getting OOM Java heap space during the job. Then, when I…
Matek
  • 641
  • 5
  • 16
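The storage-level question comes down to how persist() is called; a small sketch of the two levels being compared (the dataset is illustrative, and an RDD can only be persisted with one level at a time).

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-level-example").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000000))

# MEMORY_AND_DISK keeps partitions in memory and spills to disk only
# when they do not fit, so it can still contribute to heap pressure.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

# DISK_ONLY stores every partition on disk and reads it back on each use,
# trading speed for a much smaller memory footprint.
# rdd.persist(StorageLevel.DISK_ONLY)

print(rdd.count())  # first action materializes and caches the data
print(rdd.count())  # second action reuses the persisted partitions
```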
16
votes
6 answers

Spark Shell "Failed to Initialize Compiler" Error on a mac

I just installed spark on my new machine and get the following error after installing Java, Scala and Apache-spark using homebrew. The install process is given below: $ brew cask install java $ brew install scala $ brew install apache-spark Once…
lordlabakdas
  • 1,163
  • 5
  • 18
  • 33
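This "Failed to Initialize Compiler" error is usually an environment mismatch (the spark-shell's Scala version not supporting the installed JDK) rather than anything in user code, so checking which Java the shell picks up is the first diagnostic step; no code sketch is given here because the fix is install-specific.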
16
votes
2 answers

Does Spark preserve record order when reading in ordered files?

I'm using Spark to read in records (in this case in csv files) and process them. The files are already in some order, but this order isn't reflected by any column (think of it as a time series, but without any timestamp column -- each row is just…
Jason Evans
  • 1,197
  • 1
  • 13
  • 22
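When the input order matters but no column encodes it, one common workaround (not a guarantee offered by every data source) is to capture the position at read time, for example with zipWithIndex on the underlying RDD, and sort by it explicitly later. A sketch under that assumption; the file path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ordered-read-example").getOrCreate()
sc = spark.sparkContext

# Attach each line's position within the input. zipWithIndex numbers
# elements by partition order, which for a text file follows file order.
indexed = sc.textFile("data/events.csv").zipWithIndex()

# Keep the index as an explicit column so later shuffles cannot
# silently lose the original ordering.
df = indexed.map(lambda pair: (pair[1], pair[0])).toDF(["row_id", "line"])

# Downstream consumers that need the original order sort explicitly.
df.orderBy("row_id").show(5, truncate=False)
```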
16
votes
2 answers

How to execute spark submit on amazon EMR from Lambda function?

I want to execute spark submit job on AWS EMR cluster based on the file upload event on S3. I am using AWS Lambda function to capture the event but I have no idea how to submit spark submit job on EMR cluster from Lambda function. Most of the…
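The usual pattern for this is to have the Lambda handler call the EMR API through boto3 and add a spark-submit step to an already running cluster. A sketch only: the cluster id, region, bucket, and script path are placeholders.

```python
import boto3

def handler(event, context):
    """Triggered by an S3 upload; adds a spark-submit step to an EMR cluster."""
    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # placeholder: id of the running cluster
        Steps=[
            {
                "Name": "spark-job-triggered-by-s3",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    # command-runner.jar lets EMR run spark-submit as a step.
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode", "cluster",
                        "s3://my-bucket/jobs/process_upload.py",  # placeholder
                        event["Records"][0]["s3"]["object"]["key"],
                    ],
                },
            }
        ],
    )
    return response["StepIds"]
```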
16
votes
3 answers

How to overwrite entire existing column in Spark dataframe with new column?

I want to overwrite a spark column with a new column which is a binary flag. I tried directly overwriting the column id2 but why is it not working like a inplace operation in Pandas? How to do it without using withcolumn() to create new column and…
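In Spark, withColumn() with an existing column name replaces that column rather than adding a second one, which is the closest equivalent to an "in place" Pandas assignment. The column names below follow the question; the flag logic is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("overwrite-column-example").getOrCreate()

df = spark.createDataFrame([(1, 10), (2, 0), (3, 7)], ["id", "id2"])

# withColumn with an existing name *replaces* that column, so no temporary
# column or rename step is needed; here id2 becomes a 0/1 flag.
flagged = df.withColumn("id2", F.when(F.col("id2") > 0, 1).otherwise(0))
flagged.show()
```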
16
votes
1 answer

Why agg() in PySpark is only able to summarize one column of a DataFrame at a time?

For the below dataframe df = spark.createDataFrame(data=[('Alice',4.300),('Bob',7.677)], schema=['name','High']) When I try to find min & max I am only getting min value in…
GeorgeOfTheRF
  • 8,244
  • 23
  • 57
  • 80
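The agg() limitation in this question typically comes from passing a Python dict, whose duplicate keys collapse to one entry; passing several column expressions in a single agg() call avoids it. A sketch using the DataFrame from the excerpt.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-example").getOrCreate()

df = spark.createDataFrame([("Alice", 4.300), ("Bob", 7.677)], ["name", "High"])

# A dict like {"High": "min", "High": "max"} keeps only the last entry
# because Python dict keys must be unique. Separate column expressions
# let one agg() call return both summaries at once.
df.agg(F.min("High").alias("min_high"),
       F.max("High").alias("max_high")).show()
```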
16
votes
5 answers

Retain keys with null values while writing JSON in spark

I am trying to write a JSON file using spark. There are some keys that have null as value. These show up just fine in the DataSet, but when I write the file, the keys get dropped. How do I ensure they are retained? code to write the…
Vaishak Suresh
  • 5,735
  • 10
  • 41
  • 66
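Older Spark versions drop keys whose value is null when writing JSON; Spark 3.0 added an ignoreNullFields option that keeps them. A minimal sketch assuming Spark 3.0+; the sample data and output path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-nulls-example").getOrCreate()

df = spark.createDataFrame([("a", None), ("b", 2)], "key string, value int")

# By default the JSON writer omits fields whose value is null;
# Spark 3.0+ exposes ignoreNullFields to retain them in the output.
(df.write
   .option("ignoreNullFields", "false")
   .mode("overwrite")
   .json("/tmp/with_nulls"))  # placeholder output path
```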
16
votes
6 answers

Apache spark error: not found: value sqlContext

I am trying to set up spark in Windows 10. Initially, I faced this error while starting and the solution in the link helped. Now I am still not able to run import sqlContext.sql as it still throws me an…
SoakingHummer
  • 562
  • 1
  • 7
  • 25
16
votes
1 answer

How to list all tables in database using Spark SQL?

I have a SparkSQL connection to an external database: from pyspark.sql import SparkSession spark = SparkSession \ .builder \ .appName("Python Spark SQL basic example") \ .getOrCreate() If I know the name of a table, it's easy to…
Abe
  • 22,738
  • 26
  • 82
  • 111
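Tables registered in Spark's own catalog can be listed through spark.catalog or plain SQL; for an external database reached over JDBC, one common approach is to query its metadata tables instead. A sketch with placeholder connection details.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("list-tables-example").getOrCreate()

# Tables known to Spark's catalog (current database by default).
for table in spark.catalog.listTables():
    print(table.name, table.isTemporary)

# Equivalent SQL form.
spark.sql("SHOW TABLES").show()

# For an external MySQL-style database, query its information_schema
# over JDBC; URL, credentials, and driver jar are placeholders.
tables = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost:3306/information_schema")
          .option("dbtable", "tables")
          .option("user", "username")
          .option("password", "password")
          .load())
tables.select("table_name").show(truncate=False)
```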
16
votes
3 answers

How to read gz compressed file by pyspark

I have line data in .gz compressed format. I have to read it in pyspark Following is the code snippet rdd = sc.textFile("data/label.gz").map(func) But I could not read the above file successfully. How do I read gz compressed file. I have found a…
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
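textFile() reads gzip-compressed text transparently based on the .gz extension, so the call in the question is the right shape; the usual catch is that a single .gz file is not splittable and therefore lands in one partition. A sketch using the path from the excerpt.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-gz-example").getOrCreate()
sc = spark.sparkContext

# textFile decompresses .gz input transparently based on the extension.
rdd = sc.textFile("data/label.gz")
print(rdd.take(5))

# A single gzip file is not splittable, so it arrives as one partition;
# repartition after reading if downstream work needs parallelism.
rdd = rdd.repartition(8)

# The DataFrame reader handles the same compression transparently too.
df = spark.read.text("data/label.gz")
df.show(5, truncate=False)
```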
16
votes
1 answer

PySpark: when function with multiple outputs

I am trying to use a "chained when" function. In other words, I'd like to get more than two outputs. I tried using the same logic of the concatenate IF function in Excel: df.withColumn("device_id",…
Fede
  • 173
  • 1
  • 1
  • 6
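when() calls chain, so more than two outcomes are written as when(...).when(...).otherwise(...), much like nested IFs in Excel. The column names and device mapping below are illustrative, not taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("chained-when-example").getOrCreate()

df = spark.createDataFrame(
    [("ios", 1), ("android", 2), ("web", 3)], ["platform", "id"]
)

# Each additional when() adds another branch; otherwise() supplies the
# value when no condition matches.
df = df.withColumn(
    "device_id",
    F.when(F.col("platform") == "ios", "apple")
     .when(F.col("platform") == "android", "google")
     .otherwise("other"),
)
df.show()
```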
16
votes
1 answer

How to interpret results of Spark OneHotEncoder

I read the OHE entry from Spark docs, One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to…
Maria
  • 195
  • 1
  • 11
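The encoder's output is a SparseVector whose length is the number of categories minus one, because the last category is dropped by default, which is what usually makes the results look odd. A small sketch assuming the Spark 3.x API (where OneHotEncoder is an Estimator); the sample data is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer

spark = SparkSession.builder.appName("ohe-example").getOrCreate()

df = spark.createDataFrame(
    [("a",), ("a",), ("a",), ("b",), ("b",), ("c",)], ["category"]
)

# StringIndexer assigns indices by descending frequency: here a=0, b=1, c=2.
indexer = StringIndexer(inputCol="category", outputCol="category_idx").fit(df)
indexed = indexer.transform(df)

# In Spark 3.x OneHotEncoder is an Estimator and must be fit first.
# With the default dropLast=True the vectors have length 2:
# a -> (2,[0],[1.0]), b -> (2,[1],[1.0]), c -> (2,[],[]).
encoder = OneHotEncoder(inputCols=["category_idx"],
                        outputCols=["category_vec"]).fit(indexed)
encoder.transform(indexed).show(truncate=False)

# The indexer's labels give the position-to-category mapping.
print(indexer.labels)  # e.g. ['a', 'b', 'c']
```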
16
votes
1 answer

spark dataframe groupby multiple times

val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"), (2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")). toDF("col1", "col2", "col3")) So I have a spark dataframe with 3…
Ramesh
  • 1,563
  • 9
  • 25
  • 39