Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive analysis as well as iterative algorithms in machine learning and graph computing.
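As a minimal PySpark sketch of this load-once, query-repeatedly pattern (the file path and column names below are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory (path and column names are hypothetical).
events = spark.read.parquet("/data/events.parquet").cache()

# Repeated queries reuse the cached data instead of re-reading it from disk.
events.groupBy("event_date").count().show()
events.filter(F.col("user_id") == 42).count()
```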

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
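For instance, a reproducible question can build a tiny DataFrame inline (the names below are made up) so that anyone can re-run it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A small, self-contained DataFrame that anyone can recreate as-is.
df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, "a", 30.0)],
    ["id", "group", "value"],
)
df.show()
```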

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16 votes, 3 answers

How can I sum multiple columns in a spark dataframe in pyspark?

I've got a list of column names I want to sum: columns = ['col1','col2','col3']. How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.) Dataframe with result I want: col1 …
asked by Manrique
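A commonly suggested approach for the question above (a sketch, assuming a DataFrame named df with the numeric columns from the excerpt) is to fold the column list into a single sum expression:

```python
from functools import reduce
from pyspark.sql import functions as F

columns = ['col1', 'col2', 'col3']

# Fold the list into a single col1 + col2 + col3 expression,
# so changing the list automatically changes the sum.
total_expr = reduce(lambda a, b: a + b, [F.col(c) for c in columns])
df = df.withColumn('total', total_expr)
```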
16 votes, 1 answer

Difference between sc.textFile and spark.read.text in Spark

I am trying to read a simple text file into a Spark RDD and I see that there are two ways of doing so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
textRDD1 =…
asked by Calcutta
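For reference, the two calls return different abstractions; a minimal sketch using a hypothetical sample.txt:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# RDD API: each element is a plain Python string, one per line of the file.
textRDD1 = sc.textFile("sample.txt")

# DataFrame API: one Row per line, with a single string column named "value".
textDF = spark.read.text("sample.txt")
textDF.printSchema()   # root |-- value: string (nullable = true)
```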
16 votes, 4 answers

Remove rows from dataframe based on condition in pyspark

I have one dataframe with two columns:
+--------+-----+
| col1   | col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
I would like to create a new dataframe which will take only rows where…
asked by LDropl
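A typical way to do this (a sketch; the column name comes from the excerpt, the threshold is arbitrary) is filter/where with a column condition:

```python
from pyspark.sql import functions as F

# Keep only the rows that satisfy the condition; everything else is dropped.
kept = df.filter(F.col("col1") > 10)

# Equivalent spelling using an SQL expression string.
kept = df.where("col1 > 10")
```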
16 votes, 1 answer

Pyspark dataframe OrderBy list of columns

I am trying to use the OrderBy function on a pyspark dataframe before I write it to csv, but I am not sure how to use OrderBy if I have a list of columns. Code:
Cols = ['col1','col2','col3']
df = df.OrderBy(cols,ascending=False)
asked by Jack
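In PySpark the method is spelled orderBy (or sort) and it accepts a list of column names; a sketch under the excerpt's assumptions:

```python
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']

# Sort by every column in the list, all descending (note the lowercase orderBy).
df = df.orderBy(cols, ascending=False)

# Or control the direction per column.
df = df.orderBy(*[F.col(c).desc() for c in cols])
```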
16 votes, 10 answers

py4j.protocol.Py4JJavaError occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

I installed apache-spark and pyspark on my machine (Ubuntu), and in Pycharm I also updated the environment variables (e.g. spark_home, pyspark_python). I'm trying to do:
import os, sys
os.environ['SPARK_HOME'] =…
asked by Saeid SOHEILY KHAH
16 votes, 3 answers

Keep only duplicates from a DataFrame regarding some field

I have this spark DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE| …
asked by Anneso
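One commonly used approach (a sketch, assuming the duplicates are to be judged on the ID column from the excerpt) is a window count followed by a filter:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID")

# Count the rows per ID and keep only IDs that appear more than once.
duplicates_only = (
    df.withColumn("n", F.count("*").over(w))
      .filter(F.col("n") > 1)
      .drop("n")
)
```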
16 votes, 2 answers

TypeError: Column is not iterable - How to iterate over ArrayType()?

Consider the following DataFrame:
+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+
Which can be…
asked by pault
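A Column object cannot be looped over in driver-side Python; the usual fix (a sketch based on the excerpt's schema) is to explode the array into one row per element:

```python
from pyspark.sql import functions as F

# explode() turns each array element into its own row, so no Python-side loop is needed.
exploded = df.select("type", F.explode("names").alias("name"))
exploded.show()
```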
16 votes, 3 answers

How to sum the values of a column in pyspark dataframe

I am working in Pyspark and I have a data frame with the following columns.
Q1 = spark.read.csv("Q1final.csv", header=True, inferSchema=True)
Q1.printSchema()
root
 |-- index_date: integer (nullable = true)
 |-- item_id: integer (nullable =…
asked by Lauren
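A typical aggregation (a sketch; item_id is just one of the columns shown in the excerpt, since the target column is not visible) looks like:

```python
from pyspark.sql import functions as F

# Aggregate the whole DataFrame down to one row containing the column's sum.
total = Q1.agg(F.sum("item_id").alias("total")).collect()[0]["total"]
print(total)
```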
16 votes, 2 answers

Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with same key are shuffled to same executor so joining is more efficient (if one has shuffle related…
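For DataFrames, the closest analogue to a custom RDD partitioner (a sketch with hypothetical frame and column names; the partition and bucket counts are arbitrary) is to repartition both sides on the join keys, or to bucket the tables when writing:

```python
# Repartition both sides on the join key so matching keys end up in the same partitions.
left = df_a.repartition(200, "key")
right = df_b.repartition(200, "key")
joined = left.join(right, on="key")

# Alternatively, persist bucketed tables so that later joins on "key" can skip the shuffle.
df_a.write.bucketBy(200, "key").sortBy("key").saveAsTable("a_bucketed")
```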
16 votes, 1 answer

How to load streaming data from Amazon SQS?

I use Spark 2.2.0. How can I feed Amazon SQS stream to spark structured stream using pyspark? This question tries to answer it for a non structured streaming and for scala by creating a custom receiver. Is something similar possible in pyspark?…
16 votes, 3 answers

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one machine "c3.8xlarge", after reading several resources, I understood that I have to allow decent amount of memory off-heap because I am using pyspark, so I have configured the cluster as follow: One…
asked by Rami
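Not an answer to the off-heap sizing itself, but when toPandas() is the end goal, Arrow-based conversion is often suggested (a sketch; the config key applies to Spark 2.x and the DataFrame name is hypothetical):

```python
# Arrow-based conversion (Spark 2.3+) makes toPandas() faster and lighter on driver memory.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# toPandas() still materializes everything on the driver, so keep the result bounded.
pdf = df.limit(100000).toPandas()
```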
16 votes, 5 answers

Spark: get number of cluster cores programmatically

I run my spark application in yarn cluster. In my code I use number available cores of queue for creating partitions on my dataset: Dataset ds = ... ds.coalesce(config.getNumberOfCores()); My question: how can I get number available cores of queue…
asked by Rougher
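One frequently suggested starting point (a PySpark sketch, while the excerpt's code is Java; note that defaultParallelism reflects the executors actually running, not the YARN queue's full capacity) is:

```python
sc = spark.sparkContext

# Total parallelism of the executors currently registered with this application.
cores = sc.defaultParallelism
print(cores)

ds = ds.coalesce(cores)
```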
16 votes, 3 answers

Why is dataset.count causing a shuffle! (spark 2.2)

Here is my dataframe: The underlying RDD has 2 partitions When I do a df.count, the DAG produced is When I do a df.rdd.count, the DAG produced is: Ques: Count is an action in spark, the official definition is ‘Returns the number of rows in the…
asked by human
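For reference, a sketch of the two calls being compared: df.count() is planned as a SQL aggregate, so each partition emits a partial count and a single-partition exchange combines them (the shuffle visible in the DAG), while df.rdd.count() sums per-partition sizes on the driver.

```python
# DataFrame count: partial counts per partition plus a single-partition exchange to combine them.
n1 = df.count()

# RDD count: per-partition sizes are returned to the driver and summed there, with no exchange.
n2 = df.rdd.count()

assert n1 == n2
```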
16 votes, 1 answer

pyspark Window.partitionBy vs groupBy

Lets say I have a dataset with around 2.1 billion records. It's a dataset with customer information and I want to know how many times they did something. So I should group on the ID and sum one column (It has 0 and 1 values where the 1 indicates an…
asked by Anton Mulder
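The two differ in output shape rather than in the amount of shuffling (a sketch with hypothetical column names): groupBy collapses to one row per customer, while a window sum keeps every row and attaches the per-customer total to it.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupBy: one output row per customer.
totals = df.groupBy("customer_id").agg(F.sum("flag").alias("n_events"))

# Window: every original row is kept and annotated with the per-customer total.
w = Window.partitionBy("customer_id")
annotated = df.withColumn("n_events", F.sum("flag").over(w))
```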
16 votes, 2 answers

Spark is only using one worker machine when more are available

I'm trying to parallelize a machine learning prediction task via Spark. I've used Spark successfully a number of times before on other tasks and have faced no issues with parallelization before. In this particular task, my cluster has 4 workers. I'm…
asked by Ansari