Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain workloads.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive analysis as well as iterative algorithms in machine learning and graph computing.
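As a minimal PySpark sketch of this load-once, query-repeatedly pattern (the file path and column names below are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory (path and column names are hypothetical).
events = spark.read.parquet("/data/events.parquet").cache()

# Repeated queries reuse the cached data instead of re-reading it from disk.
events.groupBy("event_date").count().show()
events.filter(F.col("user_id") == 42).count()
```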

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
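For instance, a reproducible question can build a tiny DataFrame inline (the names below are made up) so that anyone can re-run it:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A small, self-contained DataFrame that anyone can recreate as-is.
df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, "a", 30.0)],
    ["id", "group", "value"],
)
df.show()
```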

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
16 votes, 3 answers

How can I sum multiple columns in a spark dataframe in pyspark?

I've got a list of column names I want to sum: columns = ['col1','col2','col3']. How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.) Dataframe with result I want: col1 …
asked by Manrique
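A commonly suggested approach for the question above (a sketch, assuming a DataFrame named df with the numeric columns from the excerpt) is to fold the column list into a single sum expression:

```python
from functools import reduce
from pyspark.sql import functions as F

columns = ['col1', 'col2', 'col3']

# Fold the list into a single col1 + col2 + col3 expression,
# so changing the list automatically changes the sum.
total_expr = reduce(lambda a, b: a + b, [F.col(c) for c in columns])
df = df.withColumn('total', total_expr)
```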
16 votes, 1 answer

Difference between sc.textFile and spark.read.text in Spark

I am trying to read a simple text file into a Spark RDD and I see that there are two ways of doing so:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
textRDD1 =…
asked by Calcutta
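For reference, the two calls return different abstractions; a minimal sketch using a hypothetical sample.txt:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

# RDD API: each element is a plain Python string, one per line of the file.
textRDD1 = sc.textFile("sample.txt")

# DataFrame API: one Row per line, with a single string column named "value".
textDF = spark.read.text("sample.txt")
textDF.printSchema()   # root |-- value: string (nullable = true)
```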
16 votes, 4 answers

Remove rows from dataframe based on condition in pyspark

I have one dataframe with two columns:
+--------+-----+
| col1   | col2|
+--------+-----+
|22      | 12.2|
|1       |  2.1|
|5       | 52.1|
|2       | 62.9|
|77      | 33.3|
I would like to create a new dataframe which will take only rows where…
asked by LDropl
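A typical way to do this (a sketch; the column name comes from the excerpt, the threshold is arbitrary) is filter/where with a column condition:

```python
from pyspark.sql import functions as F

# Keep only the rows that satisfy the condition; everything else is dropped.
kept = df.filter(F.col("col1") > 10)

# Equivalent spelling using an SQL expression string.
kept = df.where("col1 > 10")
```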
16 votes, 1 answer

Pyspark dataframe OrderBy list of columns

I am trying to use the OrderBy function on a pyspark dataframe before I write it to csv, but I am not sure how to use OrderBy if I have a list of columns. Code:
Cols = ['col1','col2','col3']
df = df.OrderBy(cols,ascending=False)
asked by Jack
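In PySpark the method is spelled orderBy (or sort) and it accepts a list of column names; a sketch under the excerpt's assumptions:

```python
from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']

# Sort by every column in the list, all descending (note the lowercase orderBy).
df = df.orderBy(cols, ascending=False)

# Or control the direction per column.
df = df.orderBy(*[F.col(c).desc() for c in cols])
```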
16 votes, 10 answers

py4j.protocol.Py4JJavaError occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe

I installed apache-spark and pyspark on my machine (Ubuntu), and in Pycharm I also updated the environment variables (e.g. spark_home, pyspark_python). I'm trying to do:
import os, sys
os.environ['SPARK_HOME'] =…
asked by Saeid SOHEILY KHAH
16 votes, 3 answers

Keep only duplicates from a DataFrame regarding some field

I have this spark DataFrame:
+---+-----+------+----+------------+------------+
| ID|  ID2|Number|Name|Opening_Hour|Closing_Hour|
+---+-----+------+----+------------+------------+
|ALT|  QWA|     6|null|    08:59:00|    23:30:00|
|ALT|AUTRE| …
asked by Anneso
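One commonly used approach (a sketch, assuming the duplicates are to be judged on the ID column from the excerpt) is a window count followed by a filter:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID")

# Count the rows per ID and keep only IDs that appear more than once.
duplicates_only = (
    df.withColumn("n", F.count("*").over(w))
      .filter(F.col("n") > 1)
      .drop("n")
)
```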
16 votes, 2 answers

TypeError: Column is not iterable - How to iterate over ArrayType()?

Consider the following DataFrame:
+------+-----------------------+
|type  |names                  |
+------+-----------------------+
|person|[john, sam, jane]      |
|pet   |[whiskers, rover, fido]|
+------+-----------------------+
Which can be…
asked by pault
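A Column object cannot be looped over in driver-side Python; the usual fix (a sketch based on the excerpt's schema) is to explode the array into one row per element:

```python
from pyspark.sql import functions as F

# explode() turns each array element into its own row, so no Python-side loop is needed.
exploded = df.select("type", F.explode("names").alias("name"))
exploded.show()
```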
16 votes, 3 answers

How to sum the values of a column in pyspark dataframe

I am working in Pyspark and I have a data frame with the following columns.
Q1 = spark.read.csv("Q1final.csv", header=True, inferSchema=True)
Q1.printSchema()
root
 |-- index_date: integer (nullable = true)
 |-- item_id: integer (nullable =…
asked by Lauren
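A typical aggregation (a sketch; item_id is just one of the columns shown in the excerpt, since the target column is not visible) looks like:

```python
from pyspark.sql import functions as F

# Aggregate the whole DataFrame down to one row containing the column's sum.
total = Q1.agg(F.sum("item_id").alias("total")).collect()[0]["total"]
print(total)
```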
16 votes, 2 answers

Partition data for efficient joining for Spark dataframe/dataset

I need to join many DataFrames together based on some shared key columns. For a key-value RDD, one can specify a partitioner so that data points with same key are shuffled to same executor so joining is more efficient (if one has shuffle related…
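For DataFrames, the closest analogue to a custom RDD partitioner (a sketch with hypothetical frame and column names; the partition and bucket counts are arbitrary) is to repartition both sides on the join keys, or to bucket the tables when writing:

```python
# Repartition both sides on the join key so matching keys end up in the same partitions.
left = df_a.repartition(200, "key")
right = df_b.repartition(200, "key")
joined = left.join(right, on="key")

# Alternatively, persist bucketed tables so that later joins on "key" can skip the shuffle.
df_a.write.bucketBy(200, "key").sortBy("key").saveAsTable("a_bucketed")
```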
16 votes, 1 answer

How to load streaming data from Amazon SQS?

I use Spark 2.2.0. How can I feed Amazon SQS stream to spark structured stream using pyspark? This question tries to answer it for a non structured streaming and for scala by creating a custom receiver. Is something similar possible in pyspark?…
16 votes, 3 answers

collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one machine "c3.8xlarge", after reading several resources, I understood that I have to allow decent amount of memory off-heap because I am using pyspark, so I have configured the cluster as follow: One…
asked by Rami
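Not an answer to the off-heap sizing itself, but when toPandas() is the end goal, Arrow-based conversion is often suggested (a sketch; the config key applies to Spark 2.x and the DataFrame name is hypothetical):

```python
# Arrow-based conversion (Spark 2.3+) makes toPandas() faster and lighter on driver memory.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# toPandas() still materializes everything on the driver, so keep the result bounded.
pdf = df.limit(100000).toPandas()
```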
16 votes, 5 answers

Spark: get number of cluster cores programmatically

I run my spark application in yarn cluster. In my code I use number available cores of queue for creating partitions on my dataset: Dataset ds = ... ds.coalesce(config.getNumberOfCores()); My question: how can I get number available cores of queue…
asked by Rougher
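One frequently suggested starting point (a PySpark sketch, while the excerpt's code is Java; note that defaultParallelism reflects the executors actually running, not the YARN queue's full capacity) is:

```python
sc = spark.sparkContext

# Total parallelism of the executors currently registered with this application.
cores = sc.defaultParallelism
print(cores)

ds = ds.coalesce(cores)
```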
16 votes, 3 answers

Why is dataset.count causing a shuffle! (spark 2.2)

Here is my dataframe: The underlying RDD has 2 partitions When I do a df.count, the DAG produced is When I do a df.rdd.count, the DAG produced is: Ques: Count is an action in spark, the official definition is ‘Returns the number of rows in the…
asked by human
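For reference, a sketch of the two calls being compared: df.count() is planned as a SQL aggregate, so each partition emits a partial count and a single-partition exchange combines them (the shuffle visible in the DAG), while df.rdd.count() sums per-partition sizes on the driver.

```python
# DataFrame count: partial counts per partition plus a single-partition exchange to combine them.
n1 = df.count()

# RDD count: per-partition sizes are returned to the driver and summed there, with no exchange.
n2 = df.rdd.count()

assert n1 == n2
```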
16 votes, 1 answer

pyspark Window.partitionBy vs groupBy

Lets say I have a dataset with around 2.1 billion records. It's a dataset with customer information and I want to know how many times they did something. So I should group on the ID and sum one column (It has 0 and 1 values where the 1 indicates an…
asked by Anton Mulder
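The two differ in output shape rather than in the amount of shuffling (a sketch with hypothetical column names): groupBy collapses to one row per customer, while a window sum keeps every row and attaches the per-customer total to it.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# groupBy: one output row per customer.
totals = df.groupBy("customer_id").agg(F.sum("flag").alias("n_events"))

# Window: every original row is kept and annotated with the per-customer total.
w = Window.partitionBy("customer_id")
annotated = df.withColumn("n_events", F.sum("flag").over(w))
```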
16 votes, 2 answers

Spark is only using one worker machine when more are available

I'm trying to parallelize a machine learning prediction task via Spark. I've used Spark successfully a number of times before on other tasks and have faced no issues with parallelization before. In this particular task, my cluster has 4 workers. I'm…
asked by Ansari