Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive workloads as well as iterative algorithms in machine learning or graph computing.
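For example, a minimal PySpark sketch of loading a dataset into memory and querying it repeatedly; the file path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Load a (hypothetical) dataset and keep it in memory across queries.
events = spark.read.json("hdfs:///data/events.json").cache()

# The first action materializes the cache; subsequent queries reuse it.
print(events.count())
print(events.filter(events["status"] == "ERROR").count())
events.groupBy("status").count().show()
```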

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
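As a rough illustration, a minimal Structured Streaming job using the built-in rate source (so it runs without external infrastructure) that counts rows per event-time window:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count rows per 30-second event-time window (micro-batch processing by default).
counts = stream.groupBy(window(stream["timestamp"], "30 seconds")).count()

# Print each updated result table to the console; runs until stopped.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```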

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
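For instance, in the pyspark shell, where a SparkSession is already available as spark, a quick interactive query might look like this (the file path and the age column are hypothetical):

```python
# Inside the pyspark shell, `spark` is created for you.
df = spark.read.csv("hdfs:///data/people.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age ORDER BY n DESC").show()
```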

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior can differ between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
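A reproducible example usually means a tiny DataFrame built inline instead of references to private data, the Spark version, and the actual versus expected output; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()
print(spark.version)  # always state the version you are running

# Small inline data instead of a reference to private files.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 5)],
    ["key", "value"],
)
df.groupBy("key").sum("value").show()
# Expected output: (a, 3) and (b, 5)
```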

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

81095 questions
16 votes • 7 answers

spark ssc.textFileStream is not streaming any files from directory

I am trying to execute the below code using Eclipse (with Maven conf) with 2 workers, each having 2 cores, and have also tried with spark-submit. public class StreamingWorkCount implements Serializable { public static void main(String[] args) { …
Kaushal • 3,237 • 3 • 29 • 48
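A frequent cause of this symptom is that textFileStream only picks up files that are moved or written into the monitored directory after the streaming context has started; files already present at startup are ignored. A minimal PySpark sketch of the same word-count pattern, with a hypothetical input directory:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Only files *newly* moved or written into this directory after start()
# are picked up; files already present when the job starts are ignored.
lines = ssc.textFileStream("hdfs:///incoming/text/")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```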
16 votes • 3 answers

Extract information from an `org.apache.spark.sql.Row`

I have Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect(): Array([10479,6,10], [8975,149,640], ...) I can get the individual values: scala> pixels(0)(0) res34: Any = 10479 but they are Any, not Int. How do I extract them as…
sds • 58,617 • 29 • 161 • 278
16 votes • 2 answers

Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()? For example: output = sqlContext.sql("SELECT * From people") We can use output.cache() to cache the result, but then we cannot use a sql query to deal with it. So I want…
lwwwzh • 225 • 1 • 2 • 9
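One option is to register the result as a temporary view and cache it through the catalog, so later SQL statements run against the cached data; a sketch using the modern SparkSession API, with the people table taken from the question and a hypothetical age column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sql").getOrCreate()

# Assumes a "people" table or view already exists, as in the question.
output = spark.sql("SELECT * FROM people")
output.createOrReplaceTempView("people_cached")

# Cache the view itself; SQL queries against it reuse the cached data.
spark.catalog.cacheTable("people_cached")

spark.sql("SELECT COUNT(*) FROM people_cached").show()
spark.sql("SELECT * FROM people_cached WHERE age > 30").show()
```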
16 votes • 2 answers

How to configure Apache Spark random worker ports for tight firewalls?

I am using Apache Spark to run machine learning algorithms and other big data tasks. Previously, I was using spark cluster standalone mode running spark master and worker on the same machine. Now, I added multiple worker machines and due to a tight…
Isma Khan • 161 • 1 • 1 • 6
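The usual approach is to pin the ports that Spark otherwise chooses at random and open only those in the firewall; a sketch of the relevant settings (the port numbers are arbitrary examples, and in standalone mode the master/worker ports set via spark-env.sh, such as SPARK_WORKER_PORT, may need fixing as well):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fixed-ports")
         # Ports that are random by default; pin them so the firewall
         # only needs a small, known range opened.
         .config("spark.driver.port", "40000")
         .config("spark.blockManager.port", "40010")
         # If a pinned port is busy, Spark retries port+1, port+2, ...
         .config("spark.port.maxRetries", "16")
         .getOrCreate())
```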
16 votes • 1 answer

Apache Spark - Dealing with Sliding Windows on Temporal RDDs

I've been working quite a lot with Apache Spark the last few months but now I have received a pretty difficult task: to compute average/minimum/maximum etcetera on a sliding window over a paired RDD where the Key component is a date tag and the…
Johan S • 3,531 • 6 • 35 • 63
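If moving from the paired RDD to a DataFrame is acceptable, the built-in window function gives sliding-window aggregates directly; a sketch with hypothetical column names and window sizes:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max, min, window

spark = SparkSession.builder.appName("sliding-window").getOrCreate()

# Hypothetical (timestamp, value) data; in practice this would come from the RDD.
df = spark.createDataFrame(
    [("2015-01-01 00:05:00", 3.0), ("2015-01-01 00:20:00", 7.0),
     ("2015-01-01 01:10:00", 1.0)],
    ["ts", "value"],
).selectExpr("CAST(ts AS timestamp) AS ts", "value")

# 1-hour windows sliding every 15 minutes.
stats = (df.groupBy(window("ts", "1 hour", "15 minutes"))
           .agg(avg("value"), min("value"), max("value")))
stats.orderBy("window").show(truncate=False)
```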
16 votes • 3 answers

How to read multiple gzipped files from S3 into a single RDD?

I have many gzipped files stored on S3 which are organized by project and hour per day; the pattern of the paths of the files is…
shihpeng • 5,283 • 6 • 37 • 63
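sc.textFile accepts glob patterns and comma-separated paths, and gzipped files are decompressed transparently, so many objects can be read into a single RDD; a sketch with a hypothetical bucket layout (the s3a connector must be available on the cluster):

```python
from pyspark import SparkContext

sc = SparkContext(appName="read-s3-gz")

# Globs and comma-separated paths both work; .gz files are decompressed
# automatically (but each gzip file becomes a single, unsplittable partition).
logs = sc.textFile("s3a://my-bucket/project-*/2015/01/*/logs/*.gz")
print(logs.count())
```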
16 votes • 3 answers

Spark: run InputFormat as singleton

I'm trying to integrate a key-value database to Spark and have some questions. I'm a Spark beginner, have read a lot and run some samples but nothing too complex. Scenario: I'm using a small hdfs cluster to store incoming messages in a database. The…
cruppstahl • 2,447 • 1 • 19 • 25
16 votes • 3 answers

Sampling a large distributed data set using pyspark / spark

I have a file in hdfs which is distributed across the nodes in the cluster. I'm trying to get a random sample of 10 lines from this file. In the pyspark shell, I read the file into an RDD using: >>> textFile =…
mgoldwasser • 14,558 • 15 • 79 • 103
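For a fixed-size sample, RDD.takeSample returns the rows to the driver, while RDD.sample keeps a fractional sample distributed; a sketch with a hypothetical path:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sample-lines")
textFile = sc.textFile("hdfs:///data/big_file.txt")

# Exactly 10 random lines, collected to the driver.
ten_lines = textFile.takeSample(False, 10, seed=42)
for line in ten_lines:
    print(line)

# Or: an approximate 0.1% sample that stays distributed as an RDD.
small_rdd = textFile.sample(False, 0.001, seed=42)
```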
15 votes • 5 answers

Databricks: Issue while creating spark data frame from pandas

I have a pandas data frame which I want to convert into a spark data frame. Usually, I use the below code to create a spark data frame from pandas, but all of a sudden I started to get the below error. I am aware that pandas has removed iteritems() but my…
data en • 431 • 1 • 2 • 9
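If the cluster runs Spark 3.3 or older together with pandas 2.x, this error typically comes from Spark still calling the removed DataFrame.iteritems; pinning pandas below 2.0 (or upgrading Spark / the Databricks runtime) is the clean fix, and a commonly used stopgap is to restore the alias before converting. A sketch of that workaround, not an official API:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Workaround for Spark <= 3.3 with pandas >= 2.0, where iteritems was removed.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
sdf = spark.createDataFrame(pdf)
sdf.show()
```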
15 votes • 2 answers

How is ColumnarToRow an efficient operation in Spark

In my understanding, a columnar format is better for MapReduce tasks. Even for something like selection of some columns, columnar works well as we don't have to load other columns into memory. But in Spark 3.0 I'm seeing this ColumnarToRow operation…
kar09 • 411 • 3 • 11
15 votes • 1 answer

Does PySpark code run in JVM or Python subprocess?

I want to understand what is happening under the hood when I run the following script named t1.py with python3 t1.py. Specifically, I have the following questions: What kind of code is submitted to the spark worker node? Is it the python code or a…
Charles Ju • 1,095 • 1 • 9 • 28
15 votes • 2 answers

convert spark dataframe to aws glue dynamic frame

I tried converting my spark dataframes to dynamic frames to output as glueparquet files but I'm getting the error "'DataFrame' object has no attribute 'fromDF'". My code uses spark dataframes heavily. Is there a way to convert from spark dataframe to…
user3476463 • 3,967 • 22 • 57 • 117
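fromDF is a static method on the DynamicFrame class from the AWS Glue libraries, not a method on a Spark DataFrame, which is what the quoted error suggests was attempted. A sketch assuming a Glue job environment where a GlueContext can be created:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Convert a Spark DataFrame to a Glue DynamicFrame (not df.fromDF(...)).
dyf = DynamicFrame.fromDF(df, glueContext, "my_dynamic_frame")

# And back again if needed:
df_again = dyf.toDF()
```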
15 votes • 1 answer

How to mock inner call to pyspark sql function

Got the following piece of pyspark code: import pyspark.sql.functions as F null_or_unknown_count = df.sample(0.01).filter( F.col('env').isNull() | (F.col('env') == 'Unknown') ).count() In test code, the data frame is mocked, so I am trying to…
arun • 10,685 • 6 • 59 • 81
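One common pattern is to patch pyspark.sql.functions where the code under test looks it up, so F.col never needs a live SparkSession; the module and function names below are hypothetical stand-ins for the real test target:

```python
from unittest import mock

# Hypothetical module under test containing:
#   import pyspark.sql.functions as F
#   def count_null_or_unknown(df):
#       return df.sample(0.01).filter(
#           F.col('env').isNull() | (F.col('env') == 'Unknown')).count()
import mymodule


def test_count_null_or_unknown():
    df = mock.MagicMock()
    # Make the sample/filter/count chain return a canned value.
    df.sample.return_value.filter.return_value.count.return_value = 7

    # Patch F as seen inside mymodule, so F.col(...) never touches the JVM.
    with mock.patch("mymodule.F"):
        assert mymodule.count_null_or_unknown(df) == 7

    df.sample.assert_called_once_with(0.01)
```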
15 votes • 2 answers

pandasUDF and pyarrow 0.15.0

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at…
ilijaluve • 1,050 • 2 • 10 • 24
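Spark 2.4.x with pyarrow 0.15+ is a known incompatibility caused by a change in Arrow's binary IPC format; the workarounds described in the Spark 2.4 documentation are to pin pyarrow below 0.15 or to set the ARROW_PRE_0_15_IPC_FORMAT=1 environment variable for the driver and executors. A configuration sketch (on EMR this is often done through the spark-env configuration classification instead):

```python
import os

from pyspark.sql import SparkSession

# Driver-side Python must see the flag before Arrow serialization is used.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (SparkSession.builder
         .appName("pandas-udf-arrow")
         # Tell pyarrow >= 0.15 to keep writing the legacy IPC format
         # that Spark 2.4.x expects, on the executors as well.
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```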
15 votes • 2 answers

In Apache Spark, how to convert a slow RDD/dataset into a stream?

I'm investigating an interesting case that involves wide transformations (e.g. repartition & join) on a slow RDD or dataset, e.g. the dataset defined by the following code: val ds = sqlContext.createDataset(1 to 100) .repartition(1) …
tribbloid • 4,026 • 14 • 64 • 103