Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive workloads as well as iterative algorithms in machine learning or graph computing.
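For example, a minimal PySpark sketch of loading a dataset into memory and querying it repeatedly; the file path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Load a (hypothetical) dataset and keep it in memory across queries.
events = spark.read.json("hdfs:///data/events.json").cache()

# The first action materializes the cache; subsequent queries reuse it.
print(events.count())
print(events.filter(events["status"] == "ERROR").count())
events.groupBy("status").count().show()
```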

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
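As a rough illustration, a minimal Structured Streaming job using the built-in rate source (so it runs without external infrastructure) that counts rows per event-time window:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# The built-in "rate" source generates (timestamp, value) rows for testing.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count rows per 30-second event-time window (micro-batch processing by default).
counts = stream.groupBy(window(stream["timestamp"], "30 seconds")).count()

# Print each updated result table to the console; runs until stopped.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```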

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
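For instance, in the pyspark shell, where a SparkSession is already available as spark, a quick interactive query might look like this (the file path and the age column are hypothetical):

```python
# Inside the pyspark shell, `spark` is created for you.
df = spark.read.csv("hdfs:///data/people.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("people")
spark.sql("SELECT age, COUNT(*) AS n FROM people GROUP BY age ORDER BY n DESC").show()
```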

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior can differ between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
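A reproducible example usually means a tiny DataFrame built inline instead of references to private data, the Spark version, and the actual versus expected output; a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()
print(spark.version)  # always state the version you are running

# Small inline data instead of a reference to private files.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 5)],
    ["key", "value"],
)
df.groupBy("key").sum("value").show()
# Expected output: (a, 3) and (b, 5)
```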

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

81095 questions
16 votes • 7 answers

spark ssc.textFileStream is not streaming any files from directory

I am trying to execute the below code using Eclipse (with Maven conf) with 2 workers, each having 2 cores, and have also tried with spark-submit. public class StreamingWorkCount implements Serializable { public static void main(String[] args) { …
Kaushal • 3,237 • 3 • 29 • 48
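A frequent cause of this symptom is that textFileStream only picks up files that are moved or written into the monitored directory after the streaming context has started; files already present at startup are ignored. A minimal PySpark sketch of the same word-count pattern, with a hypothetical input directory:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="StreamingWordCount")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second batches

# Only files *newly* moved or written into this directory after start()
# are picked up; files already present when the job starts are ignored.
lines = ssc.textFileStream("hdfs:///incoming/text/")
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda w: (w, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()

ssc.start()
ssc.awaitTermination()
```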
16 votes • 3 answers

Extract information from an `org.apache.spark.sql.Row`

I have Array[org.apache.spark.sql.Row] returned by sqc.sql(sqlcmd).collect(): Array([10479,6,10], [8975,149,640], ...) I can get the individual values: scala> pixels(0)(0) res34: Any = 10479 but they are Any, not Int. How do I extract them as…
sds • 58,617 • 29 • 161 • 278
16 votes • 2 answers

Spark SQL: how to cache sql query result without using rdd.cache()

Is there any way to cache a SQL query result without using rdd.cache()? For example: output = sqlContext.sql("SELECT * From people") We can use output.cache() to cache the result, but then we cannot use a sql query to deal with it. So I want…
lwwwzh • 225 • 1 • 2 • 9
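One option is to register the result as a temporary view and cache it through the catalog, so later SQL statements run against the cached data; a sketch using the modern SparkSession API, with the people table taken from the question and a hypothetical age column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sql").getOrCreate()

# Assumes a "people" table or view already exists, as in the question.
output = spark.sql("SELECT * FROM people")
output.createOrReplaceTempView("people_cached")

# Cache the view itself; SQL queries against it reuse the cached data.
spark.catalog.cacheTable("people_cached")

spark.sql("SELECT COUNT(*) FROM people_cached").show()
spark.sql("SELECT * FROM people_cached WHERE age > 30").show()
```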
16 votes • 2 answers

How to configure Apache Spark random worker ports for tight firewalls?

I am using Apache Spark to run machine learning algorithms and other big data tasks. Previously, I was using spark cluster standalone mode running spark master and worker on the same machine. Now, I added multiple worker machines and due to a tight…
Isma Khan • 161 • 1 • 1 • 6
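The usual approach is to pin the ports that Spark otherwise chooses at random and open only those in the firewall; a sketch of the relevant settings (the port numbers are arbitrary examples, and in standalone mode the master/worker ports set via spark-env.sh, such as SPARK_WORKER_PORT, may need fixing as well):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("fixed-ports")
         # Ports that are random by default; pin them so the firewall
         # only needs a small, known range opened.
         .config("spark.driver.port", "40000")
         .config("spark.blockManager.port", "40010")
         # If a pinned port is busy, Spark retries port+1, port+2, ...
         .config("spark.port.maxRetries", "16")
         .getOrCreate())
```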
16 votes • 1 answer

Apache Spark - Dealing with Sliding Windows on Temporal RDDs

I've been working quite a lot with Apache Spark the last few months but now I have received a pretty difficult task: to compute average/minimum/maximum etcetera on a sliding window over a paired RDD where the Key component is a date tag and the…
Johan S • 3,531 • 6 • 35 • 63
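If moving from the paired RDD to a DataFrame is acceptable, the built-in window function gives sliding-window aggregates directly; a sketch with hypothetical column names and window sizes:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max, min, window

spark = SparkSession.builder.appName("sliding-window").getOrCreate()

# Hypothetical (timestamp, value) data; in practice this would come from the RDD.
df = spark.createDataFrame(
    [("2015-01-01 00:05:00", 3.0), ("2015-01-01 00:20:00", 7.0),
     ("2015-01-01 01:10:00", 1.0)],
    ["ts", "value"],
).selectExpr("CAST(ts AS timestamp) AS ts", "value")

# 1-hour windows sliding every 15 minutes.
stats = (df.groupBy(window("ts", "1 hour", "15 minutes"))
           .agg(avg("value"), min("value"), max("value")))
stats.orderBy("window").show(truncate=False)
```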
16 votes • 3 answers

How to read multiple gzipped files from S3 into a single RDD?

I have many gzipped files stored on S3 which are organized by project and hour per day; the pattern of the paths of the files is…
shihpeng • 5,283 • 6 • 37 • 63
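sc.textFile accepts glob patterns and comma-separated paths, and gzipped files are decompressed transparently, so many objects can be read into a single RDD; a sketch with a hypothetical bucket layout (the s3a connector must be available on the cluster):

```python
from pyspark import SparkContext

sc = SparkContext(appName="read-s3-gz")

# Globs and comma-separated paths both work; .gz files are decompressed
# automatically (but each gzip file becomes a single, unsplittable partition).
logs = sc.textFile("s3a://my-bucket/project-*/2015/01/*/logs/*.gz")
print(logs.count())
```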
16 votes • 3 answers

Spark: run InputFormat as singleton

I'm trying to integrate a key-value database to Spark and have some questions. I'm a Spark beginner, have read a lot and run some samples but nothing too complex. Scenario: I'm using a small hdfs cluster to store incoming messages in a database. The…
cruppstahl • 2,447 • 1 • 19 • 25
16 votes • 3 answers

Sampling a large distributed data set using pyspark / spark

I have a file in hdfs which is distributed across the nodes in the cluster. I'm trying to get a random sample of 10 lines from this file. In the pyspark shell, I read the file into an RDD using: >>> textFile =…
mgoldwasser • 14,558 • 15 • 79 • 103
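For a fixed-size sample, RDD.takeSample returns the rows to the driver, while RDD.sample keeps a fractional sample distributed; a sketch with a hypothetical path:

```python
from pyspark import SparkContext

sc = SparkContext(appName="sample-lines")
textFile = sc.textFile("hdfs:///data/big_file.txt")

# Exactly 10 random lines, collected to the driver.
ten_lines = textFile.takeSample(False, 10, seed=42)
for line in ten_lines:
    print(line)

# Or: an approximate 0.1% sample that stays distributed as an RDD.
small_rdd = textFile.sample(False, 0.001, seed=42)
```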
15 votes • 5 answers

Databricks: Issue while creating spark data frame from pandas

I have a pandas data frame which I want to convert into a spark data frame. Usually, I use the below code to create a spark data frame from pandas, but all of a sudden I started to get the below error. I am aware that pandas has removed iteritems() but my…
data en • 431 • 1 • 2 • 9
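If the cluster runs Spark 3.3 or older together with pandas 2.x, this error typically comes from Spark still calling the removed DataFrame.iteritems; pinning pandas below 2.0 (or upgrading Spark / the Databricks runtime) is the clean fix, and a commonly used stopgap is to restore the alias before converting. A sketch of that workaround, not an official API:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-to-spark").getOrCreate()

# Workaround for Spark <= 3.3 with pandas >= 2.0, where iteritems was removed.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

pdf = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
sdf = spark.createDataFrame(pdf)
sdf.show()
```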
15 votes • 2 answers

How is ColumnarToRow an efficient operation in Spark

In my understanding, a columnar format is better for MapReduce tasks. Even for something like selection of some columns, columnar works well as we don't have to load other columns into memory. But in Spark 3.0 I'm seeing this ColumnarToRow operation…
kar09 • 411 • 3 • 11
15 votes • 1 answer

Does PySpark code run in JVM or Python subprocess?

I want to understand what is happening under the hood when I run the following script named t1.py with python3 t1.py. Specifically, I have the following questions: What kind of code is submitted to the spark worker node? Is it the python code or a…
Charles Ju • 1,095 • 1 • 9 • 28
15 votes • 2 answers

convert spark dataframe to aws glue dynamic frame

I tried converting my spark dataframes to dynamic frames to output as glueparquet files but I'm getting the error "'DataFrame' object has no attribute 'fromDF'". My code uses spark dataframes heavily. Is there a way to convert from spark dataframe to…
user3476463 • 3,967 • 22 • 57 • 117
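fromDF is a static method on the DynamicFrame class from the AWS Glue libraries, not a method on a Spark DataFrame, which is what the quoted error suggests was attempted. A sketch assuming a Glue job environment where a GlueContext can be created:

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Convert a Spark DataFrame to a Glue DynamicFrame (not df.fromDF(...)).
dyf = DynamicFrame.fromDF(df, glueContext, "my_dynamic_frame")

# And back again if needed:
df_again = dyf.toDF()
```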
15 votes • 1 answer

How to mock inner call to pyspark sql function

Got the following piece of pyspark code: import pyspark.sql.functions as F null_or_unknown_count = df.sample(0.01).filter( F.col('env').isNull() | (F.col('env') == 'Unknown') ).count() In test code, the data frame is mocked, so I am trying to…
arun • 10,685 • 6 • 59 • 81
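One common pattern is to patch pyspark.sql.functions where the code under test looks it up, so F.col never needs a live SparkSession; the module and function names below are hypothetical stand-ins for the real test target:

```python
from unittest import mock

# Hypothetical module under test containing:
#   import pyspark.sql.functions as F
#   def count_null_or_unknown(df):
#       return df.sample(0.01).filter(
#           F.col('env').isNull() | (F.col('env') == 'Unknown')).count()
import mymodule


def test_count_null_or_unknown():
    df = mock.MagicMock()
    # Make the sample/filter/count chain return a canned value.
    df.sample.return_value.filter.return_value.count.return_value = 7

    # Patch F as seen inside mymodule, so F.col(...) never touches the JVM.
    with mock.patch("mymodule.F"):
        assert mymodule.count_null_or_unknown(df) == 7

    df.sample.assert_called_once_with(0.01)
```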
15 votes • 2 answers

pandasUDF and pyarrow 0.15.0

I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are java.lang.IllegalArgumentException at java.nio.ByteBuffer.allocate(ByteBuffer.java:334) at…
ilijaluve • 1,050 • 2 • 10 • 24
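Spark 2.4.x with pyarrow 0.15+ is a known incompatibility caused by a change in Arrow's binary IPC format; the workarounds described in the Spark 2.4 documentation are to pin pyarrow below 0.15 or to set the ARROW_PRE_0_15_IPC_FORMAT=1 environment variable for the driver and executors. A configuration sketch (on EMR this is often done through the spark-env configuration classification instead):

```python
import os

from pyspark.sql import SparkSession

# Driver-side Python must see the flag before Arrow serialization is used.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

spark = (SparkSession.builder
         .appName("pandas-udf-arrow")
         # Tell pyarrow >= 0.15 to keep writing the legacy IPC format
         # that Spark 2.4.x expects, on the executors as well.
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
```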
15 votes • 2 answers

In Apache Spark, how to convert a slow RDD/dataset into a stream?

I'm investigating an interesting case that involves wide transformations (e.g. repartition & join) on a slow RDD or dataset, e.g. the dataset defined by the following code: val ds = sqlContext.createDataset(1 to 100) .repartition(1) …
tribbloid • 4,026 • 14 • 64 • 103