Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning or graph computing.
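
As an illustration of that load-once, query-repeatedly pattern, here is a minimal PySpark sketch; the input path and the "level" column are placeholders for whatever dataset you actually use:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    # Placeholder path and columns; any DataFrame works the same way.
    events = spark.read.json("hdfs:///data/events.json").cache()

    # The first action materializes the data in executor memory; later
    # queries reuse the cached partitions instead of re-reading from disk.
    total = events.count()
    errors = events.filter(events["level"] == "ERROR").count()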

Spark can tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
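
For example, a minimal Structured Streaming job using the default micro-batch trigger might look like the sketch below; the socket host and port are placeholders for a toy source, and a real job would typically read from Kafka or files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Toy socket source; in practice this would be Kafka, files, etc.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Micro-batch word count, written to the console sink.
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()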

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior often differs between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
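
In practice a reproducible example usually boils down to a tiny, self-contained DataFrame plus the output you get, the output you expect, and the Spark version; a sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()

    # Small, pasteable input instead of a reference to your private data.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, None)],
        ["id", "letter"],
    )
    df.show()

    # Always state the version you are running.
    print(spark.version)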

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81,095 questions
64 votes, 2 answers

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command. 17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed 17/12/27…
Markus
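
For context on the question above: spark.driver.maxResultSize caps the total size of serialized task results collected back to the driver, and it must be in place before the driver starts. A sketch of the two usual ways to raise it (the 2g value is only an example):

    from pyspark.sql import SparkSession

    # Either on the command line:
    #   spark-submit --conf spark.driver.maxResultSize=2g ...
    # or when the session is built:
    spark = (SparkSession.builder
             .appName("driver-result-size")
             .config("spark.driver.maxResultSize", "2g")
             .getOrCreate())
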
64 votes, 4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in Pyspark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas in them which should not be treated as commas. How can I handle this in Pyspark ?…
femibyte
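
A commonly used option combination for the question above, assuming the file doubles quotes inside quoted fields (RFC 4180 style); the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quoted-csv").getOrCreate()

    # " is already the default quote character; making it the escape
    # character as well lets "a ""quoted"" value, with, commas" parse
    # as a single field instead of being split on the embedded commas.
    df_raw = (spark.read
              .option("header", "true")
              .option("quote", '"')
              .option("escape", '"')
              .csv("path/to/file.csv"))
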
64 votes, 3 answers

How to check Spark Version

I want to check the spark version in cdh 5.7.0. I have searched on the internet but not able to understand. Please help.
Ironman
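
A few standard ways to answer the version question above; the sc variable assumes the stock PySpark shell:

    # From the command line:
    #   spark-submit --version
    #   spark-shell --version
    # Inside a PySpark shell, the context already exists as `sc`:
    #   sc.version
    # Inside an application (Spark 2.x and later):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)
    print(spark.sparkContext.version)
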
64 votes, 2 answers

Apache Spark -- Assign the result of UDF to multiple dataframe columns

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values,…
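
One widely used pattern for the question above is to have the UDF return a struct and then flatten it with select; a sketch using a hypothetical two-field JSON payload (for JSON specifically, the built-in from_json function is an alternative that avoids a Python UDF):

    import json

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("struct-udf").getOrCreate()

    df = spark.createDataFrame([('{"a": "x", "b": "y"}',)], ["payload"])

    # The UDF returns all derived values as one struct ...
    schema = StructType([
        StructField("a", StringType()),
        StructField("b", StringType()),
    ])

    @udf(returnType=schema)
    def parse_payload(s):
        d = json.loads(s)
        return (d.get("a"), d.get("b"))

    # ... which is then split into separate top-level columns.
    result = (df.withColumn("parsed", parse_payload(col("payload")))
                .select("payload", "parsed.a", "parsed.b"))
    result.show()
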
64 votes, 4 answers

How does Spark partition(ing) work on files in HDFS?

I'm working with Apache Spark on a cluster using HDFS. As far as I understand, HDFS is distributing files on data-nodes. So if I put a "file.txt" on the filesystem, it will be split into partitions. Now I'm calling rdd =…
Degget
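
A quick way to observe the behaviour asked about above is to check the partition count directly; roughly speaking, each HDFS block of the file backs at least one partition, and the minPartitions hint can increase but not reduce that number. The path below is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitions").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.textFile("hdfs:///user/me/file.txt", minPartitions=8)
    print(rdd.getNumPartitions())   # at least the number of HDFS blocks
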
63 votes, 6 answers

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null), ("") ).toDF("numbers") val lettersDf = Seq( …
Powers
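
For the question above, the usual answer is null-safe equality: the <=> operator in Scala/SQL, exposed in the Python DataFrame API as Column.eqNullSafe since Spark 2.3. A sketch mirroring the excerpt's data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("null-safe-join").getOrCreate()

    numbers_df = spark.createDataFrame(
        [("123",), ("456",), (None,), ("",)], ["numbers"]
    )
    letters_df = spark.createDataFrame(
        [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")],
        ["numbers", "letters"],
    )

    # eqNullSafe treats NULL == NULL as true, so the null row is matched
    # instead of being silently dropped by the join.
    joined = numbers_df.join(
        letters_df, numbers_df["numbers"].eqNullSafe(letters_df["numbers"])
    )
    joined.show()
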
63 votes, 9 answers

Importing spark.implicits._ in scala

I am trying to import spark.implicits._ Apparently, this is an object inside a class in scala. when i import it in a method like so: def f() = { val spark = SparkSession().... import spark.implicits._ } It works fine, however i am writing a…
ShinySpiderdude
63 votes, 6 answers

How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load tables using Spark SQL. I know this can be performed by using an individual dataframe for each file [given below], but can it be automated…
Chendur
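
For the question above, DataFrameReader.csv accepts a list of paths or a glob pattern, so one call loads every file against the predefined schema; the schema and paths below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("multi-csv").getOrCreate()

    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])

    # Glob over the folder ...
    df = spark.read.csv("/data/csv_folder/*.csv", schema=schema, header=True)
    # ... or pass an explicit list of files:
    # df = spark.read.csv(["/data/a.csv", "/data/b.csv"], schema=schema, header=True)
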
63 votes, 8 answers

Why does a job fail with "No space left on device", but df says otherwise?

When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
samthebest
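
A frequent explanation for the question above is that shuffle and spill files land in spark.local.dir (often a small /tmp mount) rather than in the filesystem you checked with df -h. A sketch of pointing it at a larger volume; the directory is a placeholder, and note that cluster managers such as YARN override this setting with their own local-dirs configuration:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("local-dir")
             # Scratch space for shuffle and spill files.
             .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")
             .getOrCreate())
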
62 votes, 5 answers

What are the various join types in Spark?

I looked at the docs and it says the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit
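
To make the list in the question above concrete, a small sketch exercising a few of those join-type strings:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-types").getOrCreate()

    left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
    right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r"])

    # The third argument picks the join type.
    left.join(right, "id", "inner").show()       # id 2 only
    left.join(right, "id", "left_outer").show()  # ids 1 and 2, null r for id 1
    left.join(right, "id", "full_outer").show()  # ids 1, 2 and 3
    left.join(right, "id", "left_semi").show()   # left columns of matching rows (id 2)
    left.join(right, "id", "left_anti").show()   # left rows with no match (id 1)
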
62 votes, 6 answers

PySpark groupByKey returning pyspark.resultiterable.ResultIterable

I am trying to figure out why my groupByKey is returning the following: [(0, <pyspark.resultiterable.ResultIterable object at 0x...>), (1, <pyspark.resultiterable.ResultIterable object at 0x...>), (2,…
theMadKing
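
For the question above: groupByKey returns a lazy ResultIterable per key, which is what those reprs are; materializing it with list() via mapValues shows the grouped values. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupbykey").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([(0, "a"), (0, "b"), (1, "c"), (2, "d")])

    grouped = pairs.groupByKey().mapValues(list)
    # e.g. [(0, ['a', 'b']), (1, ['c']), (2, ['d'])] (key order may vary)
    print(grouped.collect())
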
61 votes, 1 answer

aggregate function Count usage with groupBy in Spark

I'm trying to make multiple operations in one line of code in pySpark, and not sure if that's possible for my case. My intention is not having to save the output as a new dataframe. My current code is rather simple: encodeUDF = udf(encode_time,…
Adiel
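
For the one-liner question above, agg lets the aggregate, its alias, and an ordering be chained in a single expression, so no intermediate DataFrame needs to be saved; a sketch with a toy column standing in for the UDF output:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("groupby-agg").getOrCreate()

    df = spark.createDataFrame(
        [("morning", 1), ("morning", 2), ("evening", 3)],
        ["time_of_day", "event"],
    )

    (df.groupBy("time_of_day")
       .agg(F.count("event").alias("cnt"))
       .orderBy(F.desc("cnt"))
       .show())
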
61 votes, 3 answers

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this: df.coalesce(1) .write .partitionBy("entity",…
Patrick McGloin
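
A common pattern for the question above is to repartition on the same columns that are passed to partitionBy, so that all rows of a given partition are written by a single task and each partition directory therefore contains a single Parquet file; the output path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("one-file-per-partition").getOrCreate()

    df = spark.createDataFrame(
        [("a", 2018, 1), ("a", 2018, 2), ("b", 2019, 3)],
        ["entity", "year", "value"],
    )

    (df.repartition("entity", "year")
       .write
       .partitionBy("entity", "year")
       .mode("overwrite")
       .parquet("/tmp/spark-out"))
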
61 votes, 10 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
metasim
61 votes, 11 answers

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including: deploy with hadoop-aws and aws-java-sdk => cannot read environment variable for credentials add hadoop-aws into maven => various transitive…
tribbloid
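
For the s3a question above, the usual recipe is to put a hadoop-aws build matching your Hadoop version (plus its AWS SDK dependency) on the classpath, for example via spark-submit --packages, and then supply the fs.s3a.* credentials; a sketch with placeholder keys and bucket, using the (technically private) _jsc handle to reach the Hadoop configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a").getOrCreate()

    # Hadoop configuration reached through the underlying JVM context.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

    df = spark.read.text("s3a://some-bucket/some/prefix/")   # placeholder bucket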