Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and streaming processing. Common use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive use as well as to iterative algorithms in machine learning or graph computing.
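
As an illustration of that load-once, query-repeatedly pattern, here is a minimal PySpark sketch; the input path and the "level" column are placeholders for whatever dataset you actually use:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

    # Placeholder path and columns; any DataFrame works the same way.
    events = spark.read.json("hdfs:///data/events.json").cache()

    # The first action materializes the data in executor memory; later
    # queries reuse the cached partitions instead of re-reading from disk.
    total = events.count()
    errors = events.filter(events["level"] == "ERROR").count()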

Spark can tackle stream processing problems with several approaches: micro-batch processing, continuous processing (since Spark 2.3), running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on.
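
For example, a minimal Structured Streaming job using the default micro-batch trigger might look like the sketch below; the socket host and port are placeholders for a toy source, and a real job would typically read from Kafka or files:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Toy socket source; in practice this would be Kafka, files, etc.
    lines = (spark.readStream
             .format("socket")
             .option("host", "localhost")
             .option("port", 9999)
             .load())

    # Micro-batch word count, written to the console sink.
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    counts = words.groupBy("word").count()

    query = (counts.writeStream
             .outputMode("complete")
             .format("console")
             .start())
    query.awaitTermination()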

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, to specify the Spark version you're using, since behavior often differs between versions. You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
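
In practice a reproducible example usually boils down to a tiny, self-contained DataFrame plus the output you get, the output you expect, and the Spark version; a sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mvce").getOrCreate()

    # Small, pasteable input instead of a reference to your private data.
    df = spark.createDataFrame(
        [(1, "a"), (2, "b"), (3, None)],
        ["id", "letter"],
    )
    df.show()

    # Always state the version you are running.
    print(spark.version)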

Recommended reference sources:

- Latest version
- Release Notes for Stable Releases
- Apache Spark GitHub Repository

81,095 questions
64 votes, 2 answers

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command. 17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed 17/12/27…
Markus
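
For context on the question above: spark.driver.maxResultSize caps the total size of serialized task results collected back to the driver, and it must be in place before the driver starts. A sketch of the two usual ways to raise it (the 2g value is only an example):

    from pyspark.sql import SparkSession

    # Either on the command line:
    #   spark-submit --conf spark.driver.maxResultSize=2g ...
    # or when the session is built:
    spark = (SparkSession.builder
             .appName("driver-result-size")
             .config("spark.driver.maxResultSize", "2g")
             .getOrCreate())
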
64 votes, 4 answers

Reading csv files with quoted fields containing embedded commas

I am reading a csv file in Pyspark as follows: df_raw=spark.read.option("header","true").csv(csv_path) However, the data file has quoted fields with embedded commas in them which should not be treated as commas. How can I handle this in Pyspark ?…
femibyte
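
A commonly used option combination for the question above, assuming the file doubles quotes inside quoted fields (RFC 4180 style); the path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quoted-csv").getOrCreate()

    # " is already the default quote character; making it the escape
    # character as well lets "a ""quoted"" value, with, commas" parse
    # as a single field instead of being split on the embedded commas.
    df_raw = (spark.read
              .option("header", "true")
              .option("quote", '"')
              .option("escape", '"')
              .csv("path/to/file.csv"))
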
64 votes, 3 answers

How to check Spark Version

I want to check the spark version in cdh 5.7.0. I have searched on the internet but not able to understand. Please help.
Ironman
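
A few standard ways to answer the version question above; the sc variable assumes the stock PySpark shell:

    # From the command line:
    #   spark-submit --version
    #   spark-shell --version
    # Inside a PySpark shell, the context already exists as `sc`:
    #   sc.version
    # Inside an application (Spark 2.x and later):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    print(spark.version)
    print(spark.sparkContext.version)
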
64 votes, 2 answers

Apache Spark -- Assign the result of UDF to multiple dataframe columns

I'm using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in one of the columns (that contains a json string). That will return X values,…
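
One widely used pattern for the question above is to have the UDF return a struct and then flatten it with select; a sketch using a hypothetical two-field JSON payload (for JSON specifically, the built-in from_json function is an alternative that avoids a Python UDF):

    import json

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("struct-udf").getOrCreate()

    df = spark.createDataFrame([('{"a": "x", "b": "y"}',)], ["payload"])

    # The UDF returns all derived values as one struct ...
    schema = StructType([
        StructField("a", StringType()),
        StructField("b", StringType()),
    ])

    @udf(returnType=schema)
    def parse_payload(s):
        d = json.loads(s)
        return (d.get("a"), d.get("b"))

    # ... which is then split into separate top-level columns.
    result = (df.withColumn("parsed", parse_payload(col("payload")))
                .select("payload", "parsed.a", "parsed.b"))
    result.show()
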
64 votes, 4 answers

How does Spark partition(ing) work on files in HDFS?

I'm working with Apache Spark on a cluster using HDFS. As far as I understand, HDFS is distributing files on data-nodes. So if I put a "file.txt" on the filesystem, it will be split into partitions. Now I'm calling rdd =…
Degget
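
A quick way to observe the behaviour asked about above is to check the partition count directly; roughly speaking, each HDFS block of the file backs at least one partition, and the minPartitions hint can increase but not reduce that number. The path below is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partitions").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.textFile("hdfs:///user/me/file.txt", minPartitions=8)
    print(rdd.getNumPartitions())   # at least the number of HDFS blocks
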
63 votes, 6 answers

Including null values in an Apache Spark Join

I would like to include null values in an Apache Spark join. Spark doesn't include rows with null by default. Here is the default Spark behavior. val numbersDf = Seq( ("123"), ("456"), (null), ("") ).toDF("numbers") val lettersDf = Seq( …
Powers
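
For the question above, the usual answer is null-safe equality: the <=> operator in Scala/SQL, exposed in the Python DataFrame API as Column.eqNullSafe since Spark 2.3. A sketch mirroring the excerpt's data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("null-safe-join").getOrCreate()

    numbers_df = spark.createDataFrame(
        [("123",), ("456",), (None,), ("",)], ["numbers"]
    )
    letters_df = spark.createDataFrame(
        [("123", "abc"), ("456", "def"), (None, "zzz"), ("", "hhh")],
        ["numbers", "letters"],
    )

    # eqNullSafe treats NULL == NULL as true, so the null row is matched
    # instead of being silently dropped by the join.
    joined = numbers_df.join(
        letters_df, numbers_df["numbers"].eqNullSafe(letters_df["numbers"])
    )
    joined.show()
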
63 votes, 9 answers

Importing spark.implicits._ in scala

I am trying to import spark.implicits._ Apparently, this is an object inside a class in scala. when i import it in a method like so: def f() = { val spark = SparkSession().... import spark.implicits._ } It works fine, however i am writing a…
ShinySpiderdude
63 votes, 6 answers

How to import multiple csv files in a single load?

Consider I have a defined schema for loading 10 csv files in a folder. Is there a way to automatically load tables using Spark SQL. I know this can be performed by using an individual dataframe for each file [given below], but can it be automated…
Chendur
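
For the question above, DataFrameReader.csv accepts a list of paths or a glob pattern, so one call loads every file against the predefined schema; the schema and paths below are placeholders:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("multi-csv").getOrCreate()

    schema = StructType([
        StructField("id", IntegerType()),
        StructField("name", StringType()),
    ])

    # Glob over the folder ...
    df = spark.read.csv("/data/csv_folder/*.csv", schema=schema, header=True)
    # ... or pass an explicit list of files:
    # df = spark.read.csv(["/data/a.csv", "/data/b.csv"], schema=schema, header=True)
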
63 votes, 8 answers

Why does a job fail with "No space left on device", but df says otherwise?

When performing a shuffle my Spark job fails and says "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
samthebest
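
A frequent explanation for the question above is that shuffle and spill files land in spark.local.dir (often a small /tmp mount) rather than in the filesystem you checked with df -h. A sketch of pointing it at a larger volume; the directory is a placeholder, and note that cluster managers such as YARN override this setting with their own local-dirs configuration:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("local-dir")
             # Scratch space for shuffle and spill files.
             .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")
             .getOrCreate())
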
62 votes, 5 answers

What are the various join types in Spark?

I looked at the docs and it says the following join types are supported: Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti. I looked at…
pathikrit
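
To make the list in the question above concrete, a small sketch exercising a few of those join-type strings:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-types").getOrCreate()

    left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
    right = spark.createDataFrame([(2, "x"), (3, "y")], ["id", "r"])

    # The third argument picks the join type.
    left.join(right, "id", "inner").show()       # id 2 only
    left.join(right, "id", "left_outer").show()  # ids 1 and 2, null r for id 1
    left.join(right, "id", "full_outer").show()  # ids 1, 2 and 3
    left.join(right, "id", "left_semi").show()   # left columns of matching rows (id 2)
    left.join(right, "id", "left_anti").show()   # left rows with no match (id 1)
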
62 votes, 6 answers

PySpark groupByKey returning pyspark.resultiterable.ResultIterable

I am trying to figure out why my groupByKey is returning the following: [(0, <pyspark.resultiterable.ResultIterable object at 0x...>), (1, <pyspark.resultiterable.ResultIterable object at 0x...>), (2,…
theMadKing
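
For the question above: groupByKey returns a lazy ResultIterable per key, which is what those reprs are; materializing it with list() via mapValues shows the grouped values. A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("groupbykey").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([(0, "a"), (0, "b"), (1, "c"), (2, "d")])

    grouped = pairs.groupByKey().mapValues(list)
    # e.g. [(0, ['a', 'b']), (1, ['c']), (2, ['d'])] (key order may vary)
    print(grouped.collect())
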
61 votes, 1 answer

aggregate function Count usage with groupBy in Spark

I'm trying to make multiple operations in one line of code in pySpark, and not sure if that's possible for my case. My intention is not having to save the output as a new dataframe. My current code is rather simple: encodeUDF = udf(encode_time,…
Adiel
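
For the one-liner question above, agg lets the aggregate, its alias, and an ordering be chained in a single expression, so no intermediate DataFrame needs to be saved; a sketch with a toy column standing in for the UDF output:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("groupby-agg").getOrCreate()

    df = spark.createDataFrame(
        [("morning", 1), ("morning", 2), ("evening", 3)],
        ["time_of_day", "event"],
    )

    (df.groupBy("time_of_day")
       .agg(F.count("event").alias("cnt"))
       .orderBy(F.desc("cnt"))
       .show())
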
61 votes, 3 answers

DataFrame partitionBy to a single Parquet file (per partition)

I would like to repartition / coalesce my data so that it is saved into one Parquet file per partition. I would also like to use the Spark SQL partitionBy API. So I could do that like this: df.coalesce(1) .write .partitionBy("entity",…
Patrick McGloin
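
A common pattern for the question above is to repartition on the same columns that are passed to partitionBy, so that all rows of a given partition are written by a single task and each partition directory therefore contains a single Parquet file; the output path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("one-file-per-partition").getOrCreate()

    df = spark.createDataFrame(
        [("a", 2018, 1), ("a", 2018, 2), ("b", 2019, 3)],
        ["entity", "year", "value"],
    )

    (df.repartition("entity", "year")
       .write
       .partitionBy("entity", "year")
       .mode("overwrite")
       .parquet("/tmp/spark-out"))
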
61 votes, 10 answers

How do I read a Parquet in R and convert it to an R DataFrame?

I'd like to process Apache Parquet files (in my case, generated in Spark) in the R programming language. Is an R reader available? Or is work being done on one? If not, what would be the most expedient way to get there? Note: There are Java and C++…
metasim
61 votes, 11 answers

How to access s3a:// files from Apache Spark?

Hadoop 2.6 doesn't support s3a out of the box, so I've tried a series of solutions and fixes, including: deploy with hadoop-aws and aws-java-sdk => cannot read environment variable for credentials add hadoop-aws into maven => various transitive…
tribbloid
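
For the s3a question above, the usual recipe is to put a hadoop-aws build matching your Hadoop version (plus its AWS SDK dependency) on the classpath, for example via spark-submit --packages, and then supply the fs.s3a.* credentials; a sketch with placeholder keys and bucket, using the (technically private) _jsc handle to reach the Hadoop configuration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a").getOrCreate()

    # Hadoop configuration reached through the underlying JVM context.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")   # placeholder
    hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")   # placeholder

    df = spark.read.text("s3a://some-bucket/some/prefix/")   # placeholder bucket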