Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive use as well as for iterative algorithms in machine learning or graph computing.
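
A minimal PySpark sketch of this load-once, query-repeatedly pattern (the file path and column names below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory, then query it repeatedly.
events = spark.read.parquet("/data/events")   # illustrative path
events.cache()

events.count()                                 # first action materializes the cache
events.filter(events.status == "ERROR").count()
events.groupBy("status").count().show()
```
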

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
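
As a hedged illustration of the micro-batch and windowing approaches, a Structured Streaming sketch (broker address, topic name, and window size are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Micro-batch source: read from a Kafka topic (connection details are illustrative).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Windowing on the stream: count records per 10-minute event-time window.
counts = (raw
          .select("timestamp", F.col("value").cast("string").alias("payload"))
          .groupBy(F.window("timestamp", "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```
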

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
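
A sketch of what such a reproducible example might look like: a tiny inline dataset, the transformation in question, and the expected output (data and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()

# Small inline dataset instead of a reference to private data.
df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, "b", 30.0)],
    ["id", "key", "value"],
)

# The transformation the question is about.
result = df.groupBy("key").sum("value")
result.show()
# Expected output:
# +---+----------+
# |key|sum(value)|
# +---+----------+
# |  a|      10.0|
# |  b|      50.0|
# +---+----------+
```
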


Release Notes for Stable Releases

Apache Spark GitHub Repository


81095 questions
15 votes, 2 answers

How to set environment variable in databricks?

Simple question, but I can't find a simple guide on how to set the environment variable in Databricks. Also, is it important to set the environment variable on both the driver and executors (and would you do this via spark.conf)? Thanks
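
A hedged sketch for the question above: cluster-level environment variables are usually configured in the cluster settings (so that both driver and executor nodes see them), while spark.conf controls Spark SQL configuration rather than OS environment variables. Reading the variable from code might look like this (MY_SETTING is a hypothetical name):

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On the driver: read a variable set in the cluster configuration.
driver_value = os.environ.get("MY_SETTING", "not set")

# On the executors: the lambda runs inside tasks, so this shows what the
# executor processes actually see.
executor_values = (spark.sparkContext
                   .parallelize(range(2), 2)
                   .map(lambda _: os.environ.get("MY_SETTING", "not set"))
                   .collect())

print(driver_value, executor_values)
```
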
15 votes, 4 answers

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save the data into…
BdEngineer • 2,929 • 4 • 49 • 85
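
A hedged sketch of two common mitigations for the small-file problem above: trigger micro-batches less often so each one writes more data, and reduce the number of files per batch with coalesce inside foreachBatch (paths, topic, and interval are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "some_topic")
             .load())

def write_batch(batch_df, batch_id):
    # Fewer, larger output files per micro-batch; 1 is illustrative, tune to volume.
    batch_df.coalesce(1).write.mode("append").parquet("/data/out")

query = (stream_df.writeStream
         .trigger(processingTime="10 minutes")        # larger batches, fewer files
         .option("checkpointLocation", "/chk/out")
         .foreachBatch(write_batch)
         .start())
```
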
15 votes, 2 answers

Find mean of pyspark array

In pyspark, I have a variable length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type. Is there a way to find the average of an array without exploding the array out? I have…
Aaron Faltesek • 319 • 2 • 11
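
For the array-mean question above, Spark 2.4+ offers higher-order functions that avoid exploding the array; a sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [1.0, 2.0, 3.0]), (2, [10.0, 20.0])],
    ["id", "values"],
)

# Sum the array with the higher-order 'aggregate' function, then divide by its size.
df = df.withColumn(
    "mean",
    F.expr("aggregate(values, 0D, (acc, x) -> acc + x) / size(values)"),
)
df.show()
```
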
15 votes, 6 answers

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query: describe formatted I…
code • 5,294 • 16 • 62 • 113
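
A sketch of the parsing approach mentioned in the question, done through the DataFrame API rather than by scraping console output (database and table names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# DESCRIBE FORMATTED returns a DataFrame with col_name/data_type/comment columns;
# the row whose col_name is "Location" carries the table path.
location = (spark.sql("DESCRIBE FORMATTED mydb.mytable")
            .filter(F.col("col_name") == "Location")
            .select("data_type")
            .first()[0])

print(location)
```
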
15 votes, 3 answers

How to solve this error org.apache.spark.sql.catalyst.errors.package$TreeNodeException

I have two processes; each process does 1) connect to an Oracle DB and read a specific table, 2) form a dataframe and process it, 3) save the df to Cassandra. If I run both processes in parallel, both try to read from Oracle and I get the below error…
15 votes, 1 answer

What is the difference between .sc and .scala file?

I am learning Scala and got to know that we can save a Scala file using two extensions, that is my.sc and my.scala. Here is the sample file which I created: my.scala object My { /** Our main function where the action happens */ def main(args:…
KayV • 12,987 • 11 • 98 • 148
15 votes, 3 answers

How to optimize partitioning when migrating data from JDBC source?

I am trying to move data from a table in PostgreSQL to a Hive table on HDFS. To do that, I came up with the following code: val conf = new…
Metadata • 2,127 • 9 • 56 • 127
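
A hedged sketch of a partitioned JDBC read for the migration above: Spark splits the range of a numeric partitionColumn into numPartitions parallel queries (URL, credentials, table, and bounds are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.big_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")     # numeric column used to split the reads
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")       # 16 concurrent queries against the source
      .load())

df.write.mode("overwrite").saveAsTable("target_db.big_table")
```
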
15 votes, 4 answers

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation. I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore. I create a session in my Spark program as follows: SparkSession…
Anthony Arrascue • 220 • 1 • 2 • 13
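
In short, enableHiveSupport() makes the SparkSession use the Hive metastore (located via hive-site.xml on the classpath) as its catalog, so existing Hive tables become queryable from Spark SQL; a minimal sketch (table name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()      # use the Hive metastore as the catalog
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM mydb.some_hive_table").show()
```
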
15 votes, 2 answers

Why does Scala compiler fail with "no ': _*' annotation allowed here" when Row does accept varargs?

I would like to create a Row with multiple arguments without knowing their number. I wrote something like this in Scala: def customRow(balance: Int, globalGrade: Int, indicators: Double*): Row = { Row( balance, …
Baptiste Merliot • 841 • 11 • 24
15 votes, 2 answers

'GroupedData' object has no attribute 'show' when doing pivot in spark dataframe

I want to pivot a Spark dataframe. I referred to the pyspark documentation, and based on the pivot function, the clue is .groupBy('name').pivot('name', values=None). Here's my dataset: In[75]: spDF.show() Out[75]: +-----------+-----------+ |customer_id| …
Nabih Bawazir • 6,381 • 7 • 37 • 70
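
The error in the question above arises because groupBy(...).pivot(...) still returns a GroupedData object; only after an aggregation is there a DataFrame with a show() method. A sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2019-01", 10.0), (1, "2019-02", 20.0), (2, "2019-01", 5.0)],
    ["customer_id", "month", "amount"],
)

# pivot() returns GroupedData; .agg(...) turns it into a DataFrame that can be shown.
pivoted = df.groupBy("customer_id").pivot("month").agg(F.sum("amount"))
pivoted.show()
```
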
15 votes, 2 answers

External Hive Table Refresh table vs MSCK Repair

I have an external Hive table stored as Parquet, partitioned on a column, say as_of_dt, and data gets inserted via Spark streaming. Now every day a new partition gets added. I am doing msck repair table so that the Hive metastore gets the newly added…
Ajith Kannan • 812 • 1 • 8 • 30
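
Roughly: MSCK REPAIR scans the table location and registers partitions missing from the metastore, REFRESH TABLE invalidates Spark's cached metadata/file listing for a table it already knows, and a single new partition can also be added explicitly. A sketch (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Discover partitions present on storage but missing from the metastore.
spark.sql("MSCK REPAIR TABLE mydb.events")

# Cheaper alternative when the new partition value is known.
spark.sql("""
    ALTER TABLE mydb.events
    ADD IF NOT EXISTS PARTITION (as_of_dt = '2020-01-01')
""")

# Invalidate Spark's cached metadata/file listing for the table.
spark.sql("REFRESH TABLE mydb.events")
```
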
15 votes, 2 answers

What are the compression types supported in parquet

I was writing data on Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I can only find 2 types of compression - snappy and gzip being used most of the time. Does Parquet support any other compression like Deflate and…
User_qwerty • 375 • 1 • 2 • 10
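
Beyond snappy and gzip, Spark's Parquet writer also documents codecs such as none/uncompressed, lzo, brotli, lz4, and zstd, though availability of some depends on the Spark version and the underlying Hadoop/Parquet build. A sketch of both per-write and session-wide settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Per-write codec.
df.write.option("compression", "gzip").parquet("/data/out_gzip")

# Session-wide default for Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
df.write.parquet("/data/out_zstd")
```
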
15 votes, 4 answers

create empty array-column of given schema in Spark

Due to the fact that Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now as I read the table, I want to do the opposite: I have a DataFrame with the following schema: |-- id: long (nullable = false) …
Raphael Roth • 26,751 • 15 • 88 • 145
15 votes, 2 answers

basedir must be absolute: ?/.ivy2/local

I'm writing here in a state of full desperation... I have 2 users: 1 local user, created in Linux. Works 100% fine, word count works perfectly. Kerberized cluster. Valid ticket. 1 Active Directory user, can log in, but the pyspark instruction (same word…
Joao Barreto • 181 • 1 • 1 • 9
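
This error commonly appears when the submitting user has no resolvable home directory, so Ivy's default cache path "?/.ivy2" is not absolute. A frequently used workaround is to point spark.jars.ivy at an absolute, writable directory, e.g. --conf spark.jars.ivy=/tmp/.ivy2 on spark-submit; an in-code sketch of the same idea (the path is illustrative, and the setting must take effect before the session starts):

```python
from pyspark.sql import SparkSession

# Give Ivy an absolute, writable cache directory so dependency resolution
# does not fall back to "?/.ivy2".
spark = (SparkSession.builder
         .config("spark.jars.ivy", "/tmp/.ivy2")
         .getOrCreate())
```
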
15 votes, 1 answer

Storing multiple dataframes of different widths with Parquet?

Does Parquet support storing various data frames of different widths (numbers of columns) in a single file? E.g. in HDF5 it is possible to store multiple such data frames and access them by key. So far it looks from my reading that Parquet does not…
Turo • 1,537 • 2 • 21 • 42
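
A Parquet file has a single schema, so the usual workaround for the question above is one Parquet dataset (directory) per data frame, addressed by path rather than by an in-file key as in HDF5; a sketch with made-up frames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

wide = spark.createDataFrame([(1, "a", 3.0)], ["id", "name", "score"])
narrow = spark.createDataFrame([(1, "a")], ["id", "name"])

# One directory per data frame, playing the role of HDF5 keys (paths illustrative).
wide.write.mode("overwrite").parquet("/data/store/wide")
narrow.write.mode("overwrite").parquet("/data/store/narrow")

wide_back = spark.read.parquet("/data/store/wide")
```
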