Questions tagged [apache-spark]

Apache Spark is an open-source distributed data processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited for interactive use as well as for iterative algorithms in machine learning or graph computing.
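
A minimal PySpark sketch of this load-once, query-repeatedly pattern (the file path and column names below are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Load once and keep the data in cluster memory, then query it repeatedly.
events = spark.read.parquet("/data/events")   # illustrative path
events.cache()

events.count()                                 # first action materializes the cache
events.filter(events.status == "ERROR").count()
events.groupBy("status").count().show()
```
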

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since 2.3, running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on).
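
As a hedged illustration of the micro-batch and windowing approaches, a Structured Streaming sketch (broker address, topic name, and window size are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Micro-batch source: read from a Kafka topic (connection details are illustrative).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# Windowing on the stream: count records per 10-minute event-time window.
counts = (raw
          .select("timestamp", F.col("value").cast("string").alias("payload"))
          .groupBy(F.window("timestamp", "10 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
```
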

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
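
A sketch of what such a reproducible example might look like: a tiny inline dataset, the transformation in question, and the expected output (data and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()

# Small inline dataset instead of a reference to private data.
df = spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0), (3, "b", 30.0)],
    ["id", "key", "value"],
)

# The transformation the question is about.
result = df.groupBy("key").sum("value")
result.show()
# Expected output:
# +---+----------+
# |key|sum(value)|
# +---+----------+
# |  a|      10.0|
# |  b|      50.0|
# +---+----------+
```
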


Release Notes for Stable Releases

Apache Spark GitHub Repository


81095 questions
15 votes, 2 answers

How to set environment variable in databricks?

Simple question, but I can't find a simple guide on how to set the environment variable in Databricks. Also, is it important to set the environment variable on both the driver and executors (and would you do this via spark.conf)? Thanks
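
A hedged sketch for the question above: cluster-level environment variables are usually configured in the cluster settings (so that both driver and executor nodes see them), while spark.conf controls Spark SQL configuration rather than OS environment variables. Reading the variable from code might look like this (MY_SETTING is a hypothetical name):

```python
import os
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# On the driver: read a variable set in the cluster configuration.
driver_value = os.environ.get("MY_SETTING", "not set")

# On the executors: the lambda runs inside tasks, so this shows what the
# executor processes actually see.
executor_values = (spark.sparkContext
                   .parallelize(range(2), 2)
                   .map(lambda _: os.environ.get("MY_SETTING", "not set"))
                   .collect())

print(driver_value, executor_values)
```
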
15 votes, 4 answers

How to handle small file problem in spark structured streaming?

I have a scenario in my project where I am reading Kafka topic messages using spark-sql-2.4.1. I am able to process the data using structured streaming. Once the data is received and processed, I need to save the data into…
BdEngineer • 2,929 • 4 • 49 • 85
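
A hedged sketch of two common mitigations for the small-file problem above: trigger micro-batches less often so each one writes more data, and reduce the number of files per batch with coalesce inside foreachBatch (paths, topic, and interval are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream_df = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "broker:9092")
             .option("subscribe", "some_topic")
             .load())

def write_batch(batch_df, batch_id):
    # Fewer, larger output files per micro-batch; 1 is illustrative, tune to volume.
    batch_df.coalesce(1).write.mode("append").parquet("/data/out")

query = (stream_df.writeStream
         .trigger(processingTime="10 minutes")        # larger batches, fewer files
         .option("checkpointLocation", "/chk/out")
         .foreachBatch(write_batch)
         .start())
```
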
15 votes, 2 answers

Find mean of pyspark array

In pyspark, I have a variable length array of doubles for which I would like to find the mean. However, the average function requires a single numeric type. Is there a way to find the average of an array without exploding the array out? I have…
Aaron Faltesek • 319 • 2 • 11
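
For the array-mean question above, Spark 2.4+ offers higher-order functions that avoid exploding the array; a sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, [1.0, 2.0, 3.0]), (2, [10.0, 20.0])],
    ["id", "values"],
)

# Sum the array with the higher-order 'aggregate' function, then divide by its size.
df = df.withColumn(
    "mean",
    F.expr("aggregate(values, 0D, (acc, x) -> acc + x) / size(values)"),
)
df.show()
```
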
15 votes, 6 answers

How to get the value of the location for a Hive table using a Spark object?

I am interested in being able to retrieve the location value of a Hive table given a Spark object (SparkSession). One way to obtain this value is by parsing the output of the location via the following SQL query: describe formatted I…
code • 5,294 • 16 • 62 • 113
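
A sketch of the parsing approach mentioned in the question, done through the DataFrame API rather than by scraping console output (database and table names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# DESCRIBE FORMATTED returns a DataFrame with col_name/data_type/comment columns;
# the row whose col_name is "Location" carries the table path.
location = (spark.sql("DESCRIBE FORMATTED mydb.mytable")
            .filter(F.col("col_name") == "Location")
            .select("data_type")
            .first()[0])

print(location)
```
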
15 votes, 3 answers

How to solve this error org.apache.spark.sql.catalyst.errors.package$TreeNodeException

I have two processes; each process does 1) connect to an Oracle DB and read a specific table, 2) form a dataframe and process it, 3) save the df to Cassandra. If I run both processes in parallel, both try to read from Oracle and I get the below error…
15 votes, 1 answer

What is the difference between .sc and .scala file?

I am learning Scala and got to know that we can save a Scala file using two extensions, that is my.sc and my.scala. Here is the sample file which I created: my.scala object My { /** Our main function where the action happens */ def main(args:…
KayV • 12,987 • 11 • 98 • 148
15 votes, 3 answers

How to optimize partitioning when migrating data from JDBC source?

I am trying to move data from a table in PostgreSQL to a Hive table on HDFS. To do that, I came up with the following code: val conf = new…
Metadata • 2,127 • 9 • 56 • 127
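
A hedged sketch of a partitioned JDBC read for the migration above: Spark splits the range of a numeric partitionColumn into numPartitions parallel queries (URL, credentials, table, and bounds are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")
      .option("dbtable", "public.big_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")     # numeric column used to split the reads
      .option("lowerBound", "1")
      .option("upperBound", "10000000")
      .option("numPartitions", "16")       # 16 concurrent queries against the source
      .load())

df.write.mode("overwrite").saveAsTable("target_db.big_table")
```
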
15 votes, 4 answers

Spark 2: how does it work when SparkSession enableHiveSupport() is invoked

My question is rather simple, but somehow I cannot find a clear answer by reading the documentation. I have Spark2 running on a CDH 5.10 cluster. There is also Hive and a metastore. I create a session in my Spark program as follows: SparkSession…
Anthony Arrascue • 220 • 1 • 2 • 13
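
In short, enableHiveSupport() makes the SparkSession use the Hive metastore (located via hive-site.xml on the classpath) as its catalog, so existing Hive tables become queryable from Spark SQL; a minimal sketch (table name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()      # use the Hive metastore as the catalog
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM mydb.some_hive_table").show()
```
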
15 votes, 2 answers

Why does Scala compiler fail with "no ': _*' annotation allowed here" when Row does accept varargs?

I would like to create a Row with multiple arguments without knowing their number. I wrote something like this in Scala: def customRow(balance: Int, globalGrade: Int, indicators: Double*): Row = { Row( balance, …
Baptiste Merliot • 841 • 11 • 24
15 votes, 2 answers

'GroupedData' object has no attribute 'show' when doing pivot in spark dataframe

I want to pivot a Spark dataframe. I referred to the pyspark documentation, and based on the pivot function, the clue is .groupBy('name').pivot('name', values=None). Here's my dataset: In[75]: spDF.show() Out[75]: +-----------+-----------+ |customer_id| …
Nabih Bawazir • 6,381 • 7 • 37 • 70
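
The error in the question above arises because groupBy(...).pivot(...) still returns a GroupedData object; only after an aggregation is there a DataFrame with a show() method. A sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "2019-01", 10.0), (1, "2019-02", 20.0), (2, "2019-01", 5.0)],
    ["customer_id", "month", "amount"],
)

# pivot() returns GroupedData; .agg(...) turns it into a DataFrame that can be shown.
pivoted = df.groupBy("customer_id").pivot("month").agg(F.sum("amount"))
pivoted.show()
```
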
15 votes, 2 answers

External Hive Table Refresh table vs MSCK Repair

I have an external Hive table stored as Parquet, partitioned on a column, say as_of_dt, and data gets inserted via Spark streaming. Now every day a new partition gets added. I am doing msck repair table so that the Hive metastore gets the newly added…
Ajith Kannan • 812 • 1 • 8 • 30
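
Roughly: MSCK REPAIR scans the table location and registers partitions missing from the metastore, REFRESH TABLE invalidates Spark's cached metadata/file listing for a table it already knows, and a single new partition can also be added explicitly. A sketch (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Discover partitions present on storage but missing from the metastore.
spark.sql("MSCK REPAIR TABLE mydb.events")

# Cheaper alternative when the new partition value is known.
spark.sql("""
    ALTER TABLE mydb.events
    ADD IF NOT EXISTS PARTITION (as_of_dt = '2020-01-01')
""")

# Invalidate Spark's cached metadata/file listing for the table.
spark.sql("REFRESH TABLE mydb.events")
```
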
15 votes, 2 answers

What are the compression types supported in parquet

I was writing data on Hadoop and Hive in Parquet format using Spark. I want to enable compression, but I can only find 2 types of compression - snappy and gzip being used most of the time. Does Parquet support any other compression like Deflate and…
User_qwerty • 375 • 1 • 2 • 10
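
Beyond snappy and gzip, Spark's Parquet writer also documents codecs such as none/uncompressed, lzo, brotli, lz4, and zstd, though availability of some depends on the Spark version and the underlying Hadoop/Parquet build. A sketch of both per-write and session-wide settings:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000)

# Per-write codec.
df.write.option("compression", "gzip").parquet("/data/out_gzip")

# Session-wide default for Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")
df.write.parquet("/data/out_zstd")
```
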
15 votes, 4 answers

create empty array-column of given schema in Spark

Due to the fact that Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now as I read the table, I want to do the opposite: I have a DataFrame with the following schema: |-- id: long (nullable = false) …
Raphael Roth • 26,751 • 15 • 88 • 145
15 votes, 2 answers

basedir must be absolute: ?/.ivy2/local

I'm writing here in a state of full desperation... I have 2 users: 1 local user, created in Linux. Works 100% fine, word count works perfectly. Kerberized cluster. Valid ticket. 1 Active Directory user, can log in, but the pyspark instruction (same word…
Joao Barreto • 181 • 1 • 1 • 9
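
This error commonly appears when the submitting user has no resolvable home directory, so Ivy's default cache path "?/.ivy2" is not absolute. A frequently used workaround is to point spark.jars.ivy at an absolute, writable directory, e.g. --conf spark.jars.ivy=/tmp/.ivy2 on spark-submit; an in-code sketch of the same idea (the path is illustrative, and the setting must take effect before the session starts):

```python
from pyspark.sql import SparkSession

# Give Ivy an absolute, writable cache directory so dependency resolution
# does not fall back to "?/.ivy2".
spark = (SparkSession.builder
         .config("spark.jars.ivy", "/tmp/.ivy2")
         .getOrCreate())
```
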
15 votes, 1 answer

Storing multiple dataframes of different widths with Parquet?

Does Parquet support storing various data frames of different widths (numbers of columns) in a single file? E.g. in HDF5 it is possible to store multiple such data frames and access them by key. So far it looks from my reading that Parquet does not…
Turo • 1,537 • 2 • 21 • 42
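
A Parquet file has a single schema, so the usual workaround for the question above is one Parquet dataset (directory) per data frame, addressed by path rather than by an in-file key as in HDF5; a sketch with made-up frames:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

wide = spark.createDataFrame([(1, "a", 3.0)], ["id", "name", "score"])
narrow = spark.createDataFrame([(1, "a")], ["id", "name"])

# One directory per data frame, playing the role of HDF5 keys (paths illustrative).
wide.write.mode("overwrite").parquet("/data/store/wide")
narrow.write.mode("overwrite").parquet("/data/store/narrow")

wide_back = spark.read.parquet("/data/store/wide")
```
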