Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
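
As a concrete illustration of this pattern, here is a minimal PySpark sketch (the file path and column names are hypothetical): the dataset is cached after the first action, so later queries are served from memory.

```python
# Minimal sketch: cache a dataset, then query it repeatedly from memory.
# "events.csv" and the "status" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # materialized in cluster memory on the first action

# Subsequent queries reuse the in-memory data instead of re-reading disk.
df.filter(df["status"] == "error").count()
df.groupBy("status").count().show()
```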

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on.
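
For the micro-batch approach specifically, here is a minimal Structured Streaming sketch (the socket source and the running line count are illustrative only):

```python
# Illustrative micro-batch job: count lines from a socket source and
# print running totals to the console after each batch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

counts = lines.groupBy("value").count()  # running count per distinct line

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```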

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
16 votes · 4 answers

How to select last row and also how to access PySpark dataframe by index?

From a PySpark SQL dataframe like name age city / abc 20 A / def 30 B, how do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In…
Satya • 5,470 • 17 • 47 • 72
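
A hedged sketch of two common workarounds: Spark DataFrames have no intrinsic row order, so "last" is only meaningful relative to an explicit sort key (the "age" column here is hypothetical), and index-style access can be emulated with zipWithIndex.

```python
from pyspark.sql import functions as F

# "Last" row relative to an explicit sort key (hypothetical "age" column):
last_row = df.orderBy(F.col("age").desc()).limit(1).collect()

# Index-style access: append a positional index via zipWithIndex, then filter.
indexed = (df.rdd.zipWithIndex()
           .map(lambda pair: tuple(pair[0]) + (pair[1],))  # (row values..., idx)
           .toDF(df.columns + ["idx"]))
row_200 = indexed.filter(indexed.idx == 200).collect()
```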
16 votes · 1 answer

Comparison between fasttext and LDA

Last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation is another way to do topic modeling. My question is: has anyone done a comparison of the pros and cons of these two? I…
Nabs • 553 • 5 • 17
16 votes · 1 answer

How to convert ArrayType to DenseVector in PySpark DataFrame?

I'm getting the following error trying to build a ML Pipeline: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually…
Evan Zamir • 8,059 • 14 • 56 • 83
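
The usual fix reported for this error is to convert the array column into ml Vectors with a UDF; a hedged sketch, assuming a hypothetical features_array column of doubles:

```python
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

# UDF that turns an array<double> column into the DenseVector type the
# ML Pipeline expects; "features_array" is a hypothetical column name.
to_vector = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())

df = df.withColumn("features", to_vector(F.col("features_array")))
```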
16 votes · 1 answer

How to balance my data across the partitions?

Edit: The answer helps, but I described my solution in: memoryOverhead issue in Spark. I have an RDD with 202092 partitions, which reads a dataset created by others. I can manually see that the data is not balanced across the partitions, for…
gsamaras • 71,951 • 46 • 188 • 305
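
A hedged sketch of the standard remedy: repartition() performs a full shuffle that spreads rows roughly evenly, while coalesce() avoids a shuffle but only merges partitions; the target count of 2000 is illustrative.

```python
# Full shuffle to a smaller, evenly filled set of partitions.
balanced = rdd.repartition(2000)

# Inspect the resulting distribution of records per partition.
sizes = balanced.glom().map(len).collect()
print(min(sizes), max(sizes))
```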
16 votes · 1 answer

Spark 2.0: Relative path in absolute URI (spark-warehouse)

I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do: Spark 1.6 df = sqlContext.read \ …
Disco4Ever • 1,043 • 2 • 11 • 16
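
The commonly reported fix is to give spark.sql.warehouse.dir an explicit absolute URI so Spark 2.0 does not construct an invalid default; a hedged sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-read")
         # An explicit absolute URI avoids "Relative path in absolute URI".
         .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
         .getOrCreate())

df = spark.read.csv("file:///path/to/data.csv", header=True)
```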
16 votes · 3 answers

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema: |-- ts: long (nullable = true) |-- id: integer (nullable = true) |-- managers: array (nullable = true) | |-- element: string (containsNull = true) |-- projects: array…
Neel • 9,913 • 16 • 52 • 74
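
One common cause of this error is that the two frames' columns differ in order or type, since union resolves columns positionally; a hedged PySpark sketch of the usual alignment step (Spark 1.x still uses unionAll):

```python
# Project both frames onto the same column list before the union,
# since union matches columns by position, not by name.
cols = df1.columns
aligned = df1.select(*cols).unionAll(df2.select(*cols))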
16 votes · 1 answer

Spark, add new Column with the same value in Scala

I have a problem with the withColumn function in a Spark-Scala environment. I would like to add a new column to my DataFrame like this: +---+----+---+ | A| B| C| +---+----+---+ | 4|blah| 2| | 2| | 3| | 56| foo| 3| |100|null| …
Alessandro • 337 • 1 • 5 • 18
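
The standard answer is to wrap the constant in lit(), since withColumn requires a Column; a hedged PySpark sketch (the Scala equivalent uses org.apache.spark.sql.functions.lit the same way):

```python
from pyspark.sql import functions as F

# lit() wraps a constant into a Column; every row gets the same value.
df = df.withColumn("D", F.lit(1))
```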
16 votes · 1 answer

Spark: Read an inputStream instead of File

I'm using SparkSQL in a Java application to do some processing on CSV files using Databricks for parsing. The data I am processing comes from different sources (Remote URL, local file, Google Cloud Storage), and I'm in the habit of turning…
Nate Vaughan • 3,471 • 4 • 29 • 47
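
A hedged sketch of one workaround: drain the stream in the driver, parallelize the lines, and let Spark parse them as CSV. The helper name is hypothetical, it assumes Spark 2.2+ (where DataFrameReader.csv accepts an RDD of strings), and it only suits inputs small enough to hold in driver memory.

```python
import io

def df_from_stream(spark, stream):
    """Hypothetical helper: parse a binary file-like object as CSV."""
    text = io.TextIOWrapper(stream, encoding="utf-8")
    lines = text.read().splitlines()          # driver-side materialization
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, header=True)   # RDD[str] input: Spark 2.2+
```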
16 votes · 1 answer

How to connect HBase and Spark using Python?

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet…
Def_Os • 5,301 • 5 • 34 • 63
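
One classic approach reads HBase through the Hadoop input format; a hedged sketch assuming an existing SparkContext sc, the HBase client jars plus the Spark examples jar on the classpath, and hypothetical host and table names:

```python
# Read an HBase table as an RDD of (row key, result) string pairs.
conf = {
    "hbase.zookeeper.quorum": "zk-host",       # hypothetical ZooKeeper host
    "hbase.mapreduce.inputtable": "my_table",  # hypothetical table name
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf,
)
```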
16 votes · 4 answers

PySpark computing correlation

I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors objects. How do I translate a column of df['some_name']…
VJune • 1,195 • 5 • 16 • 26
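
Two hedged options, with hypothetical column names: DataFrame.stat.corr avoids the Vector plumbing entirely, while Statistics.corr also accepts two RDDs of floats rather than an RDD of Vectors.

```python
from pyspark.mllib.stat import Statistics

# Straight from the DataFrame (Pearson by default):
r1 = df.stat.corr("col_a", "col_b")

# Or via MLlib, using two RDDs of floats instead of an RDD of Vectors:
xs = df.select("col_a").rdd.map(lambda row: float(row[0]))
ys = df.select("col_b").rdd.map(lambda row: float(row[0]))
r2 = Statistics.corr(xs, ys, method="pearson")
```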
16 votes · 3 answers

Spark SQL broadcast hash join

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html In that example,…
user1759848 • 223 • 1 • 3 • 9
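
A hedged sketch of the DataFrame API route: the broadcast() hint marks the small side, steering the planner toward a broadcast hash join when that side fits in memory (large_df, small_df, and the join key are hypothetical):

```python
from pyspark.sql.functions import broadcast

# Mark the small side so the planner favors a broadcast hash join.
joined = large_df.join(broadcast(small_df), on="id", how="inner")
joined.explain()  # the physical plan should show BroadcastHashJoin
```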
16 votes · 1 answer

Spark SQL filter multiple fields

What is the correct syntax for filtering on multiple columns in the Scala API? If I want to do something like this: dataFrame.filter($"col01" === "something" && $"col02" === "something else") or dataFrame.filter($"col01" === "something" || $"col02"…
gstvolvr • 650 • 1 • 8 • 17
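
Both Scala forms in the excerpt are syntactically valid; for reference, a hedged PySpark analogue, where each condition must be parenthesized and combined with & or |:

```python
from pyspark.sql import functions as F

both = df.filter((F.col("col01") == "something") &
                 (F.col("col02") == "something else"))
either = df.filter((F.col("col01") == "something") |
                   (F.col("col02") == "something else"))
```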
16 votes · 5 answers

How to filter a Spark dataframe by a boolean column?

I created a dataframe that has the following schema: In [43]: yelp_df.printSchema() root |-- business_id: string (nullable = true) |-- cool: integer (nullable = true) |-- date: string (nullable = true) |-- funny: integer (nullable = true) |--…
Nasreddin • 1,509 • 9 • 31 • 36
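
A hedged sketch: a boolean column is itself a valid filter predicate, so no explicit comparison is needed (is_open is a hypothetical column name):

```python
from pyspark.sql import functions as F

open_rows = df.filter(F.col("is_open"))      # keep rows where it is True
closed_rows = df.filter(~F.col("is_open"))   # negation keeps the False rows
```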
16 votes · 1 answer

Read an unsupported mix of union types from an Avro file in Apache Spark

I'm trying to switch from reading CSV flat files to Avro files on Spark. Following https://github.com/databricks/spark-avro, I use: import com.databricks.spark.avro._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df =…
Zahiro Mor • 1,708 • 1 • 16 • 30
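
For context, spark-avro only maps a limited set of union types (essentially union(something, null) and a couple of numeric promotions); other unions fail with the error in the title, and the usual workaround is to rewrite the schema before Spark reads the files. A hedged PySpark sketch of the basic read, with a hypothetical path:

```python
# Basic Avro read with spark-avro (Spark 1.x-era API, hypothetical path).
df = (sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/episodes.avro"))
```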
16 votes · 2 answers

Spark Sql: TypeError("StructType can not accept object in type %s" % type(obj))

I am currently pulling data from SQL Server using PyODBC and trying to insert it into a table in Hive in a near-real-time (NRT) manner. I get a single row from the source, convert it into List[String], and create the schema programmatically, but while…
ThirdEye • 433 • 1 • 5 • 13
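
This TypeError typically means a record was passed as a bare string rather than a tuple/list/Row matching the schema; a hedged sketch of the expected shape:

```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
])

row = ("Ada", "Lovelace")                  # one record as a tuple, not a string
df = spark.createDataFrame([row], schema)  # note the enclosing list of records
```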