Questions tagged [apache-spark]

Apache Spark is an open-source distributed data-processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Typical use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
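As an illustration of that load-once, query-repeatedly pattern, here is a minimal PySpark sketch (the file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Load once and keep the data in cluster memory.
events = spark.read.json("hdfs:///data/events")   # hypothetical path
events.cache()

# The first action populates the cache; later queries reuse the in-memory data.
events.filter(events["status"] == "error").count()
events.groupBy("status").count().show()

spark.stop()
```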

Spark can be used to tackle stream-processing problems with many approaches: micro-batch processing, continuous processing (since Spark 2.3), SQL queries over streams, windowing on batch and streaming data, ML libraries that learn from streamed data, and so on.
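To illustrate the micro-batch plus windowing style, here is a small Structured Streaming sketch; the built-in rate source is used only so the example is self-contained, where a real job would read from Kafka, files, or a socket:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed rate.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 1-minute event-time window (processed in micro-batches by default).
counts = events.groupBy(window(events["timestamp"], "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```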

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
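For example, a few lines typed into the pyspark shell give quick, ad-hoc answers over a dataset (a minimal sketch):

```python
# In recent pyspark shells, `spark` (SparkSession) and `sc` (SparkContext) are created for you.
df = spark.range(0, 1000).toDF("id")

# Interactive queries, evaluated as soon as you hit enter:
df.filter("id % 2 = 0").count()
df.selectExpr("max(id)", "avg(id)").show()
```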

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
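In practice that usually means inlining a tiny dataset and showing the expected output rather than pointing at private files. A minimal sketch of such an example (the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()   # also state your exact Spark version in the question

# Small, inline input data that anyone can paste and run.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 5)],
    ["key", "value"],
)

# The transformation being asked about, plus the output you expect or observe.
df.groupBy("key").sum("value").show()
# +---+----------+
# |key|sum(value)|
# +---+----------+
# |  a|         3|
# |  b|         5|
# +---+----------+
```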

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
17 votes · 1 answer

Spark SQL: Why two jobs for one query?

Experiment: I tried the following snippet on Spark 1.6.1. val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") # This has 45 files soDF.registerTempTable("so") sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by…
Mohitt · 2,957 · 3 · 29 · 52
17 votes · 2 answers

Extracting `Seq[(String,String,String)]` from spark DataFrame

I have a spark DF with rows of Seq[(String, String, String)]. I'm trying to do some kind of a flatMap with that but anything I do try ends up throwing java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema…
Matti Lyra · 12,828 · 8 · 49 · 67
17 votes · 6 answers

Spark-Shell Startup Errors

I am seeing errors when starting spark-shell, using spark-1.6.0-bin-hadoop2.6. This is new behavior that just arose. The upshot of the failures displayed in the log messages below, is that sqlContext is not available (but sc is). Is there some kind…
slachterman · 1,515 · 4 · 17 · 23
17 votes · 2 answers

What is the Scala case class equivalent in PySpark?

How would you go about employing and/or implementing a case class equivalent in PySpark?
conner.xyz · 6,273 · 8 · 39 · 65
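One commonly suggested substitute for a case class in PySpark is a Row with named fields or a namedtuple. A hedged sketch of both options (the Person fields are hypothetical):

```python
from collections import namedtuple
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("case-class-equivalent").getOrCreate()

# Option 1: Row with named fields; the schema is inferred from the values.
df1 = spark.createDataFrame([Row(name="Alice", age=30), Row(name="Bob", age=25)])

# Option 2: a namedtuple behaves much like a lightweight case class.
Person = namedtuple("Person", ["name", "age"])
df2 = spark.createDataFrame([Person("Alice", 30), Person("Bob", 25)])

df2.printSchema()   # name: string, age: long
```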
17 votes · 4 answers

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database. My job flow is as follows: Grab the log data from S3 Either use spark dataframes or spark sql to parse the data and write back out to…
flybonzai · 3,763 · 11 · 38 · 72
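One common pattern for the question above is submitting a spark-submit step to an existing EMR cluster through command-runner.jar. A sketch with a hypothetical cluster id, region, and S3 paths:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # hypothetical cluster id
    Steps=[{
        "Name": "parse-server-logs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",   # lets EMR run spark-submit as a cluster step
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/parse_logs.py",   # hypothetical job script
            ],
        },
    }],
)
print(response["StepIds"])   # step ids you can poll for completion
```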
17 votes · 1 answer

What are the differences between Apache Spark and Apache Apex?

Apache Apex is an open-source, enterprise-grade unified stream and batch processing platform. It is used in the GE Predix platform for IoT. What are the key differences between these two platforms? Questions: From a data science perspective, how is it…
17 votes · 5 answers

local class incompatible Exception: when running spark standalone from IDE

I have begun to test Spark. I installed Spark on my local machine and ran a local cluster with a single worker. When I tried to execute my job from my IDE, setting the SparkConf as follows: final SparkConf conf = new…
17 votes · 2 answers

How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.read.json("/tmp/persons.json") val ds = df.as[Person] ds.printSchema but in the Java version I don't know how to convert a DataFrame to a Dataset?…
Milad Khajavi · 2,769 · 9 · 41 · 66
17 votes · 3 answers

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using: data = sqlContext.read.json("s3n://.....") Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the…
pemfir · 365 · 1 · 3 · 10
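A sketch of one common workaround for the question above: the JSON reader does not take a partition count directly, so repartition right after loading (this reuses the sqlContext from the question; the bucket path is hypothetical):

```python
# Load, then explicitly spread the data over a chosen number of partitions/tasks.
data = sqlContext.read.json("s3n://my-bucket/logs/")
data = data.repartition(100)

print(data.rdd.getNumPartitions())   # ~100 partitions for subsequent stages
```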
17 votes · 3 answers

Spark Word2vec vector mathematics

I was looking at the example of Spark site for Word2Vec: val input = sc.textFile("text8").map(line => line.split(" ").toSeq) val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms("country name here",…
user3803714 · 5,269 · 10 · 42 · 61
17 votes · 2 answers

Sparksql filtering (selecting with where clause) with multiple conditions

Hi I have the following issue: numeric.registerTempTable("numeric"). All the values that I want to filter on are literal null strings and not N/A or Null values. I tried these three options: numeric_filtered = numeric.filter(numeric['LOW'] !=…
user3803714 · 5,269 · 10 · 42 · 61
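For the question above, the usual fix is to combine Column conditions with & rather than Python's and, wrapping each condition in parentheses. A minimal sketch reusing the numeric DataFrame from the excerpt (the column names are hypothetical):

```python
# Each comparison produces a Column; combine them with & / | and parenthesize each one.
numeric_filtered = numeric.filter(
    (numeric["LOW"] != "null") &
    (numeric["HIGH"] != "null") &
    (numeric["NORMAL"] != "null")
)
numeric_filtered.show()
```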
17 votes · 5 answers

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

I can use the following code to read a single json file but I need to read multiple json files and merge them into one Dataframe. How can I do this? DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json"); Or is there a way…
Abu Sulaiman · 1,477 · 2 · 18 · 32
17 votes · 3 answers

Manually calling spark's garbage collection from pyspark

I have been running a workflow on some 3 Million records x 15 columns all strings on my 4 cores 16GB machine using pyspark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting spark, memory runs out and I…
architectonic · 2,871 · 2 · 21 · 35
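One approach sometimes suggested for the question above is to unpersist cached data, collect garbage on the Python side, and then ask the driver JVM to collect as well. Note that sc._jvm is an internal py4j handle and System.gc() is only a hint, so treat this as a sketch rather than a supported API:

```python
import gc

# Release cached data you no longer need before re-running the workflow.
df.unpersist()

gc.collect()           # Python-side garbage collection
sc._jvm.System.gc()    # request a GC in the driver JVM (executors manage their own heaps)
```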
17 votes · 3 answers

Replace null values in Spark DataFrame

I saw a solution here, but when I tried it, it doesn't work for me. First I import a cars.csv file: val df = sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") …
Gavin Niu · 1,315 · 4 · 20 · 27
17 votes · 3 answers

converting pandas dataframes to spark dataframe in zeppelin

I am new to Zeppelin. I have a use case wherein I have a pandas dataframe. I need to visualize the collections using Zeppelin's built-in charts, and I do not have a clear approach here. My understanding is that with Zeppelin we can visualize the data if it is a…
Bala · 675 · 2 · 7 · 23
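For the question above, the usual route is to convert the pandas DataFrame into a Spark DataFrame and hand it to Zeppelin's charting, either through z.show or a %sql paragraph. A sketch assuming the sqlContext and z objects provided by Zeppelin's %pyspark interpreter, with made-up data:

```python
import pandas as pd

# An example pandas DataFrame (hypothetical data).
pdf = pd.DataFrame({"city": ["Paris", "Tokyo"], "population": [2.1, 13.9]})

# Convert to a Spark DataFrame so Zeppelin's built-in charts can render it.
sdf = sqlContext.createDataFrame(pdf)

z.show(sdf)                         # display with Zeppelin's table/chart widget
sdf.registerTempTable("cities")     # or query it from a %sql paragraph
```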