Questions tagged [apache-spark]

Apache Spark is an open-source distributed data-processing engine written in Scala that provides a unified API and distributed datasets for both batch and streaming processing. Typical use cases for Apache Spark include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrary operator graphs, and it supports in-memory computing, which lets it query data faster than disk-based engines such as Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and it promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
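As an illustration of that load-once, query-repeatedly pattern, here is a minimal PySpark sketch (the file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-sketch").getOrCreate()

# Load once and keep the data in cluster memory.
events = spark.read.json("hdfs:///data/events")   # hypothetical path
events.cache()

# The first action populates the cache; later queries reuse the in-memory data.
events.filter(events["status"] == "error").count()
events.groupBy("status").count().show()

spark.stop()
```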

Spark can be used to tackle stream-processing problems with many approaches: micro-batch processing, continuous processing (since Spark 2.3), SQL queries over streams, windowing on batch and streaming data, ML libraries that learn from streamed data, and so on.
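To illustrate the micro-batch plus windowing style, here is a small Structured Streaming sketch; the built-in rate source is used only so the example is self-contained, where a real job would read from Kafka, files, or a socket:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed rate.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 1-minute event-time window (processed in micro-batches by default).
counts = events.groupBy(window(events["timestamp"], "1 minute")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```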

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
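For example, a few lines typed into the pyspark shell give quick, ad-hoc answers over a dataset (a minimal sketch):

```python
# In recent pyspark shells, `spark` (SparkSession) and `sc` (SparkContext) are created for you.
df = spark.range(0, 1000).toDF("id")

# Interactive queries, evaluated as soon as you hit enter:
df.filter("id % 2 = 0").count()
df.selectExpr("max(id)", "avg(id)").show()
```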

Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
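In practice that usually means inlining a tiny dataset and showing the expected output rather than pointing at private files. A minimal sketch of such an example (the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mvce").getOrCreate()   # also state your exact Spark version in the question

# Small, inline input data that anyone can paste and run.
df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 5)],
    ["key", "value"],
)

# The transformation being asked about, plus the output you expect or observe.
df.groupBy("key").sum("value").show()
# +---+----------+
# |key|sum(value)|
# +---+----------+
# |  a|         3|
# |  b|         5|
# +---+----------+
```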

Recommended reference sources:

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

81095 questions
17 votes · 1 answer

Spark SQL: Why two jobs for one query?

Experiment: I tried the following snippet on Spark 1.6.1. val soDF = sqlContext.read.parquet("/batchPoC/saleOrder") # This has 45 files soDF.registerTempTable("so") sqlContext.sql("select dpHour, count(*) as cnt from so group by dpHour order by…
Mohitt · 2,957 · 3 · 29 · 52
17 votes · 2 answers

Extracting `Seq[(String,String,String)]` from spark DataFrame

I have a spark DF with rows of Seq[(String, String, String)]. I'm trying to do some kind of a flatMap with that but anything I do try ends up throwing java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema…
Matti Lyra · 12,828 · 8 · 49 · 67
17 votes · 6 answers

Spark-Shell Startup Errors

I am seeing errors when starting spark-shell, using spark-1.6.0-bin-hadoop2.6. This is new behavior that just arose. The upshot of the failures displayed in the log messages below, is that sqlContext is not available (but sc is). Is there some kind…
slachterman · 1,515 · 4 · 17 · 23
17 votes · 2 answers

What is the Scala case class equivalent in PySpark?

How would you go about employing and/or implementing a case class equivalent in PySpark?
conner.xyz · 6,273 · 8 · 39 · 65
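One commonly suggested substitute for a case class in PySpark is a Row with named fields or a namedtuple. A hedged sketch of both options (the Person fields are hypothetical):

```python
from collections import namedtuple
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("case-class-equivalent").getOrCreate()

# Option 1: Row with named fields; the schema is inferred from the values.
df1 = spark.createDataFrame([Row(name="Alice", age=30), Row(name="Bob", age=25)])

# Option 2: a namedtuple behaves much like a lightweight case class.
Person = namedtuple("Person", ["name", "age"])
df2 = spark.createDataFrame([Person("Alice", 30), Person("Bob", 25)])

df2.printSchema()   # name: string, age: long
```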
17 votes · 4 answers

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

I am creating a job to parse massive amounts of server data, and then upload it into a Redshift database. My job flow is as follows: Grab the log data from S3 Either use spark dataframes or spark sql to parse the data and write back out to…
flybonzai · 3,763 · 11 · 38 · 72
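One common pattern for the question above is submitting a spark-submit step to an existing EMR cluster through command-runner.jar. A sketch with a hypothetical cluster id, region, and S3 paths:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",           # hypothetical cluster id
    Steps=[{
        "Name": "parse-server-logs",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",   # lets EMR run spark-submit as a cluster step
            "Args": [
                "spark-submit", "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/parse_logs.py",   # hypothetical job script
            ],
        },
    }],
)
print(response["StepIds"])   # step ids you can poll for completion
```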
17 votes · 1 answer

What are the differences between Apache Spark and Apache Apex?

Apache Apex is an open-source, enterprise-grade unified stream and batch processing platform. It is used in the GE Predix platform for IoT. What are the key differences between these two platforms? Questions: From a data science perspective, how is it…
17 votes · 5 answers

local class incompatible Exception: when running spark standalone from IDE

I have begun to test Spark. I installed Spark on my local machine and ran a local cluster with a single worker. When I tried to execute my job from my IDE, setting the SparkConf as follows: final SparkConf conf = new…
17 votes · 2 answers

How to convert DataFrame to Dataset in Apache Spark in Java?

I can convert a DataFrame to a Dataset in Scala very easily: case class Person(name:String, age:Long) val df = ctx.read.json("/tmp/persons.json") val ds = df.as[Person] ds.printSchema but in the Java version I don't know how to convert a DataFrame to a Dataset?…
Milad Khajavi · 2,769 · 9 · 41 · 66
17 votes · 3 answers

How to set the number of partitions/nodes when importing data into Spark

Problem: I want to import data into Spark EMR from S3 using: data = sqlContext.read.json("s3n://.....") Is there a way I can set the number of nodes that Spark uses to load and process the data? This is an example of how I process the…
pemfir · 365 · 1 · 3 · 10
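A sketch of one common workaround for the question above: the JSON reader does not take a partition count directly, so repartition right after loading (this reuses the sqlContext from the question; the bucket path is hypothetical):

```python
# Load, then explicitly spread the data over a chosen number of partitions/tasks.
data = sqlContext.read.json("s3n://my-bucket/logs/")
data = data.repartition(100)

print(data.rdd.getNumPartitions())   # ~100 partitions for subsequent stages
```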
17 votes · 3 answers

Spark Word2vec vector mathematics

I was looking at the example of Spark site for Word2Vec: val input = sc.textFile("text8").map(line => line.split(" ").toSeq) val word2vec = new Word2Vec() val model = word2vec.fit(input) val synonyms = model.findSynonyms("country name here",…
user3803714 · 5,269 · 10 · 42 · 61
17 votes · 2 answers

Sparksql filtering (selecting with where clause) with multiple conditions

Hi I have the following issue: numeric.registerTempTable("numeric"). All the values that I want to filter on are literal null strings and not N/A or Null values. I tried these three options: numeric_filtered = numeric.filter(numeric['LOW'] !=…
user3803714 · 5,269 · 10 · 42 · 61
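For the question above, the usual fix is to combine Column conditions with & rather than Python's and, wrapping each condition in parentheses. A minimal sketch reusing the numeric DataFrame from the excerpt (the column names are hypothetical):

```python
# Each comparison produces a Column; combine them with & / | and parenthesize each one.
numeric_filtered = numeric.filter(
    (numeric["LOW"] != "null") &
    (numeric["HIGH"] != "null") &
    (numeric["NORMAL"] != "null")
)
numeric_filtered.show()
```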
17 votes · 5 answers

How can I efficiently read multiple json files into a Dataframe or JavaRDD?

I can use the following code to read a single json file but I need to read multiple json files and merge them into one Dataframe. How can I do this? DataFrame jsondf = sqlContext.read().json("/home/spark/articles/article.json"); Or is there a way…
Abu Sulaiman · 1,477 · 2 · 18 · 32
17 votes · 3 answers

Manually calling spark's garbage collection from pyspark

I have been running a workflow on some 3 Million records x 15 columns all strings on my 4 cores 16GB machine using pyspark 1.5 in local mode. I have noticed that if I run the same workflow again without first restarting spark, memory runs out and I…
architectonic · 2,871 · 2 · 21 · 35
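One approach sometimes suggested for the question above is to unpersist cached data, collect garbage on the Python side, and then ask the driver JVM to collect as well. Note that sc._jvm is an internal py4j handle and System.gc() is only a hint, so treat this as a sketch rather than a supported API:

```python
import gc

# Release cached data you no longer need before re-running the workflow.
df.unpersist()

gc.collect()           # Python-side garbage collection
sc._jvm.System.gc()    # request a GC in the driver JVM (executors manage their own heaps)
```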
17 votes · 3 answers

Replace null values in Spark DataFrame

I saw a solution here, but when I tried it, it doesn't work for me. First I import a cars.csv file: val df = sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") …
Gavin Niu · 1,315 · 4 · 20 · 27
17 votes · 3 answers

converting pandas dataframes to spark dataframe in zeppelin

I am new to Zeppelin. I have a use case wherein I have a pandas dataframe. I need to visualize the collections using Zeppelin's built-in charts, and I do not have a clear approach here. My understanding is that with Zeppelin we can visualize the data if it is a…
Bala · 675 · 2 · 7 · 23
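For the question above, the usual route is to convert the pandas DataFrame into a Spark DataFrame and hand it to Zeppelin's charting, either through z.show or a %sql paragraph. A sketch assuming the sqlContext and z objects provided by Zeppelin's %pyspark interpreter, with made-up data:

```python
import pandas as pd

# An example pandas DataFrame (hypothetical data).
pdf = pd.DataFrame({"city": ["Paris", "Tokyo"], "population": [2.1, 13.9]})

# Convert to a Spark DataFrame so Zeppelin's built-in charts can render it.
sdf = sqlContext.createDataFrame(pdf)

z.show(sdf)                         # display with Zeppelin's table/chart widget
sdf.registerTempTable("cities")     # or query it from a %sql paragraph
```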