Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine, written in Scala, that provides a unified API and distributed datasets for both batch and stream processing. Typical use cases include machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
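
As a small illustration of that pattern (the SparkContext sc and the file path below are assumed purely for the example), an RDD can be cached so that repeated queries are served from memory:

    // Load a text file, cache the filtered RDD, then query it several times.
    val lines  = sc.textFile("hdfs:///logs/app.log")                  // illustrative path
    val errors = lines.filter(_.contains("ERROR")).cache()            // kept in memory after the first action

    val totalErrors   = errors.count()                                // materializes and caches the RDD
    val timeoutErrors = errors.filter(_.contains("timeout")).count()  // answered from the cached data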

Spark can be used to tackle stream processing problems with many approaches (micro-batch processing, continuous processing since Spark 2.3, running SQL queries over streams, windowing on data and on streams, applying ML libraries to learn from streamed data, and so on).
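
For example, a minimal Structured Streaming word count (the socket source and its host/port are placeholders chosen only for illustration) processes the stream in micro-batches:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()
    import spark.implicits._

    // Read lines from a socket source; host and port are placeholders.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split each line into words and keep a running count per word.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Print the updated counts to the console after every micro-batch.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()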

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.
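
For instance, bin/spark-shell starts a Scala REPL with a SparkContext already bound to sc, so an exploratory query takes only a couple of lines (the file path is illustrative):

    // Inside spark-shell: sc is predefined.
    val requests = sc.textFile("hdfs:///logs/access.log")   // illustrative path
    requests.filter(_.contains(" 404 ")).count()            // count 404 responses interactively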

Spark runs on Hadoop, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
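
Whatever the storage system, the read API looks the same; as a sketch (the paths and the availability of an S3 connector are assumptions, not working settings), a single session can pull from several sources:

    // The unified read API selects the storage system through the URI scheme.
    val logs   = spark.read.textFile("hdfs:///data/logs/")        // HDFS
    val events = spark.read.json("s3a://my-bucket/events/")       // S3, assuming the hadoop-aws connector is on the classpath
    val users  = spark.read.parquet("/mnt/shared/users.parquet")  // local or mounted file system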

When asking Spark related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior can differ between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.
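
In practice a reproducible example means small hard-coded input data, the exact code you ran, and the output you got versus the output you expected; a sketch of that shape (the column names and values are made up):

    // Self-contained input instead of a reference to private data.
    import spark.implicits._
    val df = Seq(("a", 1), ("b", 2), ("b", 3)).toDF("key", "value")

    // The exact transformation being asked about.
    df.groupBy("key").sum("value").show()
    // Observed/expected output (row order may vary):
    // +---+----------+
    // |key|sum(value)|
    // +---+----------+
    // |  a|         1|
    // |  b|         5|
    // +---+----------+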

Latest version

Release Notes for Stable Releases

Apache Spark GitHub Repository

Recommended reference sources:

81095 questions
17 votes, 2 answers

Slow Performance with Apache Spark Gradient Boosted Tree training runs

I'm experimenting with Gradient Boosted Trees learning algorithm from ML library of Spark 1.4. I'm solving a binary classification problem where my input is ~50,000 samples and ~500,000 features. My goal is to output the definition of the resulting…
17 votes, 1 answer

How to get word details from TF Vector RDD in Spark ML Lib?

I have created Term Frequency using HashingTF in Spark. I have got the term frequencies using tf.transform for each word. But the results are showing in this format. [,
Srini
17 votes, 3 answers

Spark Launcher waiting for job completion infinitely

I am trying to submit a JAR with Spark job into the YARN cluster from Java code. I am using SparkLauncher to submit SparkPi example: Process spark = new SparkLauncher() …
TomaszGuzialek
17 votes, 1 answer

Usage of spark DataFrame "as" method

I am looking at spark.sql.DataFrame documentation. There is def as(alias: String): DataFrame Returns a new DataFrame with an alias set. Since 1.3.0 What is the purpose of this method? How is it used? Can there be an example? I have…
Prikso NAI
17 votes, 3 answers

Automatically including jars to PySpark classpath

I'm trying to automatically include jars to my PySpark classpath. Right now I can type the following command and it works: $ pyspark --jars /path/to/my.jar I'd like to have that jar included by default so that I can only type pyspark and also use…
Kamil Sindi
17 votes, 3 answers

Spark off heap memory leak on Yarn with Kafka direct stream

I am running spark streaming 1.4.0 on Yarn (Apache distribution 2.6.0) with java 1.8.0_45 and also Kafka direct stream. I am also using spark with scala 2.11 support. The issue I am seeing is that both driver and executor containers are gradually…
17 votes, 1 answer

How do you perform basic joins of two RDD tables in Spark using Python?

How would you perform basic joins in Spark using python? In R you could use merge() to do this. What is the syntax using python on spark for: Inner Join Left Outer Join Cross Join With two tables (RDD) with a single column in each that has a…
invoketheshell
17 votes, 2 answers

How to use spark Java API to read binary file stream from HDFS?

I am writing a component which needs to get the new binary file in a specific HDFS path, so that I can do some online learning based on this data. So, I want to read binary file created by Flume from HDFS in stream. I found several functions…
Ying Tan
17 votes, 2 answers

Apache Spark: get elements of Row by name

In a DataFrame object in Apache Spark (I'm using the Scala interface), if I'm iterating over its Row objects, is there any way to extract values by name? I can see how to do some really awkward stuff: def foo(r: Row) = { val ix = (0 until…
Ken Williams
17 votes, 2 answers

Difference between org.apache.spark.ml.classification and org.apache.spark.mllib.classification

I'm writing a spark application and would like to use algorithms in MLlib. In the API doc I found two different classes for the same algorithm. For example, there is one LogisticRegression in org.apache.spark.ml.classification also a…
ailzhang
17 votes, 1 answer

How to use constant value in UDF of Spark SQL(DataFrame)

I have a dataframe which includes timestamp. To aggregate by time(minute, hour, or day), I have tried as: val toSegment = udf((timestamp: String) => { val asLong = timestamp.toLong asLong - asLong % 3600000 // period = 1 hour }) val df:…
emesday
17 votes, 5 answers

PySpark & MLLib: Random Forest Feature Importances

I'm trying to extract the feature importances of a random forest object I have trained using PySpark. However, I do not see an example of doing this anywhere in the documentation, nor is it a method of RandomForestModel. How can I extract feature…
Bryan
17 votes, 4 answers

How to get the number of elements in partition?

Is there any way to get the number of elements in a spark RDD partition, given the partition ID? Without scanning the entire partition. Something like this: Rdd.partitions().get(index).size() Except I don't see such an API for spark. Any ideas?…
Geo
17 votes, 1 answer

How does Apache Spark know about HDFS data nodes?

Imagine I do some Spark operations on a file hosted in HDFS. Something like this: var file = sc.textFile("hdfs://...") val items = file.map(_.split('\t')) ... Because in the Hadoop world the code should go where the data is, right? So my question…
Frizz
17 votes, 7 answers

java.lang.NoClassDefFoundError: org/apache/spark/streaming/twitter/TwitterUtils$ while running TwitterPopularTags

I am a beginner in Spark streaming and Scala. For a project requirement I was trying to run TwitterPopularTags example present in github. As SBT assembly was not working for me and I was not familiar with SBT I am trying to use Maven for building.…
vpv