Questions tagged [apache-spark]

Apache Spark is an open source distributed data processing engine written in Scala, providing a unified API and distributed data sets to users for both batch and streaming processing. Use cases for Apache Spark are often related to machine/deep learning and graph processing.

From https://spark.apache.org/:

Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

To run programs faster, Spark offers a general execution model based on the RDD data abstraction that can help optimize arbitrarily long operator graphs, and supports in-memory computing, which lets it query data faster than disk-based engines like Hadoop MapReduce.

Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce.

Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to interactive as well as iterative algorithms in machine learning or graph computing.
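
As a concrete illustration of this pattern, here is a minimal PySpark sketch (the file path and column names are hypothetical): the dataset is cached after the first action, so later queries are served from memory.

```python
# Minimal sketch: cache a dataset, then query it repeatedly from memory.
# "events.csv" and the "status" column are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # materialized in cluster memory on the first action

# Subsequent queries reuse the in-memory data instead of re-reading disk.
df.filter(df["status"] == "error").count()
df.groupBy("status").count().show()
```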

Spark can be used to tackle stream processing problems with many approaches: micro-batch processing, continuous processing (since 2.3), running SQL queries, windowing on data and on streams, running ML libraries to learn from streamed data, and so on.
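
For the micro-batch approach specifically, here is a minimal Structured Streaming sketch (the socket source and the running line count are illustrative only):

```python
# Illustrative micro-batch job: count lines from a socket source and
# print running totals to the console after each batch.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

counts = lines.groupBy("value").count()  # running count per distinct line

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```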

To make programming faster, Spark provides clean, concise APIs in Scala, Java, and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets.

Spark runs on Hadoop YARN, Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.

When asking Spark-related questions, please don't forget to provide a reproducible example (AKA MVCE) and, when applicable, specify the Spark version you're using (since behavior often differs between versions). You can refer to How to make good reproducible Apache Spark examples for general guidelines and suggestions.

Recommended reference sources:

  • Latest version
  • Release Notes for Stable Releases
  • Apache Spark GitHub Repository

81095 questions
16 votes · 4 answers

How to select last row and also how to access PySpark dataframe by index?

From a PySpark SQL dataframe like name age city / abc 20 A / def 30 B, how do I get the last row? (With df.limit(1) I can get the first row of the dataframe into a new dataframe.) And how can I access dataframe rows by index, like row no. 12 or 200? In…
Satya • 5,470 • 17 • 47 • 72
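
A hedged sketch of two common workarounds: Spark DataFrames have no intrinsic row order, so "last" is only meaningful relative to an explicit sort key (the "age" column here is hypothetical), and index-style access can be emulated with zipWithIndex.

```python
from pyspark.sql import functions as F

# "Last" row relative to an explicit sort key (hypothetical "age" column):
last_row = df.orderBy(F.col("age").desc()).limit(1).collect()

# Index-style access: append a positional index via zipWithIndex, then filter.
indexed = (df.rdd.zipWithIndex()
           .map(lambda pair: tuple(pair[0]) + (pair[1],))  # (row values..., idx)
           .toDF(df.columns + ["idx"]))
row_200 = indexed.filter(indexed.idx == 200).collect()
```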
16 votes · 1 answer

Comparison between fasttext and LDA

Last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation is another way to do topic modeling. My question is: has anyone done a comparison of the pros and cons of these two? I…
Nabs • 553 • 5 • 17
16 votes · 1 answer

How to convert ArrayType to DenseVector in PySpark DataFrame?

I'm getting the following error trying to build a ML Pipeline: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually…
Evan Zamir • 8,059 • 14 • 56 • 83
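
The usual fix reported for this error is to convert the array column into ml Vectors with a UDF; a hedged sketch, assuming a hypothetical features_array column of doubles:

```python
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

# UDF that turns an array<double> column into the DenseVector type the
# ML Pipeline expects; "features_array" is a hypothetical column name.
to_vector = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())

df = df.withColumn("features", to_vector(F.col("features_array")))
```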
16 votes · 1 answer

How to balance my data across the partitions?

Edit: The answer helps, but I described my solution in: memoryOverhead issue in Spark. I have an RDD with 202092 partitions, which reads a dataset created by others. I can manually see that the data is not balanced across the partitions, for…
gsamaras • 71,951 • 46 • 188 • 305
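
A hedged sketch of the standard remedy: repartition() performs a full shuffle that spreads rows roughly evenly, while coalesce() avoids a shuffle but only merges partitions; the target count of 2000 is illustrative.

```python
# Full shuffle to a smaller, evenly filled set of partitions.
balanced = rdd.repartition(2000)

# Inspect the resulting distribution of records per partition.
sizes = balanced.glom().map(len).collect()
print(min(sizes), max(sizes))
```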
16 votes · 1 answer

Spark 2.0: Relative path in absolute URI (spark-warehouse)

I'm trying to migrate from Spark 1.6.1 to Spark 2.0.0 and I am getting a weird error when trying to read a csv file into SparkSQL. Previously, when I would read a file from local disk in pyspark I would do: Spark 1.6 df = sqlContext.read \ …
Disco4Ever • 1,043 • 2 • 11 • 16
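
The commonly reported fix is to give spark.sql.warehouse.dir an explicit absolute URI so Spark 2.0 does not construct an invalid default; a hedged sketch with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-read")
         # An explicit absolute URI avoids "Relative path in absolute URI".
         .config("spark.sql.warehouse.dir", "file:///tmp/spark-warehouse")
         .getOrCreate())

df = spark.read.csv("file:///path/to/data.csv", header=True)
```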
16 votes · 3 answers

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema: |-- ts: long (nullable = true) |-- id: integer (nullable = true) |-- managers: array (nullable = true) | |-- element: string (containsNull = true) |-- projects: array…
Neel • 9,913 • 16 • 52 • 74
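
One common cause of this error is that the two frames' columns differ in order or type, since union resolves columns positionally; a hedged PySpark sketch of the usual alignment step (Spark 1.x still uses unionAll):

```python
# Project both frames onto the same column list before the union,
# since union matches columns by position, not by name.
cols = df1.columns
aligned = df1.select(*cols).unionAll(df2.select(*cols))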
16 votes · 1 answer

Spark, add new Column with the same value in Scala

I have a problem with the withColumn function in a Spark-Scala environment. I would like to add a new column to my DataFrame like this: +---+----+---+ | A| B| C| +---+----+---+ | 4|blah| 2| | 2| | 3| | 56| foo| 3| |100|null| …
Alessandro • 337 • 1 • 5 • 18
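
The standard answer is to wrap the constant in lit(), since withColumn requires a Column; a hedged PySpark sketch (the Scala equivalent uses org.apache.spark.sql.functions.lit the same way):

```python
from pyspark.sql import functions as F

# lit() wraps a constant into a Column; every row gets the same value.
df = df.withColumn("D", F.lit(1))
```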
16 votes · 1 answer

Spark: Read an inputStream instead of File

I'm using SparkSQL in a Java application to do some processing on CSV files using Databricks for parsing. The data I am processing comes from different sources (Remote URL, local file, Google Cloud Storage), and I'm in the habit of turning…
Nate Vaughan • 3,471 • 4 • 29 • 47
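
A hedged sketch of one workaround: drain the stream in the driver, parallelize the lines, and let Spark parse them as CSV. The helper name is hypothetical, it assumes Spark 2.2+ (where DataFrameReader.csv accepts an RDD of strings), and it only suits inputs small enough to hold in driver memory.

```python
import io

def df_from_stream(spark, stream):
    """Hypothetical helper: parse a binary file-like object as CSV."""
    text = io.TextIOWrapper(stream, encoding="utf-8")
    lines = text.read().splitlines()          # driver-side materialization
    rdd = spark.sparkContext.parallelize(lines)
    return spark.read.csv(rdd, header=True)   # RDD[str] input: Spark 2.2+
```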
16 votes · 1 answer

How to connect HBase and Spark using Python?

I have an embarrassingly parallel task for which I use Spark to distribute the computations. These computations are in Python, and I use PySpark to read and preprocess the data. The input data to my task is stored in HBase. Unfortunately, I've yet…
Def_Os • 5,301 • 5 • 34 • 63
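
One classic approach reads HBase through the Hadoop input format; a hedged sketch assuming an existing SparkContext sc, the HBase client jars plus the Spark examples jar on the classpath, and hypothetical host and table names:

```python
# Read an HBase table as an RDD of (row key, result) string pairs.
conf = {
    "hbase.zookeeper.quorum": "zk-host",       # hypothetical ZooKeeper host
    "hbase.mapreduce.inputtable": "my_table",  # hypothetical table name
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf,
)
```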
16 votes · 4 answers

PySpark computing correlation

I want to use the pyspark.mllib.stat.Statistics.corr function to compute the correlation between two columns of a pyspark.sql.dataframe.DataFrame object. The corr function expects an RDD of Vectors objects. How do I translate a column of df['some_name']…
VJune • 1,195 • 5 • 16 • 26
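
Two hedged options, with hypothetical column names: DataFrame.stat.corr avoids the Vector plumbing entirely, while Statistics.corr also accepts two RDDs of floats rather than an RDD of Vectors.

```python
from pyspark.mllib.stat import Statistics

# Straight from the DataFrame (Pearson by default):
r1 = df.stat.corr("col_a", "col_b")

# Or via MLlib, using two RDDs of floats instead of an RDD of Vectors:
xs = df.select("col_a").rdd.map(lambda row: float(row[0]))
ys = df.select("col_b").rdd.map(lambda row: float(row[0]))
r2 = Statistics.corr(xs, ys, method="pearson")
```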
16 votes · 3 answers

Spark SQL broadcast hash join

I'm trying to perform a broadcast hash join on dataframes using SparkSQL as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html In that example,…
user1759848 • 223 • 1 • 3 • 9
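
A hedged sketch of the DataFrame API route: the broadcast() hint marks the small side, steering the planner toward a broadcast hash join when that side fits in memory (large_df, small_df, and the join key are hypothetical):

```python
from pyspark.sql.functions import broadcast

# Mark the small side so the planner favors a broadcast hash join.
joined = large_df.join(broadcast(small_df), on="id", how="inner")
joined.explain()  # the physical plan should show BroadcastHashJoin
```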
16 votes · 1 answer

Spark SQL filter multiple fields

What is the correct syntax for filtering on multiple columns in the Scala API? If I want to do something like this: dataFrame.filter($"col01" === "something" && $"col02" === "something else") or dataFrame.filter($"col01" === "something" || $"col02"…
gstvolvr • 650 • 1 • 8 • 17
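
Both Scala forms in the excerpt are syntactically valid; for reference, a hedged PySpark analogue, where each condition must be parenthesized and combined with & or |:

```python
from pyspark.sql import functions as F

both = df.filter((F.col("col01") == "something") &
                 (F.col("col02") == "something else"))
either = df.filter((F.col("col01") == "something") |
                   (F.col("col02") == "something else"))
```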
16 votes · 5 answers

How to filter a Spark dataframe by a boolean column?

I created a dataframe that has the following schema: In [43]: yelp_df.printSchema() root |-- business_id: string (nullable = true) |-- cool: integer (nullable = true) |-- date: string (nullable = true) |-- funny: integer (nullable = true) |--…
Nasreddin • 1,509 • 9 • 31 • 36
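
A hedged sketch: a boolean column is itself a valid filter predicate, so no explicit comparison is needed (is_open is a hypothetical column name):

```python
from pyspark.sql import functions as F

open_rows = df.filter(F.col("is_open"))      # keep rows where it is True
closed_rows = df.filter(~F.col("is_open"))   # negation keeps the False rows
```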
16 votes · 1 answer

Read an unsupported mix of union types from an Avro file in Apache Spark

I'm trying to switch from reading CSV flat files to Avro files on Spark. Following https://github.com/databricks/spark-avro, I use: import com.databricks.spark.avro._ val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df =…
Zahiro Mor • 1,708 • 1 • 16 • 30
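
For context, spark-avro only maps a limited set of union types (essentially union(something, null) and a couple of numeric promotions); other unions fail with the error in the title, and the usual workaround is to rewrite the schema before Spark reads the files. A hedged PySpark sketch of the basic read, with a hypothetical path:

```python
# Basic Avro read with spark-avro (Spark 1.x-era API, hypothetical path).
df = (sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/path/to/episodes.avro"))
```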
16 votes · 2 answers

Spark Sql: TypeError("StructType can not accept object in type %s" % type(obj))

I am currently pulling data from SQL Server using PyODBC and trying to insert it into a table in Hive in a near-real-time (NRT) manner. I get a single row from the source, convert it into List[String], and create the schema programmatically, but while…
ThirdEye • 433 • 1 • 5 • 13
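
This TypeError typically means a record was passed as a bare string rather than a tuple/list/Row matching the schema; a hedged sketch of the expected shape:

```python
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
])

row = ("Ada", "Lovelace")                  # one record as a tuple, not a string
df = spark.createDataFrame([row], schema)  # note the enclosing list of records
```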