Questions tagged [spark3]

To be used for Apache Spark 3.x

This tag is for everything related to Apache Spark 3.0.0 and higher.

It is kept separate from the apache-spark tag because Spark 3.x introduces breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
1
vote
0 answers

Can Spark 3.1 push metrics to Prometheus? Is there a handler?

I am investigating whether Spark 3.1 and Prometheus have push mechanisms between them. I know it's possible to pull, but I'd like to send the metrics from Spark to Prometheus.
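Out of the box, Spark 3 only exposes metrics for Prometheus to scrape (pull): the PrometheusServlet sink added in 3.0, plus spark.ui.prometheus.enabled=true for executor metrics. Pushing would require a custom sink implementation or an intermediary such as the Prometheus Pushgateway, which Spark does not ship. A minimal sketch of the pull-side setup, assuming the stock conf/metrics.properties mechanism:

```properties
# conf/metrics.properties — expose metrics in Prometheus format for scraping
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

The driver then serves metrics under its UI port (e.g. 4040) at /metrics/prometheus for Prometheus to pull.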
1
vote
0 answers

Beam spark3-runner conflict with scala version

When trying to use Beam with Spark 3.1.2 we are running into this issue: InvalidClassException: scala.collection.mutable.WrappedArray. As explained here: https://www.mail-archive.com/issues@spark.apache.org/msg297820.html it's an incompatibility…
syronanm
  • 11
  • 2
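The incompatibility described in that thread is typically resolved by forcing a single scala-library version across Spark and the Beam spark3 runner, so that serialized classes like WrappedArray agree. A hedged Maven sketch (the 2.12.x patch version is illustrative; it should match the one your Spark distribution was built with):

```xml
<!-- Pin scala-library so Spark 3.1.x and the Beam spark3 runner agree on the
     Scala runtime; pick the patch release your Spark build actually uses. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.12.10</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```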
1
vote
1 answer

Spark 3 - Populate value with value from previous rows (lookup)

I am new to Spark. I have two dataframes, events and players. The events dataframe consists of columns event_id | player_id | match_id | impact_score, and the players dataframe consists of columns player_id | player_name | nationality. I am merging the two datasets by…
Salva
  • 312
  • 2
  • 11
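Questions like this are usually answered with a window function: F.last(col, ignorenulls=True) over a window partitioned by player and ordered by event carries the last known value forward. A plain-Python sketch of that fill logic (the names here are illustrative, not taken from the question; this only shows the semantics):

```python
# Plain-Python sketch of the "carry the last known value forward" logic that
# Spark expresses as F.last(col, ignorenulls=True).over(window).
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled

print(forward_fill([None, 10, None, None, 7, None]))  # [None, 10, 10, 10, 7, 7]
```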
1
vote
2 answers

Start of the week on Monday in Spark

This is my dataset: from pyspark.sql import SparkSession, functions as F spark = SparkSession.builder.getOrCreate() df = spark.createDataFrame([('2021-02-07',),('2021-02-08',)], ['date']) \ .select( F.col('date').cast('date'), …
ZygD
  • 22,092
  • 39
  • 79
  • 102
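In Spark 3 the usual answer is F.date_trunc('week', F.col('date')), which truncates a date to the Monday of its week. A plain-Python sketch of the same Monday-based truncation, handy for checking expected values:

```python
# Monday-based week truncation, mirroring Spark's date_trunc('week', ...).
from datetime import date, timedelta

def week_start_monday(d: date) -> date:
    """Return the Monday on or before d (date.weekday(): Monday == 0)."""
    return d - timedelta(days=d.weekday())

print(week_start_monday(date(2021, 2, 7)))  # a Sunday -> 2021-02-01
print(week_start_monday(date(2021, 2, 8)))  # a Monday -> 2021-02-08
```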
1
vote
0 answers

Can we set up both Spark 2.4 and Spark 3.0 on a single system?

I have a Spark 2.4 installation on my Windows machine. This is required as my production environment uses Spark 2.4. Now I want to test Spark 3.0 features as well. Can I install the Spark 3.0 binaries on the same Windows machine without disturbing the Spark 2.4 installation…
HimanshuSPaul
  • 278
  • 1
  • 4
  • 19
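Side-by-side installs generally work because nothing is shared beyond environment variables: unpack both distributions and point SPARK_HOME (and PATH) at the one you want per shell. A POSIX-shell sketch with illustrative paths (on plain Windows cmd the same idea uses set / setx):

```shell
# Two unpacked Spark distributions living side by side (paths illustrative).
SPARK24_HOME="/c/spark/spark-2.4.8-bin-hadoop2.7"
SPARK3_HOME="/c/spark/spark-3.0.3-bin-hadoop2.7"

# Select Spark 3 for this shell only; other shells keep using Spark 2.4.
use_spark3() {
  export SPARK_HOME="$SPARK3_HOME"
  export PATH="$SPARK_HOME/bin:$PATH"
}

use_spark3
echo "$SPARK_HOME"
```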
1
vote
0 answers

Why is AQE not shown?

My code is like sql = ''' SELECT ... FROM a LEFT JOIN b ON ... LEFT JOIN c ON ... LEFT JOIN d ON ... ''' df = spark.sql(sql) (df .repartition('col') .write .format('parquet') .mode('overwrite') .partitionBy('col') .option(...) …
Brad
  • 11
  • 1
  • 2
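A common reason AQE does not show up in plans on Spark 3.0/3.1 is simply that it is off by default there (it became the default only in Spark 3.2). A minimal configuration sketch; note also that an explicit repartition before the write fixes the partitioning AQE might otherwise have tuned:

```properties
# Adaptive Query Execution is disabled by default before Spark 3.2
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.skewJoin.enabled=true
```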
1
vote
0 answers

Phoenix Driver ClassNotFound in Spark3 Streaming

I am migrating an existing Spark Streaming application from Spark 2.3 to Spark 3.1.1. I have updated the below-mentioned Spark dependencies: org.apache.spark
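A ClassNotFound for the Phoenix driver after a Spark major-version migration usually means the Phoenix client jar is no longer on the driver/executor classpath, or an old phoenix-spark artifact built against Spark 2 is still being pulled in. A spark-submit sketch with illustrative paths (use the Phoenix client/connector jar built for Spark 3 that matches your cluster):

```shell
spark-submit \
  --jars /path/to/phoenix-client.jar \
  --conf spark.driver.extraClassPath=/path/to/phoenix-client.jar \
  --conf spark.executor.extraClassPath=/path/to/phoenix-client.jar \
  your-streaming-app.jar
```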
1
vote
1 answer

How to access Spark DataFrame data in GPU from ML Libraries such as PyTorch or Tensorflow

Currently I am studying the usage of Apache Spark 3.0 with Rapids GPU Acceleration. In the official spark-rapids docs I came across this page which states: There are cases where you may want to get access to the raw data on the GPU, preferably…
deepNdope
  • 179
  • 3
  • 14
1
vote
1 answer

AnalysisException when loading a PipelineModel with Spark 3

I am upgrading my Spark version from 2.4.5 to 3.0.1 and I can no longer load the PipelineModel objects that use a "DecisionTreeClassifier" stage. In my code I load several PipelineModels, all with stages ["CountVectorizer_[uid]",…
Be Chiller Too
  • 2,502
  • 2
  • 16
  • 42
1
vote
2 answers

Spark 3.0 and Cassandra Spark / Python Connectors: Table is not being created prior to write

I'm currently trying to upgrade my application to Spark 3.0.1. For table creation, I drop and create a table using cassandra-driver, the Python-Cassandra connector. Then I write a dataframe into the table using the spark-cassandra connector. There…
L. Chu
  • 123
  • 3
  • 14
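With the Spark Cassandra Connector 3.x, one way to avoid the separate cassandra-driver step is to register Cassandra as a Spark SQL catalog and create the table from Spark itself, so creation and write go through the same connector and session. A sketch (the catalog name and connection host are illustrative):

```properties
# Spark configuration: expose Cassandra as catalog "cass"
spark.sql.catalog.cass=com.datastax.spark.connector.datasource.CassandraCatalog
spark.cassandra.connection.host=127.0.0.1
```

Tables can then be created with Spark SQL before the dataframe write, e.g. CREATE TABLE cass.my_ks.my_table (...) USING cassandra PARTITIONED BY (...).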
1
vote
2 answers

How to save spark dataset in encrypted format?

I am saving my Spark dataset as a parquet file on my local machine. I would like to know if there are any ways I could encrypt the data using some encryption algorithm. The code I am using to save my data as a parquet file looks something like…
Somesh Dhal
  • 336
  • 2
  • 15
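Plain Spark 3.0 has no built-in dataset encryption, so the usual options are filesystem-level encryption (e.g. HDFS encryption zones) or Parquet modular encryption, which arrived with parquet-mr 1.12 and is wired up in Spark 3.2+. A configuration sketch from that newer setup (the key name and the demo InMemoryKMS are illustrative; production needs a real KMS client):

```properties
spark.hadoop.parquet.crypto.factory.class=org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
spark.hadoop.parquet.encryption.kms.client.class=org.apache.parquet.crypto.keytools.mocks.InMemoryKMS
spark.hadoop.parquet.encryption.key.list=footerKey:AAECAwQFBgcICQoLDA0ODw==
```

Per write, the footer and column keys are then selected with .option("parquet.encryption.footer.key", "footerKey") and .option("parquet.encryption.column.keys", "footerKey:colA,colB").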
1
vote
1 answer

org.apache.spark.shuffle.FetchFailedException: Connection from server1/xxx.xxx.x.xxx:7337 closed

I have upgraded Spark and am trying to run an already-present Spark Streaming application (it accepts file names via a stream, which are then read from HDFS, transformed using RDD and DataFrame operations, and finally the analysed data set is persisted in…
1
vote
1 answer

Spark binary data source vs sc.binaryFiles

Spark 3.0 enables reading binary data using a new data source: val df = spark.read.format("binaryFile").load("/path/to/data") Using previous Spark versions you could load data using: val rdd = sc.binaryFiles("/path/to/data") Beyond having the…
Yosi Dahari
  • 6,794
  • 5
  • 24
  • 44
1
vote
1 answer

Bootstrapping Spark 3.0.0 on EMR cluster

A few days back Spark 3.0.0 was launched. I would like to use some of these functionalities. The default version for Spark on an EMR cluster now is Spark 2.4.5. I specifically make use of PySpark. My question is: how can I install/bootstrap Spark…
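By now the simpler route is to pick an EMR release that already ships Spark 3 rather than bootstrapping it yourself: emr-6.1.0 was the first release with Spark 3.0.0 (check the EMR release notes for the current version mapping). An AWS CLI sketch with illustrative instance settings:

```shell
aws emr create-cluster \
  --release-label emr-6.1.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles
```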
0
votes
0 answers

Why are Spark 3 dynamic partition writes to Hive slow?

Question 1: I have a table with a small amount of data, but the daily writes create a lot of dynamic partitions. The original Spark 2 write completed in only 2 minutes, but after upgrading to Spark 3 it takes 10 minutes to write completely.…
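Without the full job it is hard to pin down, but after a Spark 2 to 3 upgrade the settings usually checked first for slow dynamic-partition writes are the overwrite mode and the Hive dynamic-partition knobs (the values below are a sketch, not a diagnosis):

```properties
spark.sql.sources.partitionOverwriteMode=dynamic
spark.sql.hive.convertMetastoreParquet=true
hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
```

Comparing the physical plans and commit times between the two versions for the same insert is usually the quickest way to localize the regression.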