Questions tagged [spark3]

To be used for Apache Spark 3.x

Use this tag for everything related to Apache Spark 3.0.0 and higher.

It is kept separate from the apache-spark tag because Spark 3.x introduces breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
2 votes · 1 answer

Spark 3 KryoSerializer issue - Unable to find class: org.apache.spark.util.collection.OpenHashMap

I am upgrading a Spark 2.4 project to Spark 3.x. We are hitting a snag with some existing Spark-ml code: var stringIndexers = Array[StringIndexer]() for (featureColumn <- FEATURE_COLS) { stringIndexers = stringIndexers :+ new…
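The truncated code builds one StringIndexer per feature column, Spark 2 style. As a point of reference, here is a minimal sketch of the multi-column form StringIndexer gained in Spark 3.0, which avoids accumulating an array of indexers (FEATURE_COLS here is a hypothetical list of column names):

    // Spark 3.0+: one StringIndexer can index several columns at once.
    import org.apache.spark.ml.feature.StringIndexer

    val FEATURE_COLS = Seq("make", "model") // hypothetical feature columns
    val indexer = new StringIndexer()
      .setInputCols(FEATURE_COLS.toArray)
      .setOutputCols(FEATURE_COLS.map(_ + "_idx").toArray)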
2 votes · 1 answer

Apache Livy 0.7.0 fails to create an interactive session

While creating a new session using Apache Livy 0.7.0, I am getting the error below. I am also using a Zeppelin notebook (Livy interpreter) to create the session. Using Scala version 2.12.10, Java HotSpot(TM) 64-Bit Server VM 11.0.11, Spark 3.0.2, Zeppelin…
Sushil Behera
2 votes · 0 answers

Spark SQL throws AssertionError: assertion failed: Found duplicate rewrite attributes (Spark 3.0.2)

Executing the above in Spark 3.0.2 produces Exception in thread "main" java.lang.AssertionError: assertion failed: Found duplicate rewrite attributes. It was working in Spark 2.4.3. SELECT COALESCE(view_1_alias.name, view_2.name) AS name, …
KilyenOrs
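The asker's full query is not shown, but one common mitigation for "duplicate rewrite attributes" errors is to give each side of the join distinct aliases so the optimizer no longer sees two attributes with the same name. A minimal sketch of that pattern (views and column names are stand-ins, not the asker's):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.coalesce

    val spark = SparkSession.builder().appName("dedup-attrs").master("local[*]").getOrCreate()
    import spark.implicits._

    // Two sources that both expose a `name` column, aliased before joining.
    val view1 = Seq((1, "a")).toDF("id", "name").alias("v1")
    val view2 = Seq((1, "b")).toDF("id", "name").alias("v2")

    view1.join(view2, $"v1.id" === $"v2.id")
      .select(coalesce($"v1.name", $"v2.name").as("name"))
      .show()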
2 votes · 0 answers

In Spark 3.0.1, using DataFrame.foreachPartition: value foreach is not a member of Object

In IDEA, with Spark 3.0.1, Scala 2.12.12, Java 1.8.0_212. My code: val df = spark.range(10); df.foreachPartition(rows => { rows.foreach(.......) }). Error: value foreach is not a member of Object rows.foreach(row => {. If I use Spark 2.4.7 and…
ZhiYing
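A likely explanation: in Spark 3, Dataset.foreachPartition has both a Scala overload (Iterator[T] => Unit) and a Java overload (ForeachPartitionFunction[T]), and under Scala 2.12 the lambda parameter can be inferred as Object, producing exactly this error. A minimal sketch of the usual fix, annotating the parameter type so the Scala overload is chosen:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("fp").master("local[*]").getOrCreate()
    val df = spark.range(10) // Dataset[java.lang.Long]

    // Explicitly typing `rows` resolves the overload ambiguity.
    df.foreachPartition((rows: Iterator[java.lang.Long]) => {
      rows.foreach(row => println(row))
    })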
2 votes · 0 answers

Running on external Spark 3.0.1 cluster from IntelliJ

I have recently upgraded to Spark 3.0.1 from 2.4.6 (and Scala 2.11.12 to Scala 2.12.10). I write and execute applications from IntelliJ IDEA and in the past was able to run with the master set either to local[*] or remotely using spark://xx:7077. My…
TJVR
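The question is cut off, but here is a sketch of the usual checklist when driving a standalone spark:// cluster from the IDE: the driver's Spark and Scala versions must match the cluster exactly, and the compiled application jar has to be shipped to the executors, e.g. via spark.jars (the jar path below is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("from-intellij")
      .master("spark://xx:7077") // master URL from the question
      .config("spark.jars", "target/scala-2.12/myapp.jar") // hypothetical path to the built jar
      .getOrCreate()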
2 votes · 1 answer

Monitoring Spark 3 applications with Prometheus

I have some very basic questions around the pull mechanism for metrics and how Spark 3 applications can be monitored using Prometheus: does the PrometheusServlet sink supported in Spark 3 contain all the metrics since application start time?…
soontobeared
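For context, a minimal sketch of turning on the Spark 3 Prometheus endpoints from code instead of metrics.properties. Because PrometheusServlet is a pull-based scrape target, Prometheus only sees the values current at each scrape; it does not receive history from before the first scrape:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("prom-metrics")
      .master("local[*]")
      // Executor metrics at /metrics/executors/prometheus on the driver UI.
      .config("spark.ui.prometheus.enabled", "true")
      // Driver metrics via the PrometheusServlet sink.
      .config("spark.metrics.conf.*.sink.prometheusServlet.class",
              "org.apache.spark.metrics.sink.PrometheusServlet")
      .config("spark.metrics.conf.*.sink.prometheusServlet.path",
              "/metrics/prometheus")
      .getOrCreate()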
2 votes · 4 answers

PySpark structured Streaming + Kafka Error (Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.sources.v2.StreamWriteSupport )

I am trying to run PySpark Structured Streaming + Kafka. When I run the command Master@MacBook-Pro spark-3.0.0-preview2-bin-hadoop2.7 % bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.5…
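The ClassNotFoundException here usually means the Kafka connector version does not match the Spark build: the command pairs a 3.0.0-preview2 distribution with spark-sql-kafka-0-10_2.12:2.4.5, which still references the org.apache.spark.sql.sources.v2 classes removed in Spark 3. With the artifact version matched to the Spark version, the stream itself is straightforward; a minimal sketch (broker and topic are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-stream").getOrCreate()

    // Requires --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<your Spark version>
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
      .option("subscribe", "events")                       // hypothetical topic
      .load()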
1 vote · 0 answers

How to provide hive metastore information via spark-submit?

Using Spark 3.1, I need to provide the Hive configuration via the spark-submit command (not inside the code). Inside the code (which is not the solution I need), I can do the following, which works fine (able to list databases and select from tables).…
Itération 122442
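A minimal sketch of the in-code variant the asker already has working; the same keys can normally be moved onto the command line unchanged, e.g. spark-submit --conf spark.hadoop.hive.metastore.uris=thrift://host:9083 (host and port here are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-metastore")
      // spark.hadoop.* entries are forwarded into the Hadoop/Hive configuration.
      .config("spark.hadoop.hive.metastore.uris", "thrift://host:9083") // hypothetical URI
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("SHOW DATABASES").show()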
1 vote · 0 answers

Spark AQE drastically reduces number of partitions

I am using Spark 3.2.1 to summarise high-volume data using joins. The query plan shows that one executor was tasked with processing 90 GB of data after the AQEShuffleRead step, as shown below. The shuffle partition count of 900 was also drastically brought…
Gladiator
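A sketch of the AQE settings that govern post-shuffle coalescing; bounding or disabling the coalesce step keeps Spark from collapsing the 900 shuffle partitions down to a handful (the values shown are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("aqe-tuning").master("local[*]").getOrCreate()

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    // Option 1: turn off post-shuffle partition coalescing entirely.
    spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
    // Option 2 (Spark 3.2+): keep coalescing but bound how far partitions shrink.
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
    spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "16m")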
1 vote · 1 answer

Why would finding an aggregate of a partition column in Spark 3 take a very long time?

I'm trying to query MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3: SELECT MIN(dt) FROM table_name. The table is stored in Parquet format in S3, where each dt is a separate folder, so this seems…
RyanCheu
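One workaround, sketched below: read the partition values from the catalog instead of scanning data. Spark 2's metadata-only optimization (spark.sql.optimizer.metadataOnly) could answer MIN(dt) from partition metadata, but it was deprecated in Spark 3, so the aggregate may fall back to a full scan. Table and column names follow the question:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("min-partition").enableHiveSupport().getOrCreate()

    // SHOW PARTITIONS returns strings like "dt=2020-01-01".
    val minDt = spark.sql("SHOW PARTITIONS table_name")
      .collect()
      .map(_.getString(0).stripPrefix("dt="))
      .min

    println(s"earliest partition: $minDt")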
1 vote · 0 answers

Spark custom Aggregator with multiple columns

I have written a Spark UDAF that takes two columns as input (timestamp and value) and calculates a rate of change via least squares over all data points in a given window. It works perfectly fine; the code is below (shortened to the relevant…
Tim Zimmermann
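For the Spark 3 take on this, a minimal sketch of a two-column aggregate using the Aggregator API: the input type is a case class covering both columns, and functions.udaf registers it for DataFrame use. The least-squares math is deliberately elided; only the multi-column wiring is shown, and all names are hypothetical:

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator
    import org.apache.spark.sql.functions.udaf

    case class Point(ts: Long, value: Double)             // two input columns
    case class Sums(n: Long, sumTs: Double, sumV: Double) // aggregation buffer

    object RateAgg extends Aggregator[Point, Sums, Double] {
      def zero: Sums = Sums(0L, 0.0, 0.0)
      def reduce(b: Sums, p: Point): Sums = Sums(b.n + 1, b.sumTs + p.ts, b.sumV + p.value)
      def merge(a: Sums, b: Sums): Sums = Sums(a.n + b.n, a.sumTs + b.sumTs, a.sumV + b.sumV)
      def finish(b: Sums): Double = if (b.n == 0) 0.0 else b.sumV / b.n // placeholder, not least squares
      def bufferEncoder: Encoder[Sums] = Encoders.product[Sums]
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    val rate = udaf(RateAgg) // usage: df.select(rate($"ts", $"value"))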
1 vote · 1 answer

to_date conversion failing in PySpark on Spark 3.0

Knowing about the calendar change in Spark 3.0, I am trying to understand why the cast fails in this particular instance. Spark 3.0 has issues with dates before the year 1582; however, in this example the year is greater than 1582. rdd =…
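Alongside the calendar switch, Spark 3 also replaced SimpleDateFormat with a stricter DateTimeFormatter-based parser, so patterns Spark 2 accepted can fail even for modern dates. A minimal sketch of the usual escape hatch, restoring legacy parsing (the sample date and pattern are assumptions):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.to_date

    val spark = SparkSession.builder().appName("to-date").master("local[*]").getOrCreate()
    import spark.implicits._

    // Fall back to Spark 2.x parsing behaviour.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

    Seq("1986-02-13").toDF("s")
      .select(to_date($"s", "yyyy-MM-dd").as("d"))
      .show()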
1 vote · 0 answers

UDF function fails in Spark 3.3.0

I have an application developed with Scala 2.11 and Spark 2.4 where a UDF is applied to a streaming DataFrame to add a new column. Due to other library requirements, I have moved the application to Scala 2.12 and Spark 3.3, but now the code fails…
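The failing code is not shown, but one frequent cause after a 2.4-to-3.x move is the untyped udf(f: AnyRef, dataType: DataType) overload, which Spark 3 rejects by default. A minimal sketch of the typed form that replaces it (function and column names are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("typed-udf").master("local[*]").getOrCreate()
    import spark.implicits._

    // Typed UDF: input and output types are carried by the Scala function.
    val normalize = udf((s: String) => if (s == null) null else s.trim.toLowerCase)

    Seq("  MiXeD ").toDF("raw")
      .withColumn("clean", normalize($"raw"))
      .show()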
1 vote · 1 answer

Spark: DF.as[Type] fails to compile

I'm trying to run an example from the Spark book Spark: The Definitive Guide. build.sbt: ThisBuild / scalaVersion := "3.2.1" libraryDependencies ++= Seq( ("org.apache.spark" %% "spark-sql" % "3.2.0" %…
Yashwanth
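The likely root cause: Spark's implicit Encoder derivation is built on Scala 2 TypeTags, so under Scala 3 (even with CrossVersion.for3Use2_13) DF.as[SomeCaseClass] finds no Encoder and fails to compile. A sketch of the distinction; primitive encoders can be passed explicitly, while case classes need a Scala 3 encoder-derivation library (for example the community spark-scala3 project):

    import org.apache.spark.sql.{Encoders, SparkSession}

    val spark = SparkSession.builder().appName("scala3-as").master("local[*]").getOrCreate()

    // Works under Scala 3: explicit primitive encoder, no TypeTag derivation needed.
    val ds = spark.range(3).as[Long](Encoders.scalaLong)
    ds.show()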
1 vote · 1 answer

No TypeTag available for a case class using Scala 3 with Spark 3

I have code that runs a Spark job with Scala 3: @main def startDatasetJob(): Unit = val spark = SparkSession.builder() .appName("Datasets") .master("local[*]") .getOrCreate() case class CarRow(Name: String, …
Liusha He
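Two things are likely at play here, sketched below: in Scala 2 this error classically means the case class is declared inside the method, and moving it to the top level fixes it; under Scala 3 no TypeTag can be materialized at all, so an explicit encoder is needed (Kryo is a blunt but workable fallback; a Scala 3 encoder-derivation library is the nicer route):

    import org.apache.spark.sql.{Encoders, SparkSession}

    case class CarRow(Name: String) // top level, not inside the @main method

    @main def startDatasetJob(): Unit =
      val spark = SparkSession.builder()
        .appName("Datasets")
        .master("local[*]")
        .getOrCreate()

      // Explicit Kryo encoder avoids TypeTag-based derivation entirely.
      val cars = spark.createDataset(Seq(CarRow("Ford Torino")))(Encoders.kryo[CarRow])
      println(cars.count())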