Questions tagged [spark3]

Use this tag for questions related to Apache Spark 3.0.0 and higher.

This tag is kept separate from the apache-spark tag because Spark 3.x introduces breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
0
votes
1 answer

How to read map in spark3 with java

Dataset<Person> person = spark.read().textFile(path).map(Person::new, Encoders.bean(Person.class)). When I try the above, it works in Spark 2.4 (Scala 2.11), but in Spark 3.1.1 (Scala 2.12) the call is reported as ambiguous for the type Dataset. And also wherever I use…
Anuradha
  • 1
  • 1
0
votes
1 answer

How to get add_months Spark 2 behaviour in Spark 3

We are migrating a huge codebase from Spark 2 to Spark 3.x. To make the migration incremental, some configs were set to legacy to keep the same behaviour as in Spark 2.x. The function add_months, however, AFAIK does not have a "legacy"…
Diego
  • 1
  • 2
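For context on the behaviour change this question is about: as far as I know, the Spark 3.0 migration notes state that add_months no longer snaps month-end inputs to the month end of the result. A minimal plain-Python sketch of the old Spark 2.x semantics (the helper name is hypothetical, not a Spark API):

```python
import calendar
from datetime import date

def add_months_spark2_style(d: date, n: int) -> date:
    """Mimic Spark 2.x add_months: if the input date is the last day of
    its month, snap the result to the last day of the target month."""
    was_month_end = d.day == calendar.monthrange(d.year, d.month)[1]
    # Shift year/month, clamping the day to the target month's length.
    total = (d.year * 12 + d.month - 1) + n
    year, month = divmod(total, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    day = last_day if was_month_end else min(d.day, last_day)
    return date(year, month, day)

# Spark 2.x snapped month-end inputs to the month end of the result:
print(add_months_spark2_style(date(2019, 2, 28), 1))  # 2019-03-31
# Spark 3.x simply keeps the day-of-month (clamped), i.e. 2019-03-28.
```

A wrapper like this could be registered as a UDF to reproduce the old behaviour where a legacy config is not available.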
0
votes
1 answer

Spark Structured Streaming writeStream outputs no data but no error

I have a Structured Streaming job which reads messages from a Kafka topic and then saves them to DBFS. The code is as follows: input_stream = spark.readStream \ .format("kafka") \ .options(**kafka_options) \ .load() \ .transform(create_raw_features) #…
0
votes
1 answer

Jackson databind error with scalatest Flatspec

I was trying to execute the Scala test cases in IntelliJ using Gradle with Spark 3.1.1 & Scala 2.12.13, but the tests were failing with the jackson-databind error below. val conf = new SparkConf().setMaster("local[2]") val spark =…
vamsi
  • 344
  • 5
  • 22
0
votes
1 answer

Need help migrating from Spark 2.0 to Spark 3.1 - Accumulable to AccumulatorV2

I'm working on adding Spark 3.1 and Scala 2.12 support to the Kylo Data-Lake Management Platform. I need help migrating the following functions: /** * Creates an {@link Accumulable} shared variable with a name for display in the Spark…
SaleemKhair
  • 499
  • 3
  • 12
0
votes
0 answers

Spark UDF: Apply np.sum over a list of values in a data frame and filter values based on threshold

Very new to using Spark for data manipulation and UDFs. I have a sample df with different test scores. There are 50 different columns like these. I am trying to define a custom apply function to filter values (total counts in each row) which are…
Hackerds
  • 1,195
  • 2
  • 16
  • 34
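Before wrapping row-level logic like this in a Spark UDF, it can help to prototype it in plain Python. A minimal sketch of the sum-and-threshold step (the column names and threshold value are made-up assumptions, not from the question):

```python
# Row-level logic for a prospective Spark UDF: sum a list of per-test
# scores and keep only rows whose total clears a threshold.
def total_score(scores):
    return sum(s for s in scores if s is not None)  # ignore nulls

rows = [
    {"id": "a", "scores": [10, 20, 30]},
    {"id": "b", "scores": [5, None, 5]},
]
threshold = 25
kept = [r["id"] for r in rows if total_score(r["scores"]) >= threshold]
print(kept)  # ['a']
```

In PySpark the same function could be registered as a UDF, though for a plain sum over array columns the built-in SQL functions are usually faster than a Python UDF.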
0
votes
0 answers

Convert Spark 2.2's UDAF to a 3.0 Aggregator

I have an already-written UDAF in Scala using Spark 2.4. Since our Databricks cluster was on the 6.4 runtime, which is no longer supported, we need to move to 7.3 LTS, which has long-term support and uses Spark 3. UDAF is deprecated in Spark 3 and will…
0
votes
1 answer

spark3 crashes with py4j.protocol.Py4JJavaError

I'm trying to migrate from emr-5.28.0 (Spark 2.4.4) to emr-6.2.0 (Spark 3.0.1), and the most basic usage of Spark crashes no matter what I do. This is my test_pyspark.py file: from pyspark.sql import SparkSession spark =…
Ben Siman
  • 53
  • 2
  • 6
0
votes
1 answer

Elasticsearch plugin for PySpark 3.1.1

I used Elasticsearch Spark 7.12.0 with PySpark 2.4.5 successfully; both reads and writes were perfect. Now that I'm testing the upgrade to Spark 3.1.1, this integration doesn't work anymore. There were no code changes in PySpark between 2.4.5 & 3.1.1. Is there a…
Sahas
  • 3,046
  • 6
  • 32
  • 53
0
votes
3 answers

How to read such a nested multiline json file into a data frame with Spark/Scala

I have the following JSON: { "value":[ {"C1":"val1","C2":"val2"}, {"C1":"val1","C2":"val2"}, {"C1":"val1","C2":"val2"} ] } that I am trying to read like this: spark.read .option("multiLine",…
CoolStraw
  • 5,282
  • 8
  • 42
  • 64
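The JSON in this question is a single object whose "value" key holds the array of records, so a line-oriented reader sees malformed input. A plain-Python sketch of the flattening step itself:

```python
import json

raw = '''{ "value":[ {"C1":"val1","C2":"val2"},
                     {"C1":"val1","C2":"val2"},
                     {"C1":"val1","C2":"val2"} ] }'''

# The whole file is one JSON document: parse it as such, then pull the
# records out of the "value" array.
rows = json.loads(raw)["value"]
print(len(rows), rows[0])  # 3 {'C1': 'val1', 'C2': 'val2'}
```

In Spark the rough equivalent is reading with the multiLine option and then exploding the "value" array column into rows; treat that as a sketch of the approach, not the asker's exact code.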
0
votes
1 answer

Does Spark 3.0.1 support custom Aggregators on window functions?

I wrote a custom Aggregator (an extension of org.apache.spark.sql.expressions.Aggregator) and Spark invokes it correctly as an aggregating function under group by statement: sparkSession .createDataFrame(...) .groupBy(col("id")) .agg( …
igor
  • 33
  • 3
0
votes
1 answer

Pyspark.ml - Error when loading model and Pipeline

I want to import a trained pyspark model (or pipeline) into a pyspark script. I trained a decision tree model like so: from pyspark.ml.classification import DecisionTreeClassifier from pyspark.ml.feature import VectorAssembler from…
FVCC
  • 262
  • 2
  • 16
0
votes
1 answer

Apache spark 3.0 with HDP 2.6 stack

We are planning to set up Apache Spark 3.0 outside of the existing HDP 2.6 cluster and to submit jobs using YARN (v2.7) in that cluster without upgrading or modifying it. Currently users are using Spark 2.3, which is included in the HDP stack. The goal is to…
0
votes
1 answer

Spark 3 is failing when I try to execute a simple query

I have this table on Hive: CREATE TABLE `mydb`.`raw_sales` ( `combustivel` STRING, `regiao` STRING, `estado` STRING, `jan` STRING, `fev` STRING, `mar` STRING, `abr` STRING, `mai` STRING, `jun` STRING, `jul` STRING, `ago` STRING, `set` STRING, `out`…
Andre Carneiro
  • 708
  • 1
  • 5
  • 27
0
votes
1 answer

find set of keys in Scala map where values overlap

I'm working with a map object in Scala where the key is a basket ID and the value is the set of item IDs contained within that basket. The goal is to ingest this map object and compute, for each basket, the set of other basket IDs that contain at least…
tyjchen
  • 5
  • 2
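A plain-Python sketch of the basket-overlap computation described in the last question: an inverted index from item to baskets avoids comparing every pair of baskets directly (the data and names below are illustrative, not from the question):

```python
from collections import defaultdict

baskets = {
    "b1": {"i1", "i2"},
    "b2": {"i2", "i3"},
    "b3": {"i4"},
}

# Invert the map: item -> set of baskets containing that item.
by_item = defaultdict(set)
for basket, items in baskets.items():
    for item in items:
        by_item[item].add(basket)

# For each basket, union the basket sets of its items, minus itself.
overlaps = {
    basket: set().union(*(by_item[i] for i in items)) - {basket}
    for basket, items in baskets.items()
}
print(overlaps["b1"])  # {'b2'}
```

The same inverted-index idea translates to Scala with Map and Set, or to a Spark explode-and-self-join if the map is too large for one machine.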