Questions tagged [spark3]

To be used for Apache Spark 3.x

This tag is for all questions related to Apache Spark 3.0.0 and higher.

This tag is separate from the apache-spark tag because Spark 3 introduced breaking changes.

Apache Spark is a unified analytics engine for large-scale data processing.

80 questions
0 votes
1 answer

Not able to read from s3a path on EMR on EKS with PySpark code from JupyterLab

I am trying to run the following code on a PySpark kernel from EMR on EKS (using a managed endpoint). I tried to set some s3a-related Spark config, but it does not seem to work: from pyspark.sql import SparkSession # Create a SparkSession spark = SparkSession.builder \ …
0 votes
0 answers

Spark throws errors while reading Hive view

Problem statement: A Hive view is created using Beeline to restrict users from accessing the original Hive table, since the data contains sensitive information. For illustration purposes, let's consider a sensitive table, emp_db.employee, with…
0 votes
1 answer

java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel

We are trying to read one specific sheet from an Excel file (.xlsx with 3 sheets) into a Spark DataFrame using org.zuinnote.spark.office.excel. We are using the MSExcelLowFootprintParser parser. The code used is: val hadoopConf = new Configuration() val spark =…
Ashish Mishra
  • 510
  • 4
  • 18
0 votes
1 answer

Get day of week in PySpark 3 using date format

In my old Spark 2.x code, I had the following line: pageviewsDF.groupBy( date_format(col("capturedAt"), "u-E").alias("Day Of Week") ).sum('req') That gives the day of week as 1-Mon, 2-Tue, etc. But now in Spark 3 I get an error that u-E is not recognised…
chintan s
  • 6,170
  • 16
  • 53
  • 86
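A minimal sketch of one possible workaround for the question above, assuming the goal is a label such as 1-Mon with Monday as day 1. It avoids the "u" pattern letter by deriving the number from dayofweek() (which counts Sunday as 1) and the name from the "E" pattern. The function names iso_day_label and day_of_week_column are hypothetical, and the pure-Python helper is included only to show the intended output format.

```python
from datetime import date

# Day names indexed by ISO weekday number minus one (Monday first).
_DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def iso_day_label(d: date) -> str:
    """Pure-Python reference for the target label, e.g. '1-Mon'."""
    n = d.isoweekday()  # Monday = 1 ... Sunday = 7
    return f"{n}-{_DAY_NAMES[n - 1]}"


def day_of_week_column(col_name: str):
    """Build an equivalent PySpark column expression (hypothetical helper).

    PySpark is imported lazily so the helper above works without it.
    """
    from pyspark.sql import functions as F

    c = F.col(col_name)
    # dayofweek() returns 1 for Sunday ... 7 for Saturday;
    # shift it so that Monday = 1 ... Sunday = 7.
    iso_num = (F.dayofweek(c) + 5) % 7 + 1
    return F.concat_ws("-", iso_num.cast("string"), F.date_format(c, "E"))
```

With the DataFrame from the question, this could be used as pageviewsDF.groupBy(day_of_week_column("capturedAt").alias("Day Of Week")).sum("req"). Setting spark.sql.legacy.timeParserPolicy to LEGACY is another option sometimes used to restore the Spark 2 pattern behaviour.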
0 votes
0 answers

I can't read a Kudu table in spark3_2.12

I would like to read a Kudu table in Spark 3 (spark3_2.12), but I can't read it even though I have tried hard. Could you please help me? I tried to use…
0 votes
0 answers

Migrating from Spark 2.4 to Spark 3: How to convert a class that extends SharedSQLContext to use object SparkSession?

Spark 2.4 has the class SharedSQLContext, but it and its related APIs have been removed in Spark 3. The equivalent of SharedSQLContext from Spark 2.4 is the SparkSession object in Spark 3. I'm relatively new to Scala/Java; how do I approach converting…
sojim2
  • 1,245
  • 2
  • 15
  • 38
0 votes
0 answers

Migrating from Spark 2.4 to Spark 3. What's Spark 2.4's SharedSQLContext equivalent in Spark 3?

I'm fairly new to Java/Scala. I'm unable to find SharedSQLContext in the Spark 3 repo. How do we generally find the equivalent class in newer versions? I couldn't find any documentation on this. Thank you! Sample existing class: class…
sojim2
  • 1,245
  • 2
  • 15
  • 38
0 votes
0 answers

Spark: replacing foreach with flatMap not working

I have a Spark 3 application that has a JavaPairRDD and calls the foreach function to iterate through each JavaPairRDD, which works fine. The problem with foreach is that I can't return any value, as it takes a VoidFunction. I changed foreach to…
Programmer
  • 117
  • 2
  • 14
0 votes
0 answers

Kerberos ticket cache in Spark

I'm running a PySpark (Spark 3.1.1) application in cluster mode on a YARN cluster, which is supposed to process input data and send appropriate Kafka messages to a given topic. The data manipulation part is already covered; however, I struggle to use…
benji__
  • 47
  • 3
  • 9
0 votes
0 answers

Spark REST API to list running and stopped queries

I am exploring the Spark REST API for Structured Streaming. I have looked at all the exposed REST APIs listed at the link below. https://spark.apache.org/docs/latest/monitoring.html However, I could not figure out how to get the list of "Active Streaming…
Monu
  • 2,092
  • 3
  • 13
  • 26
0 votes
1 answer

Spark can't connect to DB with built-in connection providers

I'm trying to connect to Postgres by following this document, and the document mentions built-in connection providers. Can anyone help me resolve this, please? The document says: There is a built-in connection providers for the following databases: DB2 MariaDB MS…
MasterLuV
  • 396
  • 1
  • 17
0 votes
1 answer

Date from week date format: 2022-W02-1 (ISO 8601)

Having a date, I create a column with ISO 8601 week date format: from pyspark.sql import functions as F df = spark.createDataFrame([('2019-03-18',), ('2019-12-30',), ('2022-01-03',), ('2022-01-10',)], ['date_col']) df = df.withColumn( …
ZygD
  • 22,092
  • 39
  • 79
  • 102
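A minimal plain-Python sketch of the round trip behind the question above, assuming the week date string has the form YYYY-Www-D as in the title. It uses only the standard library (date.isocalendar and date.fromisocalendar, the latter available from Python 3.8), not the Spark API. The function names to_week_date and from_week_date are hypothetical.

```python
from datetime import date


def to_week_date(d: date) -> str:
    """Format a date as an ISO 8601 week date, e.g. '2022-W02-1'."""
    iso_year, iso_week, iso_day = d.isocalendar()
    return f"{iso_year}-W{iso_week:02d}-{iso_day}"


def from_week_date(s: str) -> date:
    """Parse a 'YYYY-Www-D' string back into a calendar date."""
    year_part, week_part, day_part = s.split("-")
    # week_part looks like 'W02'; drop the leading 'W' before converting.
    return date.fromisocalendar(int(year_part), int(week_part[1:]), int(day_part))


# The sample dates from the question, converted to week dates and back.
samples = [date(2019, 3, 18), date(2019, 12, 30), date(2022, 1, 3), date(2022, 1, 10)]
round_trip = [from_week_date(to_week_date(d)) for d in samples]
```

Note that the ISO week-numbering year can differ from the calendar year: 2019-12-30 falls in week 1 of ISO year 2020. In Spark, logic like this could be wrapped in a UDF, though a built-in expression would usually be preferable where one is available.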
0 votes
1 answer

Create a lookup column in PySpark

I am trying to create a new column in a PySpark DataFrame that "looks up" the next value in the same DataFrame and duplicates it to all following rows until the next event happens. I have used window functions as follows, but still no luck…
0 votes
1 answer

Scala: Parse timestamp using Spark 3.1.2

I have an Excel reader that puts its results into Spark DataFrames. I have problems parsing the timestamps. I have timestamps as strings like Wed Dec 08 10:49:59 CET 2021. I was using spark-sql version 2.4.5, and everything worked fine until I…
Jonas
  • 1,760
  • 1
  • 3
  • 12
0 votes
0 answers

Failed to build Spark 3.2.0 against Hadoop 2.7

I'm trying to build Spark 3.2.0 against Hadoop 2.7, but the build fails. $ git clone -b v3.2.0 https://github.com/apache/spark $ mv spark spark-3.2.0 $ nohup sh -x dev/make-distribution.sh --tgz -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.2 -Phive -Phive-thriftserver…
AppleCEO
  • 63
  • 7