
I am trying to run Delta Lake update, delete, and upsert operations in PySpark on AWS Glue.

I have instantiated the Spark session with the following configs:

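from pyspark.sql import SparkSession

# Register the Delta SQL extension and the Delta catalog on the session.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)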

If I skip the update/delete/upsert operations, merge and insert (which I assume also require the DeltaSparkSessionExtension) work just fine. This makes no sense: why does the update operation throw this error when the merge operation does not?

I have tried to perform the update and delete both through direct transformations and through Spark SQL.

With direct transformations, I get this error:

this delta operation required sparksession to be configured with glue

Note: I have configured the Spark session with all the required Delta dependencies.

With Spark SQL, I am using the following query:

MERGE INTO delta.`s3a://delta-lake-aws-glue-demo/current/` as superstore
USING delta.`s3a://delta-lake-aws-glue-demo/updates_delta/` as updates
ON superstore.row_id = updates.row_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
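
For comparison, the same upsert can be sketched with the Delta Lake Python API instead of SQL. This is only a minimal sketch: it assumes the Delta Python bindings (the `delta.tables` module) are available to the job and that `spark` is a session configured as shown earlier. The helper name `upsert_superstore` and its parameters are hypothetical, introduced here for illustration.

```python
def upsert_superstore(spark, target_path, updates_path):
    """Upsert rows from the Delta table at updates_path into the one at target_path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imported lazily so this module still loads where the Delta bindings are absent.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    updates = spark.read.format("delta").load(updates_path)

    (
        target.alias("superstore")
        .merge(updates.alias("updates"), "superstore.row_id = updates.row_id")
        .whenMatchedUpdateAll()      # counterpart of UPDATE SET *
        .whenNotMatchedInsertAll()   # counterpart of INSERT *
        .execute()
    )
```

With the paths from the query above, the call would look like `upsert_superstore(spark, "s3a://delta-lake-aws-glue-demo/current/", "s3a://delta-lake-aws-glue-demo/updates_delta/")`. If this path-based call fails with the same error as the SQL form, the problem is more likely in the session or jar setup than in the query text.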

This query fails with the following error:

AnalysisException: Table does not support reads: delta.`s3a://delta-lake-aws-glue-demo/current/`

I tested the following jars, with these results:

delta-core_2.11-0.6.1.jar -- deprecated jar
delta-core_2.12-0.8.0.jar -- jar which supports inserts and append.
delta-core_2.12-2.1.0.jar -- An error occurred while calling o103.save. java.lang.NoClassDefFoundError: org/apache/spark/SparkThrowable
delta-core_2.12-1.0.0.jar -- jar which supports inserts and append.
this delta operation required sparksession to be configured with glue
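
A `NoClassDefFoundError` for `org/apache/spark/SparkThrowable` usually points to a Delta jar built for a newer Spark than the one running the job, since each `delta-core` release targets one Spark minor line. The sketch below checks the jars listed above against a given Spark version. The pairings are my reading of the Delta Lake release notes, and `DELTA_TO_SPARK`, `spark_minor`, and `is_compatible` are hypothetical names for illustration, so verify the table against the official compatibility list before relying on it.

```python
# Delta Lake release -> Spark minor line it was built against.
# Assumption: pairings as read from the Delta Lake release notes; verify before use.
DELTA_TO_SPARK = {
    "0.6.1": "2.4",
    "0.8.0": "3.0",
    "1.0.0": "3.1",
    "2.1.0": "3.3",
}


def spark_minor(version):
    """Reduce a full Spark version such as '3.1.1' to its minor line '3.1'."""
    return ".".join(version.split(".")[:2])


def is_compatible(delta_version, spark_version):
    """Return True only if the Delta release is known and targets this Spark line."""
    expected = DELTA_TO_SPARK.get(delta_version)
    return expected is not None and expected == spark_minor(spark_version)


# Report each tested jar version against Spark 3.1.1.
for delta_version in DELTA_TO_SPARK:
    print(delta_version, is_compatible(delta_version, "3.1.1"))
```

Inside a Glue job, `spark.version` returns the running Spark version, which can be passed as the second argument.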

Any help is appreciated. Thanks :)

  • Did you check these blog posts: https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9 (for Glue 2.4) and https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0 (for Glue 3.0)? I haven't worked with the Delta Lake package myself, so please check these examples. – Yuva Oct 17 '22 at 14:41
  • Yes, I went through both links and followed the same steps they describe, but I am still facing issues with upserts and deletes. – Arun Kumar N Oct 18 '22 at 06:17
  • What is the Spark version? – whatsinthename Oct 18 '22 at 06:26
  • Tried with both Spark 2.4 and Spark 3.1. Same issue :( – Arun Kumar N Oct 18 '22 at 07:48
  • Any reason you need to use Glue? The MERGE query you posted here accesses the S3 locations directly, so it doesn't need to use Glue at all. – zsxwing Oct 19 '22 at 05:30
  • I am just trying out a POC on AWS Glue. Since Glue is serverless, I wanted to try it out. – Arun Kumar N Oct 25 '22 at 13:37
