
I'm trying to exclude Glacier data from the input of my Databricks notebook job (Spark). The job reads Parquet data on S3 through the AWS Glue Catalog. I already added excludeStorageClasses to the Glue table properties:

Table Properties: [excludeStorageClasses=[GLACIER], transient_lastDdlTime=1637069663]

but when I read the table, Spark still tries to read the data in Glacier:

spark.sql("SELECT * FROM test_db.users").count()

The error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 163, 172.19.249.237, executor 0): java.io.IOException: Failed to read job commit marker: S3AFileStatus{path=s3:...

Any ideas how I can make this work, or how else to exclude Glacier data from the input of a Spark job?


1 Answer


Try passing excludeStorageClasses as an additional option when reading the table, instead of (or in addition to) the table property:

val additionalOptions = JsonOptions(Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE")))
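Note that this option applies to AWS Glue ETL reads through a GlueContext (DynamicFrames), not to plain spark.sql queries. A minimal sketch of how it might be wired up in a Glue Scala job, assuming the test_db.users table from the question and a standard Glue job setup (untested outside a Glue environment):

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

val glueContext = new GlueContext(new SparkContext())

// Ask Glue to skip S3 objects stored in Glacier or Deep Archive
val users = glueContext.getCatalogSource(
  database = "test_db",
  tableName = "users",
  additionalOptions = JsonOptions(
    Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
  )
).getDynamicFrame()

println(users.count)
```

If you must stay on plain Spark SQL in Databricks (no GlueContext), this option won't help directly; the usual alternatives are restoring the Glacier objects or moving them out of the table's S3 prefix.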