
I'm trying to exclude Glacier data from the input of my Databricks notebook job (Spark). The job reads Parquet data on S3 through the AWS Glue Catalog. I already added excludeStorageClasses to the Glue table properties:

Table Properties: [excludeStorageClasses=[GLACIER], transient_lastDdlTime=1637069663]

but when I read the table, Spark still tries to read the data in Glacier:

spark.sql("SELECT * FROM test_db.users").count()

The error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 163, 172.19.249.237, executor 0): java.io.IOException: Failed to read job commit marker: S3AFileStatus{path=s3:...

Any ideas how I can make this work, or how else to exclude Glacier data from the input of a Spark job?


1 Answer


Try passing excludeStorageClasses as an additional option when reading the table, instead of (or in addition to) the table property:

val additionalOptions = JsonOptions(Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE")))
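Note that this option applies to AWS Glue ETL reads through a GlueContext (DynamicFrames), not to plain spark.sql queries. A minimal sketch of how it might be wired up in a Glue Scala job, assuming the test_db.users table from the question and a standard Glue job setup (untested outside a Glue environment):

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext

val glueContext = new GlueContext(new SparkContext())

// Ask Glue to skip S3 objects stored in Glacier or Deep Archive
val users = glueContext.getCatalogSource(
  database = "test_db",
  tableName = "users",
  additionalOptions = JsonOptions(
    Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
  )
).getDynamicFrame()

println(users.count)
```

If you must stay on plain Spark SQL in Databricks (no GlueContext), this option won't help directly; the usual alternatives are restoring the Glacier objects or moving them out of the table's S3 prefix.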