
New to AWS Glue, so pardon my question: why do I get an error when I don't include a pushdown predicate when creating the dynamic frame? I am trying to use it without the predicate because I will be using bookmarks, so only new files will be processed regardless of the date partition.

datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF")
datasourceDyF.toDF().show(20)

vs

datasourceDyF = gluecontext.create_dynamic_frame.from_catalog(database=db_name, table_name=table1, transformation_ctx="datasourceDyF", push_down_predicate="salesdate = '2020-01-01'")
datasourceDyF.toDF().show(20)

Code 1 gives this error:

py4j.protocol.Py4JJavaError: An error occurred while calling o76.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times,
 most recent failure: Lost task 0.3 in stage 1.0 (TID 4, xxx.xx.xxx.xx, executor 5):
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
marcia12

1 Answer

A pushdown predicate is useful when reading from an RDBMS or a partitioned table: it helps Spark identify which data actually needs to be loaded into memory (there is no point in loading data that is not required by the downstream system). The benefit is that, with less data loaded, execution is much faster than a full table scan.
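As a rough illustration (plain Python, not the Glue API), a pushdown predicate such as `salesdate = '2020-01-01'` can be thought of as selecting the matching partition directories before any file is opened; the partition list and `prune` helper below are hypothetical, made up for the sketch:

```python
# Sketch of partition pruning: the predicate is evaluated against partition
# metadata, so non-matching partitions are never scanned at all.
# The partition entries and paths here are hypothetical examples.
partitions = [
    {"salesdate": "2019-12-31", "path": "s3://bucket/sales/salesdate=2019-12-31/"},
    {"salesdate": "2020-01-01", "path": "s3://bucket/sales/salesdate=2020-01-01/"},
    {"salesdate": "2020-01-02", "path": "s3://bucket/sales/salesdate=2020-01-02/"},
]

def prune(partitions, predicate):
    # Return only the paths whose partition values satisfy the predicate;
    # files under the other paths are never read.
    return [p["path"] for p in partitions if predicate(p)]

paths_to_read = prune(partitions, lambda p: p["salesdate"] == "2020-01-01")
print(paths_to_read)  # only the 2020-01-01 partition survives
```

With the predicate, only one partition's files are read; without it, every partition is scanned, which is why the full-table read is slower.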

Now, in your case, your underlying table is probably partitioned, which is why the pushdown predicate was required.

dsk