
I'm new to AWS Glue and PySpark. Below is a code sample from the guide Managing Partitions for ETL Output in AWS Glue:

    glue_context.create_dynamic_frame.from_catalog(
        database="my_S3_data_set",
        table_name="catalog_data_table",
        push_down_predicate=my_partition_predicate)

Suppose the SQL query I want to use to filter the data is as below:

    select * from catalog_data_table
    where timestamp >= '2018-1-1'

How do I do this pre-filtering in AWS Glue?

— seven (edited by Vadim Kotov)
Comment: https://stackoverflow.com/questions/57925034/aws-push-down-predicate-not-working-when-reading-hive-partitions/70453286#70453286 – vaquar khan Dec 29 '21 at 16:16

1 Answer

Generally speaking, your data should be partitioned, and then you can reference those partition columns in the `push_down_predicate` expression. A push-down predicate can only filter on partition keys; it cannot filter on regular data columns such as a `timestamp` field inside the records, so the table must be partitioned by the values you want to filter on.

Please take a look at this answer.
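For intuition, here is a minimal sketch (plain Python, no AWS connection) of what Glue does with a push-down predicate: it evaluates the predicate string against each partition's key values and only lists and reads the S3 partitions that match. The partition keys `year` and `month` below are assumptions for illustration; substitute whatever partition keys your catalog table actually has.

    # Hypothetical partition layout for a table partitioned by year/month.
    partitions = [
        {"year": "2017", "month": "12"},
        {"year": "2018", "month": "01"},
        {"year": "2018", "month": "06"},
    ]

    # Rough equivalent of: push_down_predicate = "year >= '2018'"
    def matches(p):
        return p["year"] >= "2018"

    # Glue reads only the matching partitions; the rest are never scanned.
    selected = [p for p in partitions if matches(p)]
    # selected -> the two 2018 partitions

The real call would then look something like this (sketch, assuming the database and table names from the question):

    glue_context.create_dynamic_frame.from_catalog(
        database="my_S3_data_set",
        table_name="catalog_data_table",
        push_down_predicate="year >= '2018'")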

— Yuriy Bondaruk