
I'm trying to read just the last partition written to a table in S3 from a Glue job, reading the DynamicFrame with a pushdown predicate.

The table I want to read from gets loaded every day, and therefore a new partition gets created for that daily data.

I have another Glue job that reads from that table, but I only want to read the data written to that last partition. I don't want to read the whole table and then filter for the latest data (large data volume, inefficiency, cost...), since I could use the pushdown predicate instead. The problem is that the value of the last partition changes daily.

I have tried using boto3 to list objects from S3 and the get_partitions function to retrieve the partition values (a rough sketch of that approach is below the query). I also know I can query it in Athena:

SELECT partition_key, max(partition_value)
FROM information_schema.__internal_partitions__
WHERE table_schema = <database name>
  AND table_name = <table name>
GROUP BY 1
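
For reference, the boto3 get_partitions approach I tried looks roughly like this (a sketch only; the database and table names are placeholders, and it assumes zero-padded year/month/day partition values so a plain list comparison finds the latest one):

import boto3

glue = boto3.client("glue")

# Page through the table's partitions and keep the largest set of values
latest = None
paginator = glue.get_paginator("get_partitions")
for page in paginator.paginate(DatabaseName="my_database", TableName="my_table"):
    for partition in page["Partitions"]:
        if latest is None or partition["Values"] > latest:
            latest = partition["Values"]

print(latest)  # e.g. ['2022', '02', '12']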

But is there an easier way to achieve this in a Glue Job?

Thanks

Nico Arbar
  • Did you ever find a solution to this? I'm using Job Bookmarks, but it's still a lot of initial partitions to scan, and if I can dynamically build a pushdown predicate, that will help the job avoid scanning so many partitions when evaluating the bookmark. – Michael Black Feb 12 '22 at 00:55
  • No, I didn't find a better solution to it. I had no choice but to include a date parameter to the job as an argument, so that way it knows which daily partition to read from S3. It seems like an easy solution if you're partitioning based on dates; if not, try using the SDK for Glue and the get_partitions function. – Nico Arbar Feb 13 '22 at 12:17
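
A rough sketch of that date-argument workaround, assuming the table is partitioned as partition_0/partition_1/partition_2 (year/month/day); the --run_date argument name is made up for illustration:

import sys
from awsglue.utils import getResolvedOptions

# The job is started with --run_date=YYYY-MM-DD (argument name is hypothetical)
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'run_date'])
year, month, day = args['run_date'].split('-')

# Pushdown predicate for that single daily partition
partitionPredicate = "partition_0='{}' AND partition_1='{}' AND partition_2='{}'".format(year, month, day)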

1 Answer


This is the solution I came up with: use Boto3 to look up the job bookmark information, and with the latest bookmark information dynamically build a partition predicate that covers yesterday's, today's, and tomorrow's partitions. That way I'm limiting the data scan to only a 3-day range of files. Obviously you can modify it to meet your needs.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import *

from time import strftime
import boto3
from botocore.exceptions import ClientError
from datetime import datetime, timedelta
import json

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Dynamically build the partitionPredicate from the job bookmark
glue_client = boto3.client("glue")
predicate_date = datetime.today()
try:
    response = glue_client.get_job_bookmark(JobName=args['JOB_NAME'])
    JobBookmark = json.loads(response["JobBookmarkEntry"]["JobBookmark"])
    predicate_date = datetime.strptime(JobBookmark["datasource0"]["timestamps"]["CURR_RUN_START_TIME"], '%Y-%m-%dT%H:%M:%S.%fZ')
except ClientError as e:
    # raise Exception("boto3 client error in retrieves_details_of_bookmarked_job: " + e.__str__())
    print("No bookmark available.")
except Exception as e:
    # raise Exception("Unexpected error in retrieves_details_of_bookmarked_job: " + e.__str__())
    print("No bookmark available.")

yesterday = predicate_date + timedelta(days = -1)
tomorrow = predicate_date + timedelta(days = 1)

# Having a predicate greatly helps performance. It can also be used to establish a bookmark
partitionPredicate = (
    "(partition_0='{year}' OR partition_0='{year_2}' OR partition_0='{year_3}') AND "
    "(partition_1='{month}' OR partition_1='{month_2}' OR partition_1='{month_3}') AND "
    "(partition_2='{day}' OR partition_2='{day_2}' OR partition_2='{day_3}')"
).format(
    year=predicate_date.strftime("%Y"), month=predicate_date.strftime("%m"), day=predicate_date.strftime("%d"),
    year_2=tomorrow.strftime("%Y"), month_2=tomorrow.strftime("%m"), day_2=tomorrow.strftime("%d"),
    year_3=yesterday.strftime("%Y"), month_3=yesterday.strftime("%m"), day_3=yesterday.strftime("%d"),
)

print(partitionPredicate)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "datasource0", push_down_predicate = partitionPredicate)
print('Initial Row Count: '+str(datasource0.count()))

job.commit()
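
Note that get_job_bookmark only returns an entry once job bookmarks are enabled for the job (the --job-bookmark-option job-bookmark-enable job parameter); on the first run, or with bookmarks disabled, the call fails and the except branches fall back to today's date.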
Michael Black