I have a @daily_schedule that triggers every day at 00:03 (3 minutes past midnight).

When triggered by the scheduled tick at '2021-02-16 00:03:00'

The date input shows '2021-02-15 00:00:00', partition tagged as '2021-02-15'


Whereas when triggered via a backfill for partition '2021-02-16'

The date input shows '2021-02-16 00:00:00', partition tagged as '2021-02-16'


Why does the scheduled tick fill the partition for the previous day? Is there an option to use the datetime of execution instead (without falling back to a cron-based @schedule)? This discrepancy is confusing when I run queries that use the timestamp for exact dates.

P.S. I have verified that the scheduled run and the backfill run use the same timezone.


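from datetime import time

from dagster import daily_schedule, pipeline, solid

@solid()
def test_solid(_, date):
    _.log.info(f"Input date: {date}")

@pipeline()
def test_pipeline():
    test_solid()

# START_DATE and END_DATE are assumed to be datetimes defined elsewhere.
@daily_schedule(
    pipeline_name="test_pipeline",
    execution_timezone="Asia/Singapore",
    start_date=START_DATE,
    end_date=END_DATE,
    execution_time=time(0, 3),  # 00:03
    # should_execute=four_hourly_fitler
)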
def test_schedule_daily(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid":{
                "inputs": {
                    "date":{
                        "value": timestamp
                    }
                }
            }
        }
    }

Isaac

2 Answers


Sorry for the trouble here. The underlying assumption the system makes is that, for schedules on pipelines that are partitioned by date, you don't fill in the partition for a day until that day has finished (i.e. the job filling in the data for 2/15 wouldn't run until the next day, on 2/16). This is a common pattern in scheduled ETL jobs, but you're completely right that not all schedules will want this behavior, and this is good feedback that we should make this use case easier.

It is possible to make a schedule for a partition in the way that you want, but it's more cumbersome. It would look something like this:


from dagster import PartitionSetDefinition, date_partition_range, create_offset_partition_selector

def partition_run_config(date):
    timestamp = date.strftime("%Y-%m-%d %X")
    return {
        "solids": {
            "test_solid":{
                "inputs": {
                    "date":{
                        "value": timestamp
                    }
                }
            }
        }
    }

test_partition_set = PartitionSetDefinition(
    name="test_partition_set",
    pipeline_name="test_pipeline",
    partition_fn=date_partition_range(start=START_DATE, end=END_DATE, inclusive=True, timezone="Asia/Singapore"),
    run_config_fn_for_partition=partition_run_config,
)

test_schedule_daily = (
    test_partition_set.create_schedule_definition(
        "test_schedule_daily",
        "3 0 * * *",
        execution_timezone="Asia/Singapore",
        partition_selector=create_offset_partition_selector(lambda d: d.subtract(minutes=3)),
    )
)

This is pretty similar to @daily_schedule's implementation; it just uses a different function for mapping the schedule execution time to a partition. It subtracts 3 minutes instead of 3 minutes and 1 day (that's the create_offset_partition_selector part).
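
To make the difference between the two offsets concrete, here is a minimal sketch using only the Python standard library. It is not Dagster's actual implementation: make_partition_name_fn, previous_day, same_day and DATE_FORMAT are hypothetical names chosen for illustration, and the partition-name format is assumed to match the '2021-02-15' style tags shown in the question.

```python
from datetime import datetime, timedelta

# Assumed partition-name format, matching the tags in the question.
DATE_FORMAT = "%Y-%m-%d"


def make_partition_name_fn(offset_fn):
    """Return a function that maps a tick time to a partition name via offset_fn."""

    def partition_name_for(tick):
        return offset_fn(tick).strftime(DATE_FORMAT)

    return partition_name_for


# Mapping described for the default behavior: step back the 3-minute
# execution offset plus one full day.
previous_day = make_partition_name_fn(lambda d: d - timedelta(days=1, minutes=3))

# Mapping used in the custom schedule above: step back only the 3-minute
# execution offset, so the partition day equals the execution day.
same_day = make_partition_name_fn(lambda d: d - timedelta(minutes=3))

tick = datetime(2021, 2, 16, 0, 3)
print(previous_day(tick))  # 2021-02-15
print(same_day(tick))      # 2021-02-16
```

With the tick from the question ('2021-02-16 00:03:00'), the first mapping lands on the 2021-02-15 partition, which is what the scheduled run showed, while the second lands on 2021-02-16, which is what the backfill showed.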

I'll file an issue for an option to customize the mapping for the partitioned schedule decorators, but something like that may unblock you in the meantime. Thanks for the feedback!

  • Filed https://github.com/dagster-io/dagster/issues/3691 to track this. – Daniel Gibson Feb 16 '21 at 16:45
  • Thanks for the detailed response! Yes you are right that this makes total sense adhering to ETL best practices regarding scheduling. A quick mention on this in the Docs perhaps would be helpful? Nonetheless provision to tweak the delta integer would be really useful too (as you mentioned in the issue) in an open-ended use case with scheduling. Much appreciated! – Isaac Feb 16 '21 at 17:21

Just an update on this: We added a 'partition_days_offset' parameter to the 'daily_schedule' decorator (and a similar parameter to the other schedule decorators) that lets you customize this behavior. The default is still to go back 1 day, but setting partition_days_offset=0 will give you the behavior you were hoping for where the execution day is the same as the partition day. This should be live in our next weekly release on 2/18.